Preface to the Second Edition


Microeconometrics Using Stata, published in December 2008, was written 
for Stata 10.1. Microeconometrics Using Stata, Revised Edition, published in 
January 2010, was written for Stata 11.0. This second edition is written for 
Stata 17. 


Whereas the scope and coverage of the preceding editions were 
reasonably synchronized with our own Microeconometrics: Methods and 
Applications (Cambridge, 2005), this second edition has broader scope in 
several respects. We have attempted not only to update our previous 
coverage to bring it in line with newer tools in the latest edition of Stata but 
also to bring into the book many topics and methods that are now actively 
studied and increasingly used in applied microeconometrics. This coverage 
includes several topics, listed below, that were not covered in our 2005 text. 


This second edition covers over ten years of both enhancements to Stata 
and developments in the methods most commonly used in empirical 
microeconometrics analysis. The focus of the book remains the use of linear 
and nonlinear regression methods for cross-sectional and short panel data. In 
particular, we give only short treatment to other features of Stata that are 
useful for data analysis such as data management, use within Stata of other 
programming languages such as Python, and automated document 
preparation. The new edition is much expanded and is split into two 
volumes. 


The first volume, comprising chapters 1-15 and Stata and Mata 
appendixes, focuses on the linear regression model and provides a brief 
introduction to nonlinear regression models. This volume is an expanded 
version of chapters 1-10, 12-13, and the appendixes of the first and revised
editions. In places, there is greater explanation of underlying methods, and 
much of the first volume is intended to be suitable for an advanced 
undergraduate course in addition to serving graduate students and 
researchers. 


The second volume, comprising chapters 16-30, covers the standard
nonlinear models as well as more advanced and more recent material. In
addition to updated versions of chapters 14-18 of the first edition and the
revised edition, the second volume includes new chapters on duration 
models, treatment effects in randomized control trials, treatment effects with 
endogenous treatments, parametric models for endogeneity and 
heterogeneity, spatial regression, semiparametric regression, machine 
learning and prediction, and Bayesian methods. 


Some methods we cover are well established. Other methods we present
are in areas of active research, so they may be superseded by better
methods. In particular, many methods for causal analysis using observational
or experimental methods are still being established and improved upon, at a
remarkably rapid pace. This includes inference for instrumental variables
with weak instruments, cluster-robust inference with few clusters,
treatment-effects estimation with heterogeneous treatment effects, regression
discontinuity design, and causal analysis using machine learning methods.
Accordingly, we plan to periodically add some supplementary material on
the book’s website (http://cameron.econ.ucdavis.edu/mus2).


Our target user base consists of practitioners of applied 
microeconometrics. This group is quite diverse in terms of familiarity with 
the available econometric tools. In deference to such diversity, we have 
chosen to separate the more advanced aspects of many topics and place them 
in different parts of the book. This is a challenging task because often the 
same material could, and in some cases should, appear in several alternative 
places. To assist the reader, we have provided numerous cross-references and 
a much lengthier subject index. The reader will benefit from checking out 
these connections. 


Datasets and the do-files used in this book are available on the Stata
Press website at https://www.stata-press.com/data/mus2.html. Any
corrections to the book will be documented at
https://www.stata-press.com/books/microeconometrics-stata/.


The preparation of this second edition has benefited from generous help 
from many sources. We thank our colleagues, coauthors, students, and many 
users of the previous editions for their suggested improvements, for reading
parts of the book, for permission to use datasets developed in joint research,
and for encouragement to proceed with the project. We have benefited from 
presenting some of the material in various short courses around the world 
and from positive feedback from readers of the earlier editions that 
encouraged writing this updated edition. Colin Cameron would especially 
like to thank Shu Shen, Takuya Ura, Oscar Jorda, Marianne Bitler, the 
broader econometrics and empirical microeconomics community at the 
University of California-Davis, and Doug Miller and Adrian Pagan. Pravin
Trivedi gratefully acknowledges the support provided by the School of 
Economics, University of Queensland. We thank Yulia Marchenko and 
Nikolay Balov for very detailed comments on the Bayesian chapters, and 
Kristin MacDonald for a careful reading of the final draft of the book. We 
thank David Culwell for his excellent editing and Stephanie White for 
managing the LaTeX formatting and production of this book. Most 
especially, both authors acknowledge their debt of gratitude to David 
Drukker for extensive feedback on many aspects of the material in the book 
throughout this project, including a complete reading, as well as feedback on 
the substantive aspects of applying the econometric and statistical tools. 
Finally, we thank our respective families for their patience and 
understanding during the long gestation period of the evolution of this 
project. 


Davis, CA A. Colin Cameron 
Charlottesville, VA Pravin K. Trivedi 
June 2022 


Preface to the First Edition 


This book explains how an econometrics computer package, Stata, can be 
used to perform regression analysis of cross-section and panel data. The 
term microeconometrics is used in the book title because the applications 
are to economics-related data and because the coverage includes methods 
such as instrumental-variables regression that are emphasized more in 
economics than in some other areas of applied statistics. However, many 
issues, models, and methodologies discussed in this book are also relevant 
to other social sciences. 


The main audience is graduate students and researchers. For them, this 
book can be used as an adjunct to our own Microeconometrics: Methods 
and Applications (Cameron and Trivedi 2005), as well as to other graduate- 
level texts such as Greene (2008) and Wooldridge (2010). By comparison to 
these books, we present little theory and instead emphasize practical aspects 
of implementation using Stata. More advanced topics we cover include 
quantile regression, weak instruments, nonlinear optimization, bootstrap 
methods, nonlinear panel-data methods, and Stata’s matrix programming 
language, Mata. 


At the same time, the book provides introductions to topics such as 
ordinary least-squares regression, instrumental-variables estimation, and 
logit and probit models so that it is suitable for use in an undergraduate 
econometrics class, as a complement to an appropriate undergraduate-level 
text. The following table suggests sections of the book for an introductory 
class, with the caveat that in places formulas are provided using matrix 
algebra. 


Stata basics                                      Chapter 1.1-1.4
Data management                                   Chapter 2.1-2.4, 2.6
OLS                                               Chapter 3.1-3.6
Simulation                                        Chapter 4.6-4.7
Generalized least squares (heteroskedasticity)    Chapter 5.3
Instrumental variables                            Chapter 6.2-6.3
Linear panel data                                 Chapter 8
Logit and probit models                           Chapter 14.1-14.4
Tobit model                                       Chapter 16.1-16.3


Although we provide considerable detail on Stata, the treatment is by no 
means complete. In particular, we introduce various Stata commands but 
avoid detailed listing and description of commands as they are already well 
documented in the Stata manuals and online help. Typically, we provide a 
pointer and a brief discussion and often an example. 


As much as possible, we provide template code that can be adapted to 
other problems. Keep in mind that to shorten output for this book, our 
examples use many fewer regressors than necessary for serious research. 
Our code often suppresses intermediate output that is important in actual 
research, because of extensive use of command quietly and options nolog, 
nodots, and noheader. And we minimize the use of graphs compared with 
typical use in exploratory data analysis. 


We have used Stata 10, including Stata updates. Instructions on how to 
obtain the datasets and the do-files used in this book are available on the 
Stata Press website at https://www.stata-press.com/data/mus.html. Any 
corrections to the book will be documented at
https://www.stata-press.com/books/mus.html.


We have learned a lot of econometrics, in addition to learning Stata, 
during this project. Indeed, we feel strongly that an effective learning tool 
for econometrics is hands-on learning by opening a Stata dataset and seeing 
the effect of using different methods and variations on the methods, such as 
using robust standard errors rather than default standard errors. This method 
is beneficial at all levels of ability in econometrics. Indeed, an efficient way
of familiarizing yourself with Stata’s leading features might be to execute
the commands in a relevant chapter on your own dataset.


We thank the many people who have assisted us in preparing this book. 
The project grew out of our 2005 book, and we thank Scott Parris for his 
expert handling of that book. Juan Du, Qian Li, and Abhijit Ramalingam
carefully read many of the book chapters. Discussions with John Daniels, 
Oscar Jorda, Guido Kuersteiner, and Doug Miller were particularly helpful. 
We thank Deirdre Patterson for her excellent editing and Lisa Gilmore for 
managing the LaTeX formatting and production of this book. Most 
especially, we thank David Drukker for his extensive input and 
encouragement at all stages of this project, including a thorough reading 
and critique of the final draft, which led to many improvements in both the 
econometrics and Stata components of this book. Finally, we thank our 
respective families for making the inevitable sacrifices as we worked to 
bring this multiyear project to completion. 


Davis, CA A. Colin Cameron 
Bloomington, IN Pravin K. Trivedi 
October 2008 


1. To see whether you have the latest update, type update query. For those with
earlier versions of Stata, some key changes are the following: Stata 9 introduced the
matrix programming language, Mata. The syntax for Stata 10 uses the vce(robust)
option rather than the robust option to obtain robust standard errors. A mid-2008
update of version 10 introduced new random-number functions, such as runiform()
and rnormal().


Chapter 1 
Stata basics 


This chapter provides some of the basic information about issuing 
commands in Stata. Sections 1.1-1.3 enable a first-time user to begin using 
Stata interactively. In this book, we instead emphasize storing these 
commands in a text file, called a Stata do-file, that is then executed. This is 
presented in section 1.4. Sections 1.5-1.8 present more advanced Stata 
material that might be skipped on a first reading. 


The chapter concludes with a summary of some commonly used Stata 
commands and with a template do-file that demonstrates many of the tools 
introduced in this chapter. Chapters 2 and 3 then demonstrate many of the 
Stata commands and tools used in applied microeconometrics. Additional 
features of Stata are introduced throughout the book and the appendixes. 


1.1 Interactive use 


Interactive use means that Stata commands are initiated from within Stata. 


A graphical user interface (GUI) for Stata is available. This enables 
almost all Stata commands to be selected from drop-down menus. 
Interactive use is then especially easy because there is no need to know in 
advance the Stata command. 


All implementations of Stata allow commands to be directly typed in; 
for example, entering summarize yields summary statistics for the current 
dataset. This is the primary way that Stata is used because it is considerably 
faster than working through drop-down menus. Furthermore, for most 
analyses, the standard procedure is to aggregate the various commands 
needed into one file called a do-file (see section 1.4) that can be run with or 
without interactive use. We therefore provide little detail on the Stata GUI. 


For new Stata users, we suggest entering Stata, usually by clicking on 
the Stata icon, opening one of the Stata example datasets, and doing some 
basic statistical analysis. To obtain example data, select File > Example 
datasets..., meaning from the File menu, select the entry Example 
datasets.... Then, click on the link to Example datasets installed with 
Stata. Work with auto.dta; this is used in many of the introductory 
examples presented in the Stata documentation. First, select describe to 
obtain descriptions of the variables in the dataset. Second, select use to read 
the dataset into Stata. You can then obtain summary statistics either by 
typing summarize in the Command window or by selecting Statistics > 
Summaries, tables, and tests > Summary and descriptive statistics > 
Summary statistics. You can run a simple regression by typing 
regress mpg weight or by selecting Statistics > Linear models and 
related > Linear regression and then using the drop-down lists in the 
Model tab to choose mpg as the dependent variable and weight as the 
independent variable. 


The Stata manual [GSM] Getting Started with Stata for Mac, or its Unix
and Windows versions [GSU] Getting Started with Stata for Unix and [GSW]
Getting Started with Stata for Windows, is very helpful, especially
[GS] 1 Introducing Stata—sample session, which uses typed-in
commands, and [GS] 2 The Stata user interface.


The extent to which you use Stata in interactive mode is really a 
personal preference. There are several reasons for at least occasionally 
using interactive mode. First, it can be useful for learning how to use Stata. 
Second, it can be useful for exploratory analysis of datasets because you 
can see in real time the effect of, for example, adding or dropping 
regressors. If you do this, however, be sure to first start a session log file 
(see section 1.4) that saves the commands and resulting output. Third, you 
can use help and related commands to obtain online information about 
Stata commands. Fourth, one way to implement the preferred method of 
running do-files is to use the Stata Do-file Editor in interactive mode. 
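

As a minimal sketch of the second point, the following commands record an
exploratory session in a plain-text log. The filename mysession.log is a
hypothetical choice of ours, and log files are treated in more detail in
section 1.4.

```stata
* Close any log that is already open (no error if none is open)
capture log close
* Start a plain-text log; mysession.log is a hypothetical filename
log using mysession.log, text replace
sysuse auto, clear
* Compare results before and after adding a regressor
regress mpg weight
regress mpg weight price
* Stop logging
log close
```

Everything typed between log using and log close, together with the
resulting output, is saved in the log file.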


Finally, components of a given version of Stata, such as version 17, are 
periodically updated. Entering update determines the current update level 
and provides the option to install official updates to Stata. You can also 
install community-contributed commands in interactive mode once the 
relevant software is located by using, for example, the search command. 


1.2 Documentation 


Stata documentation is extensive; you can find it in Stata (online) or on the 
web. 


1.2.1 Stata manuals 


For first-time users, see [GSM] Getting Started with Stata for Mac or [GSU]
Getting Started with Stata for Unix or [GSW] Getting Started with Stata for
Windows. The most useful manual is [U] User's Guide. Entries within
manuals are referred to using shorthand such as [U] 11.1.4 in range, which
denotes section 11.1.4 of [U] User's Guide on the topic in range.


Many commands are described in [R] Base Reference Manual. Not all 
Stata commands appear here, however, because some appear instead in the 
appropriate topical reference manual. These topical reference manuals are 
[BAYES] Bayesian Analysis, [CM] Choice Models, [D] Data Management,
[DSGE] Dynamic Stochastic General Equilibrium Models, [ERM] Extended
Regression Models, [FMM] Finite Mixture Models, [FN] Functions, [G]
Graphics, [IRT] Item Response Theory, [LASSO] Lasso, [M] Mata, [META]
Meta-Analysis, [ME] Multilevel Mixed Effects, [MI] Multiple-Imputation,
[MV] Multivariate Statistics, [P] Programming, [PSS] Power, Precision, and
Sample Size, [RPT] Reporting, [SEM] Structural Equation Modeling, [SP]
Spatial Autoregressive Models, [ST] Survival Analysis, [SVY] Survey Data,
[TE] Treatment-Effects, [TABLES] Customizable Tables and Collected
Results, [TS] Time-Series, and [XT] Longitudinal-Data/Panel-Data. For
example, the generate command appears in [D] generate rather than in [R].


For a complete list of documentation, see [U] 1 Read this—it will help 
and also [I] Index.


1.2.2 Additional Stata resources


The Stata Journal (SJ) and its predecessor, the Stata Technical Bulletin
(STB), present examples and code that go beyond the current installation of
Stata. SJ articles over three years old and all STB articles are available online
from the Stata website at no charge. You can find this material by using
various Stata help commands given later in this section, and you can often
install code as a free community-contributed command.


The Stata website has a lot of information. This includes a summary of 
what Stata does. A good place to begin is https://www.stata.com/support/. 
In particular, see the answers to frequently asked questions (FAQ) and video 
tutorials. The blog posts at https://blog.stata.com include many useful 
detailed Stata examples. 


The University of California-Los Angeles website 
https://stats.oarc.ucla.edu/stata/ provides many introductory tutorials. 


1.2.3 The help command 


Stata has extensive help available once you are in the program. 


The help command is most useful if you already know the name of the 
command for which you need help. For example, for help on the regress 
command, type 


. help regress 


(output omitted ) 


Note that here and elsewhere the dot (.) is not typed in but is provided 
to enable distinction between Stata commands (preceded by a dot) and 
subsequent Stata output, which appears with no dot. 


The help command is also useful if you know the class of commands 
for which you need help. For example, for help on functions, type 


. help function 


(output omitted ) 


Often, however, you need to start with the basic help contents 
command, which will open the Viewer window shown in figure 1.1. 


. help contents 


Figure 1.1. Basic help contents


For further details, click on a category and subsequent subcategories.


For help with the Stata matrix programming language, Mata, add the 
term mata after help. Often, for Mata, it is necessary to start with the very 
broad command 


. help mata 
(output omitted ) 


and then narrow the results by selecting the appropriate categories and 
subcategories. 


1.2.4 The search and net search commands 


The search command performs a keyword search of Stata's database and of
the Internet. It is especially useful if you do not know the Stata command
name or if you want to find the many places that a command or method might
be used. The default for search is to obtain all available information from
official Stata help files, books, blogs, FAQ, examples, the SJ, and the STB, as
well as Stata code from some Internet sources, notably the Statistical
Software Components Archive maintained by the Boston College Department of
Economics. For example, for ordinary least squares (OLS), the command


. search ols 


(output omitted ) 


finds references in the manuals [R], [BAYES], [MV], and [SVY]; in books; in
blogs; in FAQ; in examples; and in the SJ and the STB. It also finds many
packages on the web and gives help commands that you can click on to get
further information without the need to consult the manuals.


The default of the search command is to find entries that match all the 
keywords provided. To instead find occurrence of any of the keywords 
provided, use the or option. For example, typing 


. search weak instr, or 


(output omitted ) 


finds entries that contain any word beginning with the letters “weak” or any
word beginning with the letters “instr”.


The search command has several other options, including the net
option, which searches only the Internet for installable packages, including
code from the SJ, the STB, and the Statistical Software Components Archive.


1.3 Command syntax and operators 


Stata command syntax describes the rules of the Stata programming 
language. 


1.3.1 Basic command syntax 


The basic command syntax is almost always some subset of 


[prefix:] command [varlist] [= exp] [if] [in] [weight] [using filename]
[, options]


The brackets denote qualifiers that in most instances are optional. Words 
in the typewriter font are to be typed into Stata as they appear on the page. 
Italicized words are to be substituted by the user, where 


• prefix denotes a command that repeats execution of command or
  modifies the input or output of command,

• command denotes a Stata command,

• varlist denotes a list of variable names,

• exp is a mathematical expression,

• if identifies observations via an expression,

• in denotes a range of observations,

• weight denotes a weighting expression,

• filename is a filename, and

• options denotes one or more options that apply to command.


The greatest variation across commands is in the available options. 
Commands can have many options, and these options can also have options, 
which are given in parentheses. 
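

As a rough illustration of these syntax elements, the sketch below uses one
element at a time with the auto.dta dataset installed with Stata. The
variable name gpm and the filename mycopy are hypothetical names that we
chose for this example.

```stata
sysuse auto, clear
* command [varlist]
summarize mpg price
* command [varlist] [if]
summarize mpg if foreign == 1
* command [varlist] [in]
list make mpg in 1/5
* command [varlist] [weight]
summarize mpg [aweight = weight]
* command newvar [= exp]
generate gpm = 1/mpg
* command [varlist] [, options], including an option with its own argument
summarize price, detail
tabulate rep78, summarize(mpg)
* command [using filename]
save mycopy, replace
describe using mycopy
* [prefix:] command
bysort foreign: summarize mpg
```

Not every command accepts every element; the syntax diagram in the help
file for a command shows which ones are allowed.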


Stata is case sensitive. We generally use lowercase throughout, though 
occasionally we use uppercase for model names. 
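

The following sketch shows what case sensitivity means in practice. It uses
the capture prefix, which traps an error and stores a return code in _rc
rather than halting execution; a nonzero return code indicates that the
command failed.

```stata
sysuse auto, clear
* Lowercase command and lowercase variable name: runs normally
summarize mpg
* Uppercase variable name: no variable MPG exists in auto.dta
capture summarize MPG
display "return code for summarize MPG: " _rc
* Capitalized command name: Summarize is not a Stata command
capture Summarize mpg
display "return code for Summarize mpg: " _rc
```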


Commands and output are displayed following the style for Stata
manuals. For Stata commands given in the text, the typewriter font is used.
For example, for OLS, we use the regress command. For displayed
commands and output, the commands have the prefix . (a period followed
by a space), whereas output has no prefix. For Mata commands, the
displayed prefix is a colon (:) rather than a period. Commands that span
more than one line have the continuation prefix > (greater-than sign) in the
output. For a Stata or Mata program, the lines within the program do not
have a prefix.


1.3.2 Example: The summarize command 


The summarize command provides descriptive statistics (for example, mean, 
standard deviation) for one or more variables. 


You can obtain the syntax of summarize by typing help summarize. This 
yields output including 


summarize [varlist] [if] [in] [weight] [, options]


It follows that, at the minimum, we can give the command without any 
qualifiers. Unlike some commands, summarize does not use [= exp] or 
[using filename]. 


As an example, we use a commonly used, illustrative dataset installed 
with Stata called auto.dta, which has information on various attributes of
74 automobiles. You can read this dataset into memory by using the sysuse 
command, which accesses Stata-installed datasets. To read in the data and 
obtain descriptive statistics, we type 


. * Use a dataset that was installed with Stata
. sysuse auto
(1978 automobile data)

. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        make |          0
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
       rep78 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5
-------------+---------------------------------------------------------
       trunk |         74    13.75676    4.277404          5         23
      weight |         74    3019.459    777.1936       1760       4840
      length |         74    187.9324    22.26634        142        233
        turn |         74    39.64865    4.399354         31         51
displacement |         74    197.2973    91.83722         79        425
-------------+---------------------------------------------------------
  gear_ratio |         74    3.014865    .4562871       2.19       3.89
     foreign |         74    .2972973    .4601885          0          1


The dataset comprises 12 variables for 74 automobiles. The average price of
the automobiles is $6,165, and the standard deviation is $2,949. The column
Obs gives the number of observations for which data are available for each
variable. The make variable has 0 observations because it is a string (or text)
variable giving the make of the automobile, and summary statistics are not
applicable to a nonnumeric variable. The rep78 variable is available for only
69 of the 74 observations.
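

One possible way to confirm these two points directly, using only standard
Stata commands, is sketched below.

```stata
sysuse auto, clear
* Storage types: make is stored as a string, rep78 as a number
describe make rep78
* Number of observations for which rep78 is missing
count if missing(rep78)
* Missing-value counts for all numeric variables that have any
misstable summarize
```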


A more focused use of summarize restricts attention to selected variables 
and uses one or more of the available options. For example, 


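. * Summary statistics of selected variables
. summarize mpg price weight, separator(1)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         74     21.2973    5.785503         12         41
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906
-------------+---------------------------------------------------------
      weight |         74    3019.459    777.1936       1760       4840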


provides descriptive statistics for the mpg, price, and weight variables. The
option separator(1) inserts a line between the output for each variable.
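

A further sketch, going slightly beyond the example above, uses the detail
option of summarize and then reuses the results that summarize leaves
behind in r(). The variable name zprice is a hypothetical name chosen for
this illustration.

```stata
sysuse auto, clear
* Additional statistics, including percentiles, skewness, and kurtosis
summarize price, detail
* List the stored results, and display two of them
return list
display "mean = " r(mean) "   median = " r(p50)
* Use the stored mean and standard deviation to standardize price
quietly summarize price
generate zprice = (price - r(mean))/r(sd)
summarize zprice
```

The stored results are overwritten by the next command that saves results
in r(), so they should be used or copied immediately.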


1.3.3 Example: The regress command 


The regress command implements OLS regression. 


You can obtain the syntax of regress by typing help regress. This
yields output including


regress depvar [indepvars] [if] [in] [weight] [, options]


It follows that, at the minimum, we need to include the variable name for the
dependent variable (in that case, the regression is on an intercept only).
Although not explicitly stated, prefixes can be used. Many estimation
commands have similar syntax.


Suppose that we want to run an OLS regression of the mpg variable (fuel 
economy in miles per gallon) on price (auto price in dollars) and weight 
(weight in pounds). The basic command is simply 


. * OLS regression 
. regress mpg price weight 


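      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     66.85
       Model |  1595.93249         2  797.966246   Prob > F        =    0.0000
    Residual |  847.526967        71  11.9369995   R-squared       =    0.6531
-------------+----------------------------------   Adj R-squared   =    0.6434
       Total |  2443.45946        73  33.4720474   Root MSE        =     3.455

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.0000935   .0001627    -0.57   0.567     -.000418    .0002309
      weight |  -.0058175   .0006175    -9.42   0.000    -.0070489   -.0045862
       _cons |   39.43966   1.621563    24.32   0.000     36.20635    42.67296
------------------------------------------------------------------------------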


The coefficient of -0.0058175 for weight implies that, holding price constant,
fuel economy falls by 5.8 miles per gallon when the car’s weight increases by
1,000 pounds.
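

The arithmetic behind this interpretation can be reproduced from the stored
coefficient. The sketch below is one way to do so; lincom additionally
reports a standard error and confidence interval for the rescaled effect.

```stata
sysuse auto, clear
quietly regress mpg price weight
* Change in predicted mpg for a 1,000-pound increase in weight
display 1000*_b[weight]
* The same quantity with a standard error and confidence interval
lincom 1000*weight
```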


A more complicated version of regress that demonstrates much of the
command syntax is the following:


. by foreign: regress mpg price weight if weight < 4000, vce(robust) 


(output omitted ) 


For each value of the foreign variable, here either 0 or 1, this command fits 
distinct OLS regressions of mpg on price and weight. The if qualifier limits 
the sample to cars with weight less than 4,000 pounds. The vce(robust)
option leads to heteroskedasticity-robust standard errors being used.


The by prefix is an example of a command prefix that repeats execution 
of the subsequent command and is usually followed by a colon. The 
command help prefix lists the available command prefixes, including by, 
bysort, bayes, bootstrap, simulate, svy, and quietly. 
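

To make the action of the by prefix concrete, the sketch below runs the
grouped regression and then the same two regressions one group at a time.
We use bysort here because by requires the data to be sorted by the
grouping variable, and bysort performs that sort first.

```stata
sysuse auto, clear
* Sort by foreign, then run the regression separately for each group
bysort foreign: regress mpg price weight if weight < 4000, vce(robust)
* The same two regressions, run one group at a time
regress mpg price weight if foreign == 0 & weight < 4000, vce(robust)
regress mpg price weight if foreign == 1 & weight < 4000, vce(robust)
```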


Output from commands is not always desired. We can suppress output by 
using the quietly prefix. For example, 


. * Suppress output from command 
. quietly regress mpg price weight 


The quietly prefix does not require a colon, for historical reasons, even 
though it is a command prefix. In this book, we use this prefix extensively to 
suppress extraneous output, abbreviated to qui to enable more commands to 
fit in one line. 
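

Suppressing output does not discard the results. As a brief sketch, the
estimates remain available in e() after a quiet regression and can be
displayed selectively or replayed in full.

```stata
sysuse auto, clear
quietly regress mpg price weight
* Nothing was printed, but the results are stored in e()
display "N = " e(N) "   R-squared = " %6.4f e(r2)
* Typing regress with no arguments replays the most recent estimates
regress
```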


The preceding examples used one of the available options for regress. 
From help regress, we find that the regress command has the following 
options: noconstant, hascons, tsscons, vce(vcetype), level(#), beta,
eform(string), depname(varname), display_options, noheader, notable,
plus, mse1, and coeflegend.
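

A few of these options are illustrated in the sketch below; the particular
combinations are our own choices for illustration only.

```stata
sysuse auto, clear
* 90% confidence intervals, with the header suppressed
regress mpg price weight, level(90) noheader
* Standardized (beta) coefficients in place of confidence intervals
regress mpg price weight, beta
* Regression without an intercept
regress mpg price weight, noconstant
* Several options can follow the single comma
regress mpg price weight, vce(robust) level(90) noheader
```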


1.3.4 Factor variables 


Factor variables enable reference to a set of indicator variables based on a
(nonnegative and integer-valued) categorical variable by inserting the
i. operator in front of the name of the categorical variable. Factor variables
can be used in the variable list of most Stata commands.


As an example, consider the variable rep78, the repair record in 1978. 
This takes five distinct values that are 1, 2, 3, 4, and 5, though any other 
nonnegative integer values will do. Additionally, variable rep78 is missing 
for five observations. We have 


. * Factor variables for rep78
. summarize i.rep78

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       rep78 |
          1  |         69    .0289855    .1689948          0          1
          2  |         69     .115942    .3225009          0          1
          3  |         69    .4347826    .4993602          0          1
          4  |         69    .2608696    .4423259          0          1
          5  |         69    .1594203    .3687494          0          1


We can also include factor variables in regression commands.


. * Factor variables for rep78 in regression
. regress mpg i.rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(4, 64)        =      4.91
       Model |  549.415777         4  137.353944   Prob > F        =    0.0016
    Residual |  1790.78712        64  27.9810488   R-squared       =    0.2348
-------------+----------------------------------   Adj R-squared   =    0.1869
       Total |   2340.2029        68  34.4147485   Root MSE        =    5.2897

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       rep78 |
          2  |     -1.875   4.181884    -0.45   0.655    -10.22927    6.479274
          3  |  -1.566667   3.863059    -0.41   0.686    -9.284014    6.150681
          4  |   .6666667   3.942718     0.17   0.866    -7.209818    8.543152
          5  |   6.363636   4.066234     1.56   0.123    -1.759599    14.48687
             |
       _cons |         21   3.740391     5.61   0.000     13.52771    28.47229
------------------------------------------------------------------------------


The default with regression commands is to omit one category, that for
the lowest value taken by the categorical variable. For variable rep78, this is
the value 1. To see what category is the base (or omitted) category, add the
allbaselevels option after the command (here regress). To change the
base category, use the ib. operator instead of the i. operator. For example,
the regress mpg ib2.rep78 command will omit the category rep78 = 2,
and the regress mpg ib(last).rep78 command will omit the highest-valued
category (here rep78 = 5). Alternatively, the command fvset base
5 rep78 will permanently set the fifth category to be the base category.
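

The sketch below collects these alternatives in one place. The final
command, fvset clear, is included on the assumption that you want to
restore the default base after experimenting.

```stata
sysuse auto, clear
* Default: the lowest category (rep78 = 1) is the base; show it explicitly
regress mpg i.rep78, allbaselevels
* Base category is rep78 = 2, for this command only
regress mpg ib2.rep78
* Base category is the highest-valued category, for this command only
regress mpg ib(last).rep78
* Make category 5 the base for all later commands that use i.rep78
fvset base 5 rep78
regress mpg i.rep78
* Undo the fvset setting
fvset clear rep78
```

Changing the base category changes the reported coefficients but not the
fit of the model.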


A complete set of indicators, with no category omitted, is included using
the ibn. operator with the hascons option, which omits the intercept. For
example,


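. * Factor variables for rep78 - no category is omitted
. regress mpg ibn.rep78, hascons

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(4, 64)        =      4.91
       Model |  549.415777         4  137.353944   Prob > F        =    0.0016
    Residual |  1790.78712        64  27.9810488   R-squared       =    0.2348
-------------+----------------------------------   Adj R-squared   =    0.1869
       Total |   2340.2029        68  34.4147485   Root MSE        =    5.2897

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       rep78 |
          1  |         21   3.740391     5.61   0.000     13.52771    28.47229
          2  |     19.125   1.870195    10.23   0.000     15.38886    22.86114
          3  |   19.43333   .9657648    20.12   0.000       17.504    21.36267
          4  |   21.66667   1.246797    17.38   0.000      19.1759    24.15743
          5  |   27.36364   1.594908    17.16   0.000     24.17744    30.54983
------------------------------------------------------------------------------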
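

With a complete set of indicators and no intercept, each coefficient is the
sample mean of mpg within the corresponding category. A short sketch that
checks this for the first category follows.

```stata
sysuse auto, clear
* Mean of mpg within each repair-record category
tabulate rep78, summarize(mpg)
quietly regress mpg ibn.rep78, hascons
* Coefficient for rep78 = 1 ...
display _b[1.rep78]
* ... compared with the sample mean of mpg for rep78 = 1
summarize mpg if rep78 == 1
```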


A complete set of interactions between two (or more) categorical
variables can be created using the # operator. For example, consider an
interaction between categorical variable rep78 and categorical variable
foreign (a binary indicator). We have


. * Factor variables for interaction between two categorical variables
. regress mpg i.rep78#i.foreign, allbaselevels
note: 1b.rep78#1.foreign identifies no observations in the sample.
note: 2.rep78#1.foreign identifies no observations in the sample.

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(7, 61)        =      4.88
       Model |  839.550121         7  119.935732   Prob > F        =    0.0002
    Residual |  1500.65278        61  24.6008652   R-squared       =    0.3588
-------------+----------------------------------   Adj R-squared   =    0.2852
       Total |   2340.2029        68  34.4147485   Root MSE        =    4.9599

-------------------------------------------------------------------------------
          mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
rep78#foreign |
  1#Domestic  |          0  (base)
   1#Foreign  |          0  (empty)
  2#Domestic  |     -1.875   3.921166    -0.48   0.634    -9.715855    5.965855
   2#Foreign  |          0  (empty)
  3#Domestic  |         -2   3.634773    -0.55   0.584    -9.268178    5.268178
   3#Foreign  |   2.333333   4.527772     0.52   0.608    -6.720507    11.38717
  4#Domestic  |  -2.555556   3.877352    -0.66   0.512     -10.3088     5.19769
   4#Foreign  |   3.888889   3.877352     1.00   0.320    -3.864357    11.64213
  5#Domestic  |         11   4.959926     2.22   0.030     1.082015    20.91798
   5#Foreign  |   5.333333   3.877352     1.38   0.174    -2.419912    13.08658
              |
        _cons |         21   3.507197     5.99   0.000     13.98693    28.01307
-------------------------------------------------------------------------------


Here the base (omitted) category is rep78 = 1 and foreign = 0 (the 
lowest-valued joint category). Additionally, there are zero observations 
falling into two of the categories: rep78 = 1 and foreign = 1, and 
rep78 = 2 and foreign = 1. 


The ## operator creates a factorial interaction that includes sets of 
indicator variables for each of the two categorical variables, in addition to 
the interactions given by the # operator. For example, the command regress 
mpg i.rep78##i.foreign is equivalent to the command regress mpg
i.rep78 i.foreign i.rep78#i.foreign. 
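
As a quick check of this equivalence, the following sketch (not part of the original text) fits both specifications and compares the R-squared stored in e(r2), a stored result of the kind discussed in section 1.6. The scalar name r2_factorial is an arbitrary, hypothetical choice.

```stata
* Sketch: ## is shorthand for main effects plus # interactions
sysuse auto, clear
quietly regress mpg i.rep78##i.foreign
scalar r2_factorial = e(r2)          // hypothetical scalar name
quietly regress mpg i.rep78 i.foreign i.rep78#i.foreign
display "R-squared using ##       = " r2_factorial
display "R-squared written out    = " e(r2)
```

If the two commands are indeed equivalent, the two displayed values should coincide.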


Factor-variable operators can also be used to create interactions between 
indicator variables and continuous regressors. In that case, the prefix 
c. needs to be used to signal that the interaction is with a continuous 
variable. For example, 


. * Factor variables for interaction between categorical and continuous variables 
. regress mpg i.rep78#c.weight 


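      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =     24.03
       Model |  1535.21253         5  307.042506   Prob > F        =    0.0000
    Residual |   804.99037        63  12.7776249   R-squared       =    0.6560
-------------+----------------------------------   Adj R-squared   =    0.6287
       Total |   2340.2029        68  34.4147485   Root MSE        =    3.5746

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
rep78#c.weight |
          1  |  -.0056832   .0010153    -5.60   0.000    -.0077122   -.0036542
          2  |  -.0058149   .0006781    -8.57   0.000      -.00717   -.0044597
          3  |   -.005717   .0005886    -9.71   0.000    -.0068932   -.0045409
          4  |  -.0057904   .0006745    -8.58   0.000    -.0071383   -.0044424
          5  |  -.0051682   .0009273    -5.57   0.000    -.0070212   -.0033151
       _cons |   38.51076   1.926584    19.99   0.000     34.66078    42.36073
------------------------------------------------------------------------------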


In this continuous interaction example, there is no omitted category—all five 
possible values of rep78 are interacted with the continuous variable weight. 


Factor-variable operators also permit interaction of continuous variables
with continuous variables. For example, the following performs OLS
regression of mpg on price and a quadratic in weight.


. * Factor variables for interaction between two continuous variables 
. regress mpg price c.weight c.weight#c.weight, noheader 


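------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.0002597   .0001696            0.130     -.000598    .0000786
      weight |   -.016047   .0040403            0.000     -.024105   -.0079889
             |
    c.weight#|
    c.weight |   1.72e-06   6.71e-07            0.013     3.79e-07    3.06e-06
             |
       _cons |   54.66807   6.150716     8.89   0.000     42.40086    66.93529
------------------------------------------------------------------------------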


In total, there are five factor-variable operators: i., c., o., #, and ##. The
o. operator is used to omit a continuous variable (for example, o.price) or
an indicator variable (for example, o5.rep78 to omit the indicator variable
for rep78 = 5).


For more on factor variables, type help factor variables, or see
[U] 11.4.3 Factor variables and [U] 26 Working with categorical data and
factor variables. To check whether a command such as regress supports
factor variables, type help regress; the text below the syntax summary
includes the note “indepvars may contain factor variables; see fvvarlist.”


1.3.5 Abbreviations, case sensitivity, and wildcards 


Commands and parts of commands can be abbreviated to the shortest string
of characters that uniquely identifies them, often just two or three characters.
For example, we can shorten summarize to su. For expositional clarity, we 
do not use such abbreviations in this book; two notable exceptions are that 
we use qui rather than quietly and that we may use abbreviations in the 
options to graphics commands because these commands can get very 
lengthy. Not using abbreviations makes it much easier to read your do-files. 


Variable names can be up to 32 characters long, where the characters can
be A-Z, a-z, any Unicode letter, 0-9, and _ (underscore). Some names, such
as in, are reserved. Stata is case sensitive, and the norm is to use lowercase.


We can use the wildcard * (asterisk) for variable names in commands, 
provided there is no ambiguity such as two potential variables for a one- 
variable command. For example, 


. * Wildcard (asterisk) example 
. summarize t* 


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       trunk |         74    13.75676    4.277404          5         23
        turn |         74    39.64865    4.399354         31         51


provides summary statistics for all variables with names beginning with the 
letter t. Where ambiguity may arise, wildcards are not permitted. 


1.3.6 Arithmetic, relational, and logical operators 


The arithmetic operators in Stata are + (addition), - (subtraction), *
(multiplication), / (division), ^ (raised to a power), and the prefix -
(negation). For example, to compute and display -2 × {9/(8 + 2 - 7)}^2,
which simplifies to -2 × 3^2, we type


. * Arithmetic example and display of result
. display -2*(9/(8+2-7))^2
-18


If the arithmetic operation is not possible, or data are not available to 
perform the operation, then a missing value denoted by . is displayed. For 
example, 


. * Missing value created by impossible arithmetic operation
. display 2/0
.

The relational operators are > (greater than), < (less than), >= (greater
than or equal), <= (less than or equal), == (equal), and != (not equal). These
are the obvious symbols, except that a pair of equal signs is used for equality
and != denotes not equal. Relational operators are often used in if qualifiers
that define the sample for analysis.


Logical operators return 1 for true and 0 for false. The logical operators
are & (and), | (or), and ! (not). The operator ~ can be used in place of !.
Logical operators are also used to define the sample for analysis. For
example, to restrict regression analysis to smaller, less expensive cars, type


. * Example of logical operators
. regress mpg price weight if weight <= 4000 & price <= 10000


(output omitted ) 


The string operator + is used to concatenate two strings into a single, 
longer string. 


The order of evaluation of all operators is ! (or ~), ^, - (negation), /, *, - 
(subtraction), +, != (or ~=), >, <, <=, >=, ==, &, and |. 
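
The order of evaluation can be checked interactively. The short sketch below is an illustration added here, not taken from the original text; it uses only display with literal values, so no dataset is needed.

```stata
* Sketch: operator precedence and string concatenation
display -2^2                      // ^ is applied before negation: -(2^2)
display (-2)^2                    // parentheses force negation first
display 2 + 3*4 == 14             // arithmetic first, then the relation
display 1 | 0 & 0                 // & is applied before |
display "micro" + "econometrics"  // + concatenates two strings
```

When in doubt about precedence, adding parentheses is the safer choice and also makes the intent clear to a reader of the do-file.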


1.3.7 Error messages 


Stata produces error messages when a command fails. These messages are 
brief, but a fuller explanation can be obtained from the manual or directly 
from Stata. 


For example, if we regress mpg on notthere but the notthere variable 
does not exist, we get 


. regress mpg notthere 
variable notthere not found 
r(111); 


Here r(111) denotes return code 111. You can obtain further details by
clicking on r(111) or, if in interactive mode, by typing


. search rc 111


(output omitted ) 
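
Within a do-file, an error normally halts execution. One way to inspect the return code without stopping, sketched here as an addition to the text, is the capture prefix, which suppresses the error and stores the return code in _rc.

```stata
* Sketch: trap an error and inspect its return code
sysuse auto, clear
capture regress mpg notthere       // notthere does not exist in auto.dta
display "Return code = " _rc       // nonzero when the command failed
```

A return code of 0 indicates that the captured command ran without error.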


1.4 Do-files and log files 


For Stata analysis requiring many commands, or requiring lengthy 
commands, it is best to collect all the commands into a program (or script) 
that is stored in a text file called a do-file. 


In this book, we perform data analysis using a do-file. We assume that 
the do-file and, if relevant, any input and output files are in a common 
directory and that Stata is executed from that directory. Then, we need to 
provide only the filename rather than the complete directory structure. For 
example, we can refer to a file as mus202data.dta rather than 
C:\mus\chapter2\mus202data.dta. 


1.4.1 Writing a do-file 


A do-file is a text file with extension .do that contains a series of Stata 
commands. 


As an example, we write a two-line program that reads in the Stata 
example dataset auto.dta and then presents summary statistics for the mpg 
variable, which we already know is in the dataset. The commands are 
sysuse auto.dta, clear, where the clear option is added to remove the 
current dataset from memory, and summarize mpg. The two commands are 
to be collected into a command file called a do-file. The filename should 
include no spaces, and the file extension is .do. 


In this example, we suppose this file is given the name example.do and 
is stored in the current working directory. 


To see the current directory, type cd without any arguments. To change
to another directory, type cd with an argument. For example, in Windows,
to change to the directory c:\Program Files\Stata17\, we type


. cd "c:\Program Files\Stata17" 
c:\Program Files\Stata17 


The directory name is given in double quotes because it includes spaces. 
Otherwise, the double quotes are unnecessary. 


One way to create the do-file is to start Stata and use the Do-file Editor. 
Within Stata, we select Window > Do-file Editor > New Do-file Editor, 
type in the commands, and save the do-file. 


Alternatively, type in the commands outside Stata by using a preferred 
text editor. Ideally, this text editor supports multiple windows, reads large 
files (datasets or output), and gives line numbers and column numbers. 


The type command lists the contents of the previously created file. We 
have 


. type example.do 
sysuse auto, clear 
summarize mpg 


1.4.2 Running do-files 


You can run (or execute) an already-written do-file by using the Command 
window. Start Stata, and, in the Command window, change directory (cd) to 
the directory that has the do-file, and then issue the do command. We obtain 


. do example 


. sysuse auto, clear 
(1978 automobile data) 


. summarize mpg 

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         74     21.2973    5.785503         12         41


end of do-file 


where we assume that example.do is in directory C:\Program
Files\Stata17\.


An alternative simpler method is to run the do-file from the Do-file 
Editor. Select Window > Do-file Editor > New Do-file Editor, select File 
> Open... and the appropriate file, and finally select Tools > Execute (do). 


An advantage to using the Do-file Editor is that you can highlight or select 
just part of the do-file and then execute this part by selecting Tools > 
Execute selection (include). In either case, you can more simply hit the 
execute icon (the rightmost icon). This method is especially useful when 
developing a program because it is easy to make modifications in the Do- 
file Editor and reexecute. 


Finally, and often most simply, to execute example.do, for example, 
double-click on example.do in File Explorer. This initiates Stata and opens 
example.do in the Do-file Editor. Furthermore, this sets the working 
directory to be the directory that example.do was in. 


1.4.3 Log files 


By default, Stata output is sent to the screen. For reproducibility, you should 
save this output in a separate file. Another advantage to saving output is 
that lengthy output can be difficult to read on the screen; it can be easier to 
review results by viewing an output file using a text editor. 


A Stata output file is called a log file. It stores the commands in addition
to the output from these commands. The default extension for a plain text
log file is .log, but you can choose an alternative extension, such as .txt.
An extension name change may be worthwhile because several other
programs, such as LaTeX compilers, also create files with the .log
extension. Log files can be read either as standard text or in a special Stata
code called SMCL (Stata Markup and Control Language). We use text
throughout this book because it is easier to read in a text editor. A useful
convention can be to give the log the same filename as that for the do-file.
For instance, for example.do, we save the output as example.txt.


A log file is created by using the log command. In a typical analysis,
the do-file will change over time, in which case the output file will also
change. The Stata default is to protect against an existing log being
accidentally overwritten. To create a log file in text form named
example.txt, you usually type


. * Create log file as a text file 
. log using example.txt, text replace 


The replace option permits the existing version of example.txt, if there is 
one, to be overwritten. Without replace, Stata will refuse to open the log 
file if there is already a file called example.txt. 


In some cases, we may not want to overwrite the existing log, in which 
case we would not specify the replace option. The most likely reason for 
preserving a log is that it contains important results, such as those from 
final analysis. Then it can be good practice to rename the log after analysis
is complete. Thus, example.txt might be renamed example07052021.txt.


When a program is finished, you should close the log file by typing log
close.
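
Putting these pieces together, a do-file that logs its own output might look like the following sketch. The initial capture log close is an addition not discussed in the text; it is assumed here as a guard so that a log left open by an earlier, interrupted run does not stop the do-file.

```stata
* Sketch of example.do with logging
capture log close                      // close any log still open; ignore error if none
log using example.txt, text replace    // plain text log, overwritten on each run
sysuse auto, clear
summarize mpg
log close
```

After the do-file runs, example.txt in the working directory contains both the commands and their output.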


The log can be very lengthy. If you need a hard copy, you can edit the 
log to include only essential results. The text editor you use should use a 
monospace font such as Courier New, where each character takes up the 
same space, so that output table columns will be properly aligned. 


The log file includes the Stata commands, with a dot (.) prefix, and the 
output. You can use a log file to create a do-file, if a do-file does not already 
exist, by deleting the dot and all lines that are command results (no dot). By 
this means, you can do initial work using the Stata GUI and generate a do- 
file from the session, provided that you created a log file at the beginning of 
the session. The cmdlog command creates a file that contains just the typed 
commands. 


1.4.4 A three-step process 


Data analysis using Stata can repeatedly use the following three-step 
process: 


1. Create or change the do-file. 
2. Execute the do-file in Stata. 
3. Read the resulting log with a text editor. 


If the Stata Do-file Editor is used, then one can execute highlighted lines of 
code, rather than the entire do-file. 


The initial do-file can be written by editing a previously written do-file 
that is a useful template or starting point, especially if it uses the same 
dataset or the same commands as the current analysis. The resulting log 
may include Stata errors or estimation results that lead to changes in the 
original do-file and so on. 


Suppose we have fit several models and now want to fit an additional 
model. In interactive mode, we would type in the new command, execute it, 
and see the results. Using the three-step process, we add the new command 
to the do-file, execute the do-file or a subcomponent, and read the new 
output. Because many Stata programs execute in seconds, this adds little 
extra time compared with using interactive mode, and it has the benefit of 
having a do-file that can be modified for later use. 


1.4.5 Comments and long lines 


Stata do-files can include comments. This can greatly increase 
understanding of a program, which is especially useful if you return to a 
program and its output a year or two later. Lengthy single-line comments 
can be allowed to span several lines, ensuring readability. There are several 
ways to include comments: 


• For single-line comments, begin the line with an asterisk (*); Stata
ignores such lines.

• For a comment on the same line as a Stata command, use two slashes
(//) after the Stata command.

• For multiple-line comments, place the commented text between slash-
star (/*) and star-slash (*/).


The Stata default is to view each line as a separate Stata command, 
where a line continues until a carriage return (end-of-line or Enter key) is 
encountered. Some commands, such as those for nicely formatted graphs, 
can be very long. For readability, these commands need to span more than 
one line. The easiest way to break a line at, say, the 70th column is by using 
three slashes (///) and then continuing the command on the next line. 


The following do-file code includes several comments to explain the 
program and demonstrates how to allow a command to span more than one 
line. 


* Demonstrate use of comments
* This program reads in system file auto.dta and gets summary statistics
clear        // Remove data from memory
* The next code shows how to allow a single command to span two lines
sysuse ///
    auto
summarize


For long commands, you can alternatively use the #delimit command. 
This changes the delimiter from the Stata default, which is a carriage return 
(that is, end of line), to a semicolon. This also permits more than one 
command on a single line. The following code changes the delimiter from 
the default to a semicolon and back to the default: 


* Change delimiter from cr to semicolon and back to cr
#delimit ;
* More than one command per line and command spans more than one line;
clear; sysuse
    auto; summarize;
#delimit cr


We recommend using /// instead of changing the delimiter because the 
comment method produces more readable code. 


1.4.6 Different implementations of Stata 


The different platforms for Stata share the same command syntax; however,
commands can change across versions of Stata. For this book, we use
Stata 17. To ensure that later versions of Stata will continue to work with
our code, we include the version 17 command near the beginning of the
do-file.
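
As a minimal sketch (an illustration added here, assuming Stata 17 or later is installed), the opening lines of such a do-file might be as follows. The c(version) and c(stata_version) values distinguish the version under which commands are interpreted from the version of Stata actually running.

```stata
* Sketch: opening lines of a do-file written for Stata 17
version 17                 // interpret the commands below as Stata 17 would
sysuse auto, clear
summarize mpg
display "Commands are interpreted as version " c(version)
display "The installed Stata is version      " c(stata_version)
```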


Different implementations of Stata have different limits on, for
example, the maximum number of variables in a dataset. These maximum
possible values vary with the edition of Stata: Stata/BE, Stata/SE, or
Stata/MP. The help limits command provides details on the limits for the
current implementation of Stata. The query and creturn list commands
detail the current settings that may be below these limits.


Current settings can be increased or decreased with the set command, 
provided this does not lead to the maximum limit being exceeded. For 
example, 


. set maxvar 10000 


sets the maximum number of variables in a dataset to 10,000. 
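
Before changing a setting, it can help to look at its current value. The following sketch, added here for illustration, displays the current value of maxvar through the c() system values and then lists the memory-related settings.

```stata
* Sketch: inspect current settings before changing them
display "Current maxvar = " c(maxvar)
query memory                           // memory-related settings, including maxvar
```

Whether set maxvar is permitted, and how high it can go, depends on the edition of Stata in use, as noted above.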


1.5 Scalars and matrices 


Scalars can store a single number or a single string, and matrices can store 
several numbers or strings as an array. We provide a very brief introduction 
here, sufficient for use of the scalars and matrices in section 1.6. 


1.5.1 Scalars 


A scalar can store a single number or string. You can display the contents of 
a scalar by using the display command. 


For example, to store the number 2 × 3 as the scalar a, store a string as
the scalar b, and then display the two scalars, we type


. * Scalars: Example
. scalar a = 2*3

. scalar b = "2 times 3 = "

. display b a
2 times 3 = 6


One common use of scalars, detailed in section 1.6, is to store the scalar 
results of estimation commands that can then be accessed for use in 
subsequent analysis. In section 1.7, we discuss the relative merits of using a 
scalar or a macro to store a scalar quantity. Scalars can have the same name 
as variables, in which case variables take precedence. 
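
The following sketch, an illustration that is not in the original text, shows this precedence using auto.dta, which contains a variable named price. The scalar() pseudofunction is used to request the scalar explicitly.

```stata
* Sketch: a scalar and a variable with the same name
sysuse auto, clear
scalar price = 1
display price              // the variable takes precedence (its first observation)
display scalar(price)      // scalar() forces use of the scalar
```

Giving scalars names that cannot be confused with variable names avoids the ambiguity altogether.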


1.5.2 Matrices 


Stata provides two distinct ways to use matrices, both of which store several 
numbers or strings as an array. One way is through Stata commands that 
have the matrix prefix. The second way is a much more powerful matrix 
programming language, Mata. These two methods are presented in, 
respectively, appendixes A and B. 


The following Stata code illustrates the definition of a specific 2 × 3
matrix, the listing of the matrix, and the extraction and display of a specific
element of the matrix.


. * Matrix commands: Example
. matrix define A = (1,2,3 \ 4,5,6)

. matrix list A

A[2,3]
    c1  c2  c3
r1   1   2   3
r2   4   5   6

. scalar c = A[2,3]

. display c
6


1.6 Using results from Stata commands 


One goal of this book is to enable analysis that uses more than just official 
Stata commands and printed output. Much of this additional analysis entails 
further computations after using Stata commands. 


1.6.1 Using results from the r-class command summarize 


The Stata commands that analyze the data but do not estimate parameters are 
r-class commands. All r-class commands save their results in r(). The
contents of r() vary with the command and are listed by typing return
list.


As an example, we list the results stored after using summarize: 


. * Illustrate use of return list for r-class command summarize
. summarize mpg

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         74     21.2973    5.785503         12         41


. return list 


scalars: 
r(N) = 74 
r(sum_w) = 74 
r(mean) = 21.2972972972973 
r(Var) = 33.47204738985561 
r(sd) = 5.785503209735141 
r(min) = 12 
r(max) = 41 
r(sum) = 1576 


There are eight separate results stored as Stata scalars with the names r(N),
r(sum_w),..., r(sum). These are fairly obvious aside from r(sum_w), which
gives the sum of the weights. Several additional results are returned if the
detail option to summarize is used; see [R] summarize.
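
As a brief sketch of the additional results (an illustration added here), the code below reruns summarize with the detail option, lists what is stored, and displays two of the extra results, the median r(p50) and the skewness r(skewness).

```stata
* Sketch: additional r() results after summarize, detail
sysuse auto, clear
quietly summarize mpg, detail
return list
display "Median = " r(p50) "   Skewness = " r(skewness)
```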


The following code calculates and displays the range of the data: 


. * Illustrate use of r()
. qui summarize mpg

. scalar range = r(max) - r(min)

. display "Sample range = " range
Sample range = 29


The results in r() disappear when a subsequent r-class or e-class 
command is executed. We can always save the value as a scalar. It can be 
particularly useful to save the sample mean. 


. * Save a result in r() as a scalar
. scalar mpgmean = r(mean)
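
One use of a saved mean, sketched below as an added illustration, is to create a demeaned version of the variable. The variable name mpgdev is hypothetical.

```stata
* Sketch: use a saved r() result to demean a variable
sysuse auto, clear
quietly summarize mpg
scalar mpgmean = r(mean)
generate mpgdev = mpg - mpgmean    // hypothetical variable name
summarize mpgdev                   // mean should be zero up to rounding
```

Because the mean was saved as a scalar, it remains available even though the second summarize overwrites the contents of r().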


1.6.2 Using results from the e-class command regress 


Estimation commands are e-class commands (or estimation-class
commands), such as regress. The results are stored in e(), the contents of
which you can view by typing ereturn list.


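A leading example is regress for OLS regression. For example, after you
type


. regress mpg price weight

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     66.85
       Model |  1595.93249         2  797.966246   Prob > F        =    0.0000
    Residual |  847.526967        71  11.9369995   R-squared       =    0.6531
-------------+----------------------------------   Adj R-squared   =    0.6434
       Total |  2443.45946        73  33.4720474   Root MSE        =     3.455

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.0000935   .0001627    -0.57   0.567     -.000418    .0002309
      weight |  -.0058175   .0006175    -9.42   0.000    -.0070489   -.0045862
       _cons |   39.43966   1.621563    24.32   0.000     36.20635    42.67296
------------------------------------------------------------------------------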


ereturn list yields 


. * ereturn list after e-class command regress
. ereturn list

scalars:
                  e(N) =  74
               e(df_m) =  2
               e(df_r) =  71
                  e(F) =  66.84814256414501
                 e(r2) =  .6531446579233134
               e(rmse) =  3.454996314099513
                e(mss) =  1595.932492798133
                e(rss) =  847.5269666613265
               e(r2_a) =  .6433740849070687
                 e(ll) =  -195.2169813478502
               e(ll_0) =  -234.3943376482347
               e(rank) =  3

macros:
            e(cmdline) : "regress mpg price weight"
              e(title) : "Linear regression"
          e(marginsok) : "XB default"
                e(vce) : "ols"
             e(depvar) : "mpg"
                e(cmd) : "regress"
         e(properties) : "b V"
            e(predict) : "regres_p"
              e(model) : "ols"
          e(estat_cmd) : "regress_estat"

matrices:
                  e(b) :  1 x 3
                  e(V) :  3 x 3

functions:
             e(sample)


The key numeric output on sums of squares and degrees of freedom
given in the analysis-of-variance table is stored as scalars. As an example
of using scalar results, consider the calculation of R². The model sum of
squares is stored in e(mss), and the residual sum of squares is stored in
e(rss), so that


. * Use of e() where scalar
. scalar r2 = e(mss)/(e(mss)+e(rss))

. display "r-squared = " r2
r-squared = .65314466


The result is the same as the 0.6531 given in the original regression output. 
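
The same approach recovers other entries in the regression header. The sketch below, added for illustration, computes the root mean squared error and the overall F statistic from the stored sums of squares and degrees of freedom; the scalar names rmse and Fstat are hypothetical.

```stata
* Sketch: recompute Root MSE and F from e() scalars
sysuse auto, clear
quietly regress mpg price weight
scalar rmse  = sqrt(e(rss)/e(df_r))                  // sqrt of residual mean square
scalar Fstat = (e(mss)/e(df_m))/(e(rss)/e(df_r))     // model MS / residual MS
display "Root MSE = " rmse "   F = " Fstat
```

These can be compared with the values stored directly in e(rmse) and e(F).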


The remaining numeric output is stored as matrices. Here we present
methods to extract scalars from these matrices and manipulate them.
Specifically, we obtain the OLS coefficient of price from the 1 × 3 matrix
e(b) and the estimated variance of this estimate from the 3 × 3 matrix e(V),
and then we form the t statistic for testing whether the coefficient of price
is 0:


. * Use of e() where matrix
. matrix best = e(b)

. scalar bprice = best[1,1]

. matrix Vest = e(V)

. scalar Vprice = Vest[1,1]

. scalar tprice = bprice/sqrt(Vprice)

. display "t statistic for H0: b_price = 0 is " tprice
t statistic for H0: b_price = 0 is -.57468079


The result is the same as the -0.57 given in the original regression output.


Stata 16 and subsequent versions permit use of matrix subscripts and 
application of matrix functions returning scalars to r() and e() matrices. For 
example, 


. * Direct use of matrix subscripts for e() matrices
. display e(b)[1,1]/sqrt(e(V)[1,1])
-.57468079
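
An alternative that avoids positional subscripts, sketched here as an addition to the text, is to refer to the coefficient and its standard error by regressor name using _b[] and _se[]. The two-sided p-value then follows from the ttail() function with e(df_r) degrees of freedom. The scalar names are hypothetical.

```stata
* Sketch: t statistic and p-value by regressor name
sysuse auto, clear
quietly regress mpg price weight
scalar tprice = _b[price]/_se[price]
scalar pprice = 2*ttail(e(df_r), abs(tprice))    // two-sided p-value
display "t = " tprice "   p = " pprice
```

Referring to regressors by name keeps the code correct if the order of the regressors is later changed.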


The results in e() disappear when a subsequent e-class command is 
executed. However, you can save the results by using estimates store, 
detailed in section 3.5.6. 


1.7 Global and local macros 


A macro is a string of characters that stands for another string of characters. 
For example, you can use the macro xlist in place of "price weight". This 
substitution can lead to code that is shorter, is easier to read, and can be 
easily adapted to similar problems. 


Macros can be global or local. A global macro is accessible across Stata 
do-files or throughout a Stata session. A local macro can be accessed only 
within a given do-file or in the interactive session. 


1.7.1 Global macros 


Global macros are the simplest macro and are adequate for many purposes. 
We use global macros extensively throughout this book. 


Global macros are defined with the global command. To access what
was stored in a global macro, put the character $ immediately before the
macro name. For example, consider a regression of the dependent variable
mpg on several regressors, where the global macro xlist is used to store the
regressor list.


. * Global macro definition and use 
. global xlist price weight 


. regress mpg $xlist, noheader // $ prefix for global macro is necessary 
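------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.0000935   .0001627    -0.57   0.567     -.000418    .0002309
      weight |  -.0058175   .0006175    -9.42   0.000    -.0070489   -.0045862
       _cons |   39.43966   1.621563    24.32   0.000     36.20635    42.67296
------------------------------------------------------------------------------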


Global macros are frequently used when fitting several different models 
with the same regressor list because they ensure that the regressor list is the 
same in all instances and they make it easy to change the regressor list. A 
single change to the global macro changes the regressor list in all instances. 


A second example might be where several different models are fit, but
we want to hold a key parameter constant throughout. For example, suppose
we obtain standard errors by using the bootstrap. Then we might define the
global macro nbreps for the number of bootstrap replications. Exploratory
data analysis might set nbreps to a small value such as 50 to save
computational time, whereas final results set nbreps to an appropriately
higher value such as 400.


A third example is to highlight key program parameters, such as the 
variable used to define the cluster if cluster-robust standard errors are 
obtained. By gathering all such global macros at the start of the program, we 
can know what the settings are for key program parameters. 
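
The following sketch, added for illustration, gathers three such global macros at the top of a do-file and then uses them. The choice of rep78 as the cluster variable is purely hypothetical and is made only so that the code runs on auto.dta; clustvar is likewise an assumed macro name.

```stata
* Sketch: key settings collected as global macros at the start of the do-file
global xlist price weight       // regressor list
global nbreps 50                // bootstrap replications; raise for final results
global clustvar rep78           // hypothetical cluster variable

sysuse auto, clear
regress mpg $xlist, vce(bootstrap, reps($nbreps) seed(10101))
regress mpg $xlist, vce(cluster $clustvar)
```

Changing any of the three settings requires editing only the corresponding global line.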


1.7.2 Local macros 


Local macros are defined with the local command. To access what was
stored in the local macro, enclose the macro name in single quotes: a left
quote (`) before the name and a right quote (') after it. These are two
different characters. On most keyboards, the left quote is located at the
upper left, under the tilde, and the right quote is located at the middle
right, under the double quote.


As an example of a local macro, consider a regression of the mpg variable
on several regressors. We define the local macro xlist and subsequently
access its contents by enclosing the name in single quotes as `xlist'.


. * Local macro definition and use
. local xlist "price weight"

. regress mpg `xlist', noheader    // Single quotes are necessary
------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.0000935   .0001627    -0.57   0.567     -.000418    .0002309
      weight |  -.0058175   .0006175    -9.42   0.000    -.0070489   -.0045862
       _cons |   39.43966   1.621563    24.32   0.000     36.20635    42.67296
------------------------------------------------------------------------------


The double quotes used in defining the local macro as a string are 
unnecessary, which is why we did not use them in the earlier global macro 
example. Using the double quotes does emphasize that a text substitution has 
been made. The single quotes in subsequent references to xlist are 
necessary. 


We could also use a macro to define the dependent variable. For 
example, 


. * Local macro definition without double quotes
. local y mpg

. regress `y' `xlist', noheader
------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.0000935   .0001627    -0.57   0.567     -.000418    .0002309
      weight |  -.0058175   .0006175    -9.42   0.000    -.0070489   -.0045862
       _cons |   39.43966   1.621563    24.32   0.000     36.20635    42.67296
------------------------------------------------------------------------------


Note that here `y' is not a variable with N observations. Instead, it is the
string mpg. The regress command simply replaces `y' with the text mpg,
which in turn denotes a variable that has N observations.


We can also define a local macro through evaluation of a function. For 
example, 


. * Local macro definition through function evaluation
. local z = 2+2

. display `z'
4


leads to `z' being the string 4. Using the equality sign when defining a
macro causes the macro to be evaluated as an expression. For numerical 
expressions, using the equality sign stores the result of the expression and 
not the characters in the expression itself in the macro. For string 
assignments, it is best not to use the equality sign. This is especially true 
when storing lists of variables in macros. 
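
The difference can be seen in the following sketch, an added illustration with hypothetical macro names a and b. Enclosing the macro in double quotes in display shows the characters actually stored.

```stata
* Sketch: local macro with and without the equality sign
local a 2+2            // stores the characters 2+2
local b = 2+2          // evaluates the expression and stores the result
display "`a'"          // shows what is stored in a
display "`b'"          // shows what is stored in b
display `a'            // without quotes, display evaluates the substituted text
```

Run these lines together in one do-file or one selection, because a local macro exists only within the program or session that defines it.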


Local macros are especially useful for programming in Stata; see 
appendix A. Then, for example, you can use `y' and `x' as generic notation
for the dependent variable and regressors, making the code easier to read. 


Local macros apply only to the current program and have the advantage 
of no potential conflict with other programs. They are preferred to global 
macros, unless there is a compelling reason to use global macros. 


1.7.3 Scalar or macro? 


A macro can be used in place of a scalar, but a scalar is simpler. 
Furthermore, [P] scalar points out that using a scalar will usually be faster 
than using a macro because a macro requires conversion into and out of 
internal binary representation. This reference also gives an example where 
macros lead to a loss of accuracy because of these conversions. 


One drawback of a scalar, however, is that the scalar is dropped 
whenever clear all is used. By contrast, a macro is still retained. Consider 
the following example: 


. * Scalars disappear after clear all but macro does not
. global b 3

. local c 4

. scalar d = 5

. clear

. display $b _skip(3) `c'    // Display macros
3   4

. display d    // Display the scalar
5

. clear all

. display $b _skip(3) `c'    // Display macros
3   4

. display d    // Display the scalar
d not found
r(111);


Here the scalar d has been dropped after clear all, though not after clear.


We use global macros in this text because there are cases in which we 
want the contents of our macros to be accessible across do-files. A second 
reason for using global macros is that the required $ prefix makes it clear 
that a global parameter is being used. 


1.8 Looping commands 


Loops provide a way to repeat the same command many times. We use 
loops in a variety of contexts throughout the book. 


Stata has three looping constructs: foreach, forvalues, and while. The 
foreach construct loops over items in a list, where the list can be a list of 
variable names (possibly given in a macro) or a list of numbers. The 
forvalues construct loops over consecutive values of numbers. A while 
loop continues until a user-specified condition is not met. 


We illustrate how to use these three looping constructs in creating the 
sum of four variables, where each variable is created from the uniform 
distribution. There are many variations in the way you can use these loop 
commands; see [P] foreach, [P] forvalues, and [P] while. 


The generate command is used to create a new variable. The 
runiform() function provides a draw from the uniform distribution. 
Whenever random numbers are generated, we set the seed to a specific 
value with the set seed command so that subsequent runs of the same 
program lead to the same random numbers being drawn. We have, for 
example, 


* Make artificial dataset of 100 observations on 4 uniform variables 
clear 


set obs 100 
Number of observations (_N) was 0, now 100. 


set seed 10101 
. generate xivar = runiform() 
. generate x2var = runiform() 
. generate x3var = runiform() 


. generate x4var = runiform() 


We want to sum the four variables. The obvious way to do this is 


. * Manually obtain the sum of four variables
. generate sum = x1var + x2var + x3var + x4var

. summarize sum
    Variable |   Obs       Mean   Std. dev.        Min        Max
         sum |   100   1.935471   .5911068   .4193314   3.349523


We now present several ways to use loops to progressively sum these 
variables. Although only four variables are considered here, the same 
methods can potentially be applied to hundreds of variables. 


1.8.1 The foreach loop 


We begin by using foreach to loop over items in a list of variable names. 
Here the list is x1var, x2var, x3var, and x4var.


The variable ultimately created will be called sum. Because sum already 
exists, we need to first drop sum and then generate sum=0. The replace 
sum=0 command collapses these two steps into one step, and the quietly 
prefix suppresses output stating that 100 observations have been replaced. 
Following this initial line, we use a foreach loop and additionally use 
quietly within the loop to suppress output following replace. The 
program is 

. * foreach loop with a variable list
. qui replace sum = 0

. foreach var of varlist x1var x2var x3var x4var {
  2. qui replace sum = sum + `var'
  3. }

. summarize sum
    Variable |   Obs       Mean   Std. dev.        Min        Max
         sum |   100   1.935471   .5911068   .4193314   3.349523


The result is the same as that obtained manually. 


The preceding code is an example of a program (see appendix A) with
the { brace appearing at the end of the first line and the } brace appearing
on its own at the last line of the program. The numbers 2. and 3. do not
actually appear in the program but are produced as output. In the foreach
loop, we refer to each variable in the variable list varlist by the local
macro named var, so that `var' with single quotes is needed in subsequent
uses of var. The choice of var as the local macro name is arbitrary, and
other names can be used. For a variable list, the word varlist is necessary.
Types of lists other than variable lists are possible, in which case we use
numlist, newlist, global, or local; see [P] foreach.
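

The other list types follow the same pattern. The following is a minimal sketch, assuming the four variables created above; the macro names k, v, and xlist are our own illustrative choices. It loops first over a list of numbers and then over the words stored in a local macro.

```stata
* foreach over a list of numbers: k takes the values 1, 2, 3, 4
foreach k of numlist 1/4 {
    display "x`k'var"
}

* foreach over the contents of a local macro (give the macro name without quotes)
local xlist x1var x2var x3var x4var
foreach v of local xlist {
    display "`v'"
}
```

Both loops simply display the four variable names; in practice, the display line would be replaced by the command to be repeated.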


An attraction of using a variable list is that the method can be applied
when variable names are not sequential. For example, the variable names
could have been incomehusband, incomewife, incomechild1, and
incomechild2.


1.8.2 The forvalues loop


A forvalues loop iterates over consecutive values. In the following code,
we let the index be the local macro i, and `i' with single quotes is needed
in subsequent uses of i. The program


. * forvalues loop to create a sum of variables
. qui replace sum = 0

. forvalues i = 1/4 {
  2. qui replace sum = sum + x`i'var
  3. }

. summarize sum
    Variable |   Obs       Mean   Std. dev.        Min        Max
         sum |   100   1.935471   .5911068   .4193314   3.349523


produces the same result. 


The choice of the name i for the local macro was arbitrary. In this 
example, the increment is one, but you can use other increments. For 
example, if we use forvalues i = 1(2)11, then the index goes from 1 to 
11 in increments of 2. 
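

As a small sketch of this increment syntax (the displayed text is purely illustrative), the following loop visits the index values 1, 3, 5, 7, 9, and 11.

```stata
* Index runs from 1 to 11 in steps of 2
forvalues i = 1(2)11 {
    display "index = `i'"
}
```

A negative increment is also permitted, so forvalues i = 10(-2)0 would count down from 10 to 0 in steps of 2.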


1.8.3 The while loop 


A while loop continues until a condition is no longer met. This method is 
used when foreach and forvalues cannot be used. For completeness, we 
apply it to the summing example. 


In the following code, the local macro i is initialized to 1 and then
incremented by 1 in each loop; looping continues, provided that i ≤ 4.


. * While loop and local macros to create a sum of variables
. qui replace sum = 0

. local i 1

. while `i' <= 4 {
  2. qui replace sum = sum + x`i'var
  3. local i = `i' + 1
  4. }

. summarize sum
    Variable |   Obs       Mean   Std. dev.        Min        Max
         sum |   100   1.935471   .5911068   .4193314   3.349523


1.8.4 The continue command 


The continue command provides a way to prematurely cease execution of 
the current loop iteration. This may be useful if, for example, the loop 
includes taking the log of a number and we want to skip this iteration if the 
number is negative. Execution then resumes at the start of the next loop 
iteration, unless the break option is used. For details, see help continue. 
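

The log example can be sketched as follows. This is an illustration under our own assumptions: the four numbers in the list are arbitrary, and we skip zero as well as negative values because the log is undefined for both.

```stata
* Skip the current iteration when the log is undefined
foreach x in 4 -1 0 9 {
    if `x' <= 0 {
        display "Skipping `x': log is undefined"
        continue
    }
    display "ln(`x') = " ln(`x')
}
```

Replacing continue with continue, break would instead exit the loop entirely at the first nonpositive value.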


1.9 Mata and Python in Stata 


Stata can enact commands written in programming languages other than 
Stata. In particular, Stata has a separate matrix programming language, 
Mata. One can enter and exit Mata from Stata. Additionally, commands in 
Mata can be implemented within Stata using the prefix mata:, and 
commands in Stata can be implemented in Mata using the prefix stata:. 
Appendix B details the Mata programming language, and many Mata 
examples are given in this book. 
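

As a brief sketch of the mata: prefix, the following one-line Mata command sums a row vector and passes the result back to Stata as a scalar. The scalar name total is our own choice, and st_numscalar() is the Mata function that stores a value in a Stata scalar.

```stata
* Run a single Mata command from Stata and return the result as a scalar
mata: st_numscalar("total", sum((1, 2, 3, 4)))
display scalar(total)
```

The second line should display 10, the sum of the four elements.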


Stata version 16 introduced the python command, which provides 
similar facility for the Python language. As of Stata 17, Stata can be called 
from Python via the pystata Python package. 


More generally, the plugin command permits execution within Stata of 
compiled libraries written in other programming languages. The javacall 
command handles the special case of Java plugins. With the java 
command, Java code can be executed directly from within Stata. 


1.10 Some useful commands 


We have mentioned only a few Stata commands. See [U] 28 Commands
everyone should know for a list of key commands that everyone will find
useful.


1.11 Template do-file 


The following do-file provides a template. It captures most of the features 
of Stata presented in this chapter, aside from looping commands. 


* 1. Program name
* mus201p2template.do written 12/01/2021 is a template do-file
* 2. Write output to a log file
log using mus201p2template.txt, text replace
* 3. Stata version
version 17 // So will still run in a later version of Stata
* 4. Program explanation
* This illustrative program creates 100 uniform variates
* 5. Change Stata default settings - one example is given
set linesize 82 // Set the maximum width of Stata output
* 6. Set program parameters using global and local macros
global numobs 100
local seed 10101
local xlist xvar
* 7. Generate data and summarize
set obs $numobs
set seed `seed'
generate xvar = runiform()
generate yvar = xvar^2
summarize
* 8. Demonstrate use of results stored in r()
summarize xvar
display "Sample range = " r(max)-r(min)
regress yvar `xlist', vce(robust)
scalar r2 = e(mss)/(e(mss)+e(rss))
display "r-squared = " r2
* 9. Close output file and exit Stata
log close
exit, clear


1.12 Community-contributed commands 


We make extensive use of community-contributed commands. These are 
freely available ado-files (see section A.2.8) that are easy to install, provided 
you are connected to the Internet and, for computer lab users, that the 
computer lab places no restriction on adding components to Stata. They are 
then executed in the same way as Stata commands. 


As an example, consider instrumental-variables (IV) estimation. In some
cases, we know which community-contributed commands we want. For
example, a leading community-contributed command for IV is ivreg2, and
we type search ivreg2 to get it. More generally, we can type the broader
command


. search instrumental variables
(output omitted ) 


This gives information on IV commands available within Stata and in
packages available on the web, provided you are connected to the Internet.


Many entries are provided, often with several potential
community-contributed commands and several versions of a given
community-contributed command. The best place to begin can be an SJ article
because this code is more likely to have been closely vetted for accuracy and
written in a way suited to a range of applications. The listing from the
search command includes


SJ-7-4  st0030_3 . . . . Enhanced routines for IV/GMM estimation and testing
        . . . . . . . . C. F. Baum, M. E. Schaffer, and S. Stillman
        (help ivactest, ivendog, ivhettest, ivreg2, ivreset,
        overid, ranktest if installed)
        Q4/07   SJ 7(4):465--506
        extension of IV and GMM estimation addressing hetero-
        skedasticity- and autocorrelation-consistent standard
        errors, weak instruments, LIML and k-class estimation,
        tests for endogeneity and Ramsey's regression
        specification-error test, and autocorrelation tests
        for IV estimates and panel-data IV estimates


The entry means that it is the third revision of the package (st0030_3), and
the package is discussed in detail in SJ, volume 7, number 4 (SJ-7-4).


By left-clicking on the highlighted text st0030_3 on the first line of the
entry, you will see a new window with title, description and authors, and 
installation files for the package. By left-clicking on the help files, you can 
obtain information on the commands. By left-clicking on (click here to 
install), you will install the files into an ado-directory. 


Many community-contributed programs are stored at the Statistical 
Software Components website. These can be directly installed using the ssc 
install command. The more general net command can be used to 
download Stata packages from any source, whether the Internet or physical media.
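

For example, a minimal sketch of installing ivreg2 directly from the Statistical Software Components archive is given below. It assumes an Internet connection and permission to add files to the ado-directory; we also install ranktest because ivreg2 relies on it.

```stata
* Install ivreg2 and the ranktest package it relies on
* (the replace option overwrites any copy that is already installed)
ssc install ranktest, replace
ssc install ivreg2, replace

* Confirm which ado-file Stata will run and where it is stored
which ivreg2
```

After installation, help ivreg2 displays the help file in the same way as for an official Stata command.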


Community-contributed commands are periodically updated. Entering 
adoupdate determines the current update level and provides the option to 
install updates to community-contributed files. 


1.13 Additional resources 


For first-time users, [GS] Getting Started with Stata is very helpful, along
with analyzing an example dataset such as auto.dta interactively in Stata.
The next useful manual is [U] User's Guide, especially the early chapters.
For further resources, see section 1.2.2. 


1.14 Exercises 


1. Find information on the estimation method clogit using help and
   search. Comment on the relative usefulness of these search
   commands.


2. Download the Stata example dataset auto.dta. Obtain summary
   statistics for mpg and weight according to whether the car type is
   foreign (use the by foreign: prefix). Comment on any differences
   between foreign and domestic cars. Then, regress mpg on weight and
   foreign. Comment on any difference for foreign cars.


3. Write a do-file to repeat the previous question. This do-file should
   include a log file. Run the do-file, and then use a text editor to view the
   log file.


4. Using auto.dta, obtain summary statistics for the price variable.
   Then, use the results stored in r() to compute a scalar, cv, equal to the
   coefficient of variation (the standard deviation divided by the mean) of
   price.


5. Using auto.dta, regress mpg on price and weight. Then, use the
   results stored in e() to compute a scalar, r2adj, equal to $\bar{R}^2$. The
   adjusted $R^2$ equals $R^2 - (1 - R^2)(K - 1)/(N - K)$, where N is the
   number of observations and K is the number of regressors including
   the intercept. Also, use the results stored in e() to calculate a scalar,
   tweight, equal to the t statistic to test that the coefficient of weight is
   zero.


6. Using auto.dta, define a global macro named varlist for a variable
   list with mpg, price, and weight, and then obtain summary statistics
   for varlist. Repeat this exercise for a local macro named varlist.


7. Using auto.dta, use a foreach loop to create a variable, total, equal
   to the sum of headroom and length. Confirm by using summarize that
   total has a mean equal to the sum of the means of headroom and
   length.


8. Create a simulated dataset with 100 observations on two random
   variables that are each drawn from the uniform distribution. Use a seed
   of 12345. In theory, these random variables have a mean of 0.5 and a
   variance of 1/12. Does this appear to be the case here?


Chapter 2 
Data management and graphics 


2.1 Introduction 


The starting point of an empirical investigation based on microeconomic 
data is the collection and preparation of a relevant dataset. The primary 
sources are often government surveys and administrative data. We assume 
the researcher has such a primary dataset and do not address issues of 
survey design and data collection. Even given primary data, it is rare that 
they will be in a form that is exactly what is required for ultimate analysis. 


The process of transforming original data to a form that is suitable for 
econometric analysis is referred to as data management. This is typically a 
time-intensive task that has important implications for the quality and 
reliability of modeling carried out at the next stage. 


This process usually begins with a data file or files containing basic 
information extracted from a census or a survey. They are often organized 
by data record for a sampled entity such as an individual, a household, or a 
firm. Each record or observation is a vector of data on the qualitative and 
quantitative attributes of each individual. Typically, the data need to be 
cleaned up and recoded, and data from multiple sources may need to be 
combined. The focus of the investigation might be a particular group or 
subpopulation, for example, employed women, so that a series of criteria 
need to be used to determine whether a particular observation in the dataset 
is to be included in the analysis sample. 


In this chapter, we present the tasks involved in data preparation and 
management. These include reading in and modifying data, transforming 
data, merging data, checking data, and selecting an analysis sample. The 
rest of the book focuses on analyzing a given sample, though special 
features of handling panel data and multinomial data are given in the 
relevant chapters. 


2.2 Types of data 


All data are ultimately stored in a computer as a sequence of 0s and 1s because
computers operate on binary digits, or bits, that are either 0 or 1. There are 
several different ways to do this, with potential to cause confusion. 


2.2.1 Text or ASCII data 


A standard text format is ASCII, an acronym for American Standard Code for
Information Interchange. Regular ASCII represents $2^7 = 128$, and extended
ASCII represents $2^8 = 256$, different digits, letters (uppercase and lowercase),
and common symbols and punctuation marks. In either case, eight bits (called a
byte) are used. As examples, 1 is stored as 00110001, 2 is stored as 00110010,
3 is stored as 00110011, A is stored as 01000001, and a is stored as 01100001.
A text file that is readable on a computer screen is stored in ASCII.


A leading text-file example is a spreadsheet file that has been stored as a 
“comma-separated values” file, usually a file with the .csv extension. Here a 
comma is used to separate each data value; however, more generally, other 
separators can be used. 


Text-file data can also be stored as fixed-width data. Then no separator is
needed provided we use the knowledge that, say, columns 1-7 have the first
data entry, columns 8-9 have the second data entry, and so on.


Text data can be numeric or nonnumeric. The letter a is clearly 
nonnumeric, but depending on the context, the number 3 might be numeric or 
nonnumeric. For example, the number 3 might represent the number of doctor 
visits (numeric) or be part of a street address, such as 3 Main Street 
(nonnumeric). 


2.2.2 Internal numeric data 


When data are numeric, the computer stores them internally using a format 
different from text to enable application of arithmetic operations and to reduce 
storage. The two main types of numeric data are integer and floating point. 


Because computers work with 0s and 1s (a binary digit or bit), data are stored
in base-2 approximations to their base-10 counterparts. 


For integer data, the exact integer can be stored. The size of the integer
stored depends on the number of bytes used, where a byte is eight bits. For
example, if one byte is used, then in theory $2^8 = 256$ different integers could
be stored, such as -127, -126, ..., 127, 128.


Noninteger data, or often even integer data, are stored as floating-point
data. Standard floating-point data are stored in four bytes, where the first bit
may represent the sign, the next 8 bits may represent the exponent, and the
remaining 23 bits may represent the digits. Although all integers have an exact
base-2 representation, not all base-10 numbers do. For example, the base-10
number 0.1 is 0.000110011... in base 2, with the digits 0011 repeating
indefinitely. Thus, the more bytes in the base-2 approximation, the more
precisely it approximates the base-10 number.
Double-precision floating-point data use 8 bytes, have about 16 digits precision
(in base 10), and are sufficiently accurate for most statistical calculations.


Care is needed, however, in commands that rely on numerical equality. 
Because data are usually in base 10, but calculations are in base 2, numerical 
computation need not be exact. For example, while it is well known that 
(0.3 + 0.6 + 0.1) = (0.3 + 0.1 + 0.6), in fact the different orderings of the 
components of the sum lead to different results. We have 


. * Example of numerical error
. display %25.20f 0.3+0.6+0.1
> _n %25.20f 0.3+0.1+0.6
> _n %25.20f (0.3+0.6+0.1)-(0.3+0.1+0.6)
   0.99999999999999988898
   1.00000000000000000000
  -0.00000000000000011102


For more details, see [U] 13.12 Precision and problems therein. 
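

A common practical consequence arises when a variable stored as float is compared with a decimal constant. The following is a minimal sketch, assuming the default storage type has not been changed with set type; the variable names x and z are our own.

```stata
* Equality tests with float versus double storage
clear
set obs 1
generate x = 0.1            // stored as float by default
count if x == 0.1           // constant is evaluated in double precision
count if x == float(0.1)    // round the constant to float precision first
generate double z = 0.1
count if z == 0.1           // both sides are double precision
```

Here the first count should return 0 and the other two should return 1, because 0.1 rounded to float precision differs from 0.1 held in double precision. Wrapping the constant in float(), or storing the variable as double, avoids the mismatch.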


Stata has the numeric storage types listed in table 2.1: three are integer and 
two are floating point. 


Table 2.1. Stata’s numeric storage types 


Storage type   Bytes   Minimum                     Maximum
byte           1       -127                        100
int            2       -32,767                     32,740
long           4       -2,147,483,647              2,147,483,620
float          4       -1.70141173319 x 10^38      1.70141173319 x 10^38
double         8       -8.9884656743 x 10^307      8.9884656743 x 10^307


These internal data types have the advantage of taking fewer bytes to store
the same amount of data. For example, the integer 123456789 takes up 9 bytes
if stored as text but only 4 bytes if stored as an integer (long) or floating point
(float). For large or long numbers, the savings can clearly be much greater.
The Stata default is for floating-point data to be stored as float and for
computations to be stored as double.
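

The limits of each storage type can also be retrieved within Stata, because they are held as system values in c(); see help creturn. A short sketch follows.

```stata
* Display the storage-type limits held in c()
display c(minbyte) "  " c(maxbyte)
display c(minint) "  " c(maxint)
display c(minlong) "  " c(maxlong)
display c(maxfloat)
display c(maxdouble)
```

Using these system values in a program avoids hard-coding the limits.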


The compress command reduces the size of datasets by automatically 
converting data where appropriate to a storage type that uses fewer bytes. For 
example, a variable stored as float may be converted to int or byte. This 
command is particularly useful for very large datasets. 
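

The effect of compress can be seen in the following sketch, which uses the auto.dta example dataset shipped with Stata. The variable price2 is a hypothetical copy of price that we deliberately store as double.

```stata
* Store integer-valued data wastefully, and then let compress fix it
sysuse auto, clear
generate double price2 = price
describe price price2       // price2 uses 8 bytes per observation
compress
describe price price2       // price2 is demoted to a smaller integer type
```

Because compress changes a storage type only when no information is lost, the values of price2 are unchanged.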


Data read into Stata are stored using these various formats, and Stata data
files (.dta) use these formats. One disadvantage is that numbers in
internal-storage form cannot be read in the same way that text can; we need
to first reconvert them to a text format. A second disadvantage is that it is
not always easy to transfer data in internal format across packages, though the
Stata import and export commands make this easy for a number of formats
including Excel and SAS.


It is much easier to transfer data that is stored as text data. Downsides, 
however, are an increase in the size of the dataset compared with the same 
dataset stored in internal numeric form, and possible loss of precision in 
converting floating-point data to text format. 


2.2.3 String data 


Nonnumeric data in Stata are recorded as strings, typically enclosed in double
quotes, such as "3 Main Street". The storage type str20, for example, states
that the data should be stored as a string of length 20 characters.


In this book, we focus on numeric data and seldom use strings. Stata has
many commands for working with strings. Two useful commands are
destring, which converts string data to numeric data, and tostring, which
does the reverse.
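

A minimal sketch of both commands follows, using a small made-up dataset. The variable names agestr, age, and agetext are hypothetical.

```stata
* Convert a string variable to numeric and back again
clear
input str5 agestr
"25"
"30"
"31"
end
destring agestr, generate(age)       // new numeric variable age
tostring age, generate(agetext)      // new string variable agetext
describe
```

Using the generate() option keeps the original variable; the replace option would instead overwrite it.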


2.2.4 Formats for displaying numeric data 


Stata output and text files written by Stata format data for readability. The 
format is automatically chosen by Stata but can be overridden. 


The most commonly used format is the f format, or the fixed format. An
example is %7.2f, which means the number will be right justified and fill 7
columns with 2 digits after the decimal point. For example, 123.321 is
represented as 123.32.


The format type always begins with %. The default of right justification is
replaced by left justification if an optional - follows. Then follows an integer
for the width (number of columns), a period (.), an integer for the number of
digits following the decimal point, and an e or an f or a g for the format used.
An optional c at the end leads to comma format.


The usual format is the f format, or fixed format, for example, 123.32. The
e, or exponential, format (scientific notation) is used for very large or small
numbers, for example, 1.23321e+02. The g, or general format, leads to e or f
being chosen by Stata in a way that will work well regardless of whether the
data are very large or very small. In particular, the format %#.(#-1)g will vary
the number of columns after the decimal point optimally. For example, %8.7g
will present a space followed by the first six digits of the number and the
appropriately placed decimal point.


To see the various possible formats, enter help format. 
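

The formats are easily compared by applying them to the same number with display. The following sketch uses the number 123.321 from the text; the final line uses a larger number of our own choosing to show the comma format.

```stata
* The same number under different display formats
display %7.2f 123.321        // fixed format, right justified
display %-7.2f 123.321       // fixed format, left justified
display %12.5e 123.321       // exponential format
display %8.7g 123.321        // general format
display %9.0fc 1234567       // fixed format with commas
```

The same format codes can be attached to a variable with the format command, which changes how the variable is displayed but not how it is stored.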


2.3 Inputting data 


The starting point is the computer-readable file that contains the raw data. 
Where large datasets are involved, this is typically either a text file or the 
output of another computer program, such as Excel, SAS, or even Stata. 


2.3.1 General principles 


For a discussion of initial use of Stata, see chapter 1. 


To replace any existing dataset in memory, you need to first clear the 
current dataset. 


. * Remove current dataset from memory 
. clear 


This removes data and any associated value labels from memory. If you are 
reading in data from a Stata dataset, you can instead use the clear option 
with the use command. Various arguments of clear lead to additional 
removal from memory of matrices, scalars, constraints, clusters, stored 
results, programs, frames, collections of results, and sersets and Mata 
functions, as well as to closing all open files and postfiles, clearing the class 
system, closing any open Graph windows and dialog boxes, and resetting all 
timers to zero. The clear all command removes all of these. 


Various commands are used to read in data, depending on the format of 
the file being read. These commands, discussed in detail in the rest of this 
section, include the following: 


• use to read a Stata dataset (with extension .dta)

• edit and input to enter data from the Data Editor or the keyboard

• variations of the import command, detailed below, to read data in
  nontext formats such as Microsoft Excel worksheets

• odbc to read data from Open Database Connectivity sources

• jdbc to read data using Java Database Connectivity

• import delimited to read comma-separated or tab-separated text data
  created by a spreadsheet

• infile to read unformatted (free format) separated text data

• infix to read fixed-column format text data

• infile to read fixed-column format text data (more flexible than
  infix)


As soon as data are input into Stata, you should save the data as a Stata 
dataset. For example, 


. * Save data as a Stata dataset 
. save mydata, replace 


(output omitted ) 


The replace option will replace any existing dataset with the same name. If 
you do not want this to happen, then do not use the option. 


To check that data are read in correctly, list the first few observations, 
use describe, and obtain the summary statistics. 


. * Quick check that data are read in correctly 
. list in 1/5 // List the first five observations 
(output omitted ) 


. describe // Describe the variables 
(output omitted ) 


. summarize // Descriptive statistics for the variables 
(output omitted ) 


Examples illustrating the output from describe and summarize are given in 
sections 2.4.1 and 3.2. 


2.3.2 Inputting data already in Stata format 


Data in the Stata format are stored with the .dta extension, for example, 
mydata.dta. Then the data can be read in with the use command. For 
example, 


. * Read in existing Stata dataset 
. use c:\research\mydata, clear 


The clear option removes any data currently in memory, even if the current 
data have not been saved, enabling the new file to be read into memory. 


If Stata is initiated from the current directory, then we can more simply 
type 


. * Read in dataset in current directory 
. use mydata, clear 


The use command also works over the Internet, provided that your computer 
is connected. For example, you can obtain an extract from the 1980 
U.S. Census by typing 


. * Read in dataset from an Internet website 
. use http://www.stata-press.com/data/r17/census 
(1980 Census data by state) 


2.3.3 Inputting data from the keyboard 


The input command enables data to be typed in from the keyboard. It 
assumes that data are numeric. If instead data are characters, then input 
should additionally define the data as a string and give the string length. For 
example, 


. * Data input from keyboard using input
. clear

. input str20 name age female income

                     name        age     female     income
  1. "Barry" 25 0 40.990
  2. "Carrie" 30 1 37.000
  3. "Gary" 31 0 48.000
  4. end


The quotes here are not necessary; we could use Barry rather than "Barry".
If the name includes a space, such as "Barry Jr", then double quotes are
needed; otherwise, Barry would be read as a string, and then Jr would be
read as a number, leading to a program error.


To check that the data are read in correctly, we use the list command.
Here we add the clean option, which lists the data without divider and
separator lines.


. list, clean 


name age female income 


1. Barry 25 0 40.99 
2. Carrie 30 1 37 
3. Gary 31 0 48 


2.3.4 Inputting nontext data 


By nontext data, we mean data that are stored in the internal code of a 
software package other than Stata. It is easy to establish whether a file is a 
nontext file by viewing the file using a text editor. If strange characters 
appear, then the file is a nontext file. An example is an Excel .xls file.


Stata supports several special formats. Specifically, 


• the import dbase command reads a version III or IV dBase (.dbf) file.

• the import excel command reads worksheets from Microsoft Excel
  (.xls and .xlsx) workbooks.

• the import fred command reads individual Federal Reserve Economic
  Data series.

• the import haver command reads individual Haver Analytics database
  series.

• the import sas command reads version 7 SAS (.sas7bdat) files.

• the import sasxport5 and import sasxport8 commands read SAS
  XPORT Transport format files.

• the import spss command reads IBM SPSS Statistics (.sav and .zsav) files.

• the odbc command reads Open Database Connectivity data files.

• the jdbc command reads data using Java Database Connectivity.

• the spshape2dta command for geospatial data translates .dbf and .shp
  files of a shapefile into two Stata datasets.


Here we detail the import excel command. The default is to import the 
first worksheet in the Excel file. The following example imports a selected 
sheet, here called grades, in which the sheet name needs to be provided. The 
firstrow option is used if the first row of the worksheet has variable names. 


. * Read in a specific worksheet from an Excel workbook 
. import excel myworkbook.xlsx, sheet("grades") firstrow clear 
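

When the sheet names or the location of the data within a workbook are not known in advance, it can help to inspect the workbook first. The following is a sketch only: myworkbook.xlsx and the sheet name grades are the illustrative names used above, and the cell range A1:D20 is a hypothetical choice.

```stata
* List the sheets in the workbook and the cell range each one occupies
import excel using myworkbook.xlsx, describe

* Import only part of a sheet, with variable names taken from the first row read
import excel using myworkbook.xlsx, sheet("grades") cellrange(A1:D20) firstrow clear
```

The describe option reads no data into memory, so it can be used without first clearing the current dataset.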


The commercial software package Stat/Transfer supports file conversion, 
such as from MATLAB to Stata, for many file formats. 


2.3.5 Inputting text data from a spreadsheet 


The import delimited command, which supplants the earlier insheet
command, reads data that are saved by a spreadsheet or database program as
comma-separated or tab-separated text data. For example, mus202file1.csv,
a file with comma-separated values, has the following data:


name ,age,female, income 
Barry ,25,0,40.990 
Carrie,30,1,37.000 
Gary ,31,0,48.000 


To read these data, we use import delimited. Thus, 


. * Read data from a .csv file that includes variable names using import delimited
. clear

. import delimited using mus202file1.csv
(encoding automatically selected: ISO-8859-2)
(4 vars, 3 obs)

. list, clean

name age female income

1. Barry 25 0 40.99
2. Carrie 30 1 37
3. Gary 31 0 48


Stata automatically recognized the name variable to be a string variable, the 
age and female variables to be integer, and the income variable to be 
floating point. 


A major advantage of import delimited is that it can read in a text file
that includes variable names as well as data, making mistakes less likely.
There are some limitations, however. The import delimited command is
restricted to files with a single observation per line. And the data must be
comma-separated or tab-separated, but not both. It cannot be
space-separated, but other delimiters can be specified by using the delimiter
option.


The first line with variable names is optional. Let mus202file2.csv be 
the same as the original file, except without the header line: 


Barry ,25,0,40.990 
Carrie,30,1,37.000 
Gary ,31,0,48.000 


The import delimited command still works. By default, the variables read 
in are given the names v1, v2, v3, and v4. Alternatively, you can assign more 
meaningful names in import delimited. For example, 


. * Read data from a .csv file without variable names, and assign names
. clear

. import delimited name age female income using mus202file2.csv
(encoding automatically selected: ISO-8859-2)
(4 vars, 3 obs)


2.3.6 Inputting text data in free format 


The infile command reads free-format text data that are space-separated, 
tab-separated, or comma-separated. 


We again consider mus202file2.csv, which has no header line. Then


. * Read data from free-format text file using infile
. clear

. infile str20 name age female income using mus202file2.csv
(3 observations read) 


. list, clean 


name age female income 


1. Barry 25 0 40.99 
2. Carrie 30 1 37 
3. Gary 31 0 48 


By default, infile reads in all data as numbers that are stored as floating
point. This causes obvious problems if the original data are string. By
inserting str20 before name, the first variable is instead a string that is stored
as a string of at most 20 characters.


For infile, a single observation is allowed to span more than one line,
or there can be more than one observation per line. Essentially, every fourth
entry after Barry will be read as a string entry for name, every fourth entry
after 25 will be read as a numeric entry for age, and so on.


The infile command is the most flexible command to read in data and 
will also read in fixed-format data. 


2.3.7 Inputting text data in fixed format using infix 


The infix command reads fixed-format text data that are in fixed-column 
format. For example, suppose mus202file3.txt contains the same data as 
before, except without the header line and with the following fixed format: 


Barry 250 40.990 
Carrie 301 37.000 
Gary 310 48.000 


Here columns 1-10 store the name variable, columns 11-12 store the age
variable, column 13 stores the female variable, and columns 14-20 store the
income variable.


Note that a special feature of fixed-format data is that there need be no
separator between data entries. For example, for the first observation, the
sequence 250 is not an age of 250 but is instead two variables: age = 25 and
female = 0. It is easy to make errors when reading fixed-format data.


To use infix, we need to define the columns in which each entry 
appears. There are a number of ways to do this. For example, 


. * Read data from fixed-format text file using infix 
. clear 


. infix str20 name 1-10 age 11-12 female 13 income 14-20 using mus202file3.txt 
(3 observations read) 


. list, clean 


name age female income 


1. Barry 25 0 40.99 
2. Carrie 30 1 37 
3. Gary 31 0 48 


As with infile, we include str20 to indicate that name is a string rather than 
a number. 


A single observation can appear on more than one line. Then we use the 
symbol / to skip a line or use the entry 2:, for example, to switch to line 2. 
For example, suppose mus202file4.txt is the same as mus202file3.txt, 
except that income appears on a separate second line for each observation in 
columns 1-7. Then, 


. * Read data using infix where an observation spans more than one line 
. clear 


. infix str20 name 1-10 age 11-12 female 13 2: income 1-7 using mus202file4.txt 
(3 observations read) 


2.3.8 Inputting text data in fixed format using infile and a dictionary 


For simple fixed-format text datasets, the infix command is adequate. For
more complicated fixed-format text datasets, the format for the data being
read in can be stored in a dictionary file, which is a text file created with
a word processor or a text editor. Details are provided in [D] infile (fixed
format). Suppose this file is called mus202dict.dct and the data are in file
mus202file3.txt. Then we type


. * Read in data with dictionary file 
. infile using mus202dict.dct, using(mus202file3.txt) 


where the dictionary file mus202dict.dct provides variable names and 
formats. 
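

As an illustration, a dictionary file for the layout described in section 2.3.7 might look like the following sketch. Its contents are an assumption made for this example, because the actual mus202dict.dct is not reproduced here; the column positions follow the layout given for mus202file3.txt, and the variable labels are hypothetical.

```stata
dictionary {
* Hypothetical contents of mus202dict.dct (not taken from the book's file)
    _column(1)   str10  name    %10s   "Name"
    _column(11)  byte   age     %2f    "Age in years"
    _column(13)  byte   female  %1f    "1 if female"
    _column(14)  float  income  %7f    "Income"
}
```

Each line gives the starting column, the storage type, the variable name, the input format, and an optional variable label. Because this sketch does not name the data file inside the dictionary, the data file is supplied through the using() option of infile.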


2.3.9 Common pitfalls 


It can be surprisingly difficult to read in data. With fixed-format data, wrong 
column alignment leads to errors. Data can unexpectedly include string data, 
perhaps with embedded blanks. Missing values might be coded as not 
applicable, causing problems if a numeric value is expected. An observation 
can span several lines when a single line was erroneously assumed. 


It is possible to read a dataset into Stata without Stata issuing an error
message; the absence of an error message does not mean that the dataset has
been successfully read in. For example, transferring data from one computer
type to another, such as a file transfer using File Transfer Protocol, can
lead to an additional carriage return, or Enter, being added at the end of
each line. Then infix reads the dataset as containing one line of data,
followed by a blank line, then another line of data, and so on. The blank
lines generate extraneous observations with missing values.


You should always perform checks, such as using list and summarize. 
Always view the data before beginning analysis. 
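

A minimal set of checks, run immediately after reading in the data, could look like the following sketch. It assumes the variables name, age, female, and income from the infix example in section 2.3.7; the variable nmiss is a hypothetical helper introduced only for this illustration.

```stata
* Basic checks after reading in a text file
describe                  // variable names and storage types
count                     // number of observations actually read
summarize                 // implausible ranges suggest misaligned columns
list, clean               // view the data (restrict with in for large files)

* Flag observations that are empty on every variable, such as those
* produced by blank lines in the text file
egen nmiss = rowmiss(age female income)
count if nmiss == 3 & name == ""
```

If the final count is not zero, the text file most likely contains blank lines or unexpected line breaks, and it is worth inspecting the file in a text editor before proceeding.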


2.3.10 Outputting data 


Stata datasets can be output to other formats. Text data files can be created
using the export delimited command, for data separated using a comma or
other separator, or the more flexible outfile command (see section 2.4.8),
which creates both separated data files and fixed-format data files.


Nontext special format data files can be created using the commands export
dbase, export excel, odbc, jdbc, export sasxport5, and export
sasxport8.


2.4 Data management 


Once the data are read in, there can be considerable work in cleaning up the 
data, transforming variables, and selecting the final sample. All data 
management tasks should be recorded, dated, and saved. The existence of 
such a record makes it easier to track changes in definitions and eases the 
task of replication. By far, the easiest way to do this is to have the data 
management manipulations stored in a do-file rather than to use commands 
interactively. We assume that a do-file is used. 


2.4.1 Panel Study of Income Dynamics example 


Data management is best illustrated using a real-data example. Typically, 
one needs to download the entire original dataset and an accompanying 
document describing the dataset. For some major commonly used datasets, 
however, there may be cleaned-up versions of the dataset, simple data- 
extraction tools, or both. 


Here we obtain a very small extract from the 1992 Individual-Level data 
from the Panel Study of Income Dynamics (PSID), a U.S. longitudinal survey
conducted by the University of Michigan. The extract was downloaded from 
the Data Center at the website https://psidonline.isr.umich.edu/, using 
interactive tools to select just a few variables. The extracted sample was 
restricted to men aged 30-50 years. The output conveniently included a Stata 
do-file in addition to the text data file. Additionally, a codebook describing 
the variables selected was provided. The data download included several 
additional variables that enable unique identifiers and provide sample 
weights. These should also be included in the final dataset but, for brevity, 
have been omitted below. 


Reading the text dataset mus202psid92m.txt using a text editor reveals
that the first two observations are


4^ 3^ 1^ 2^ 1^ 2482^ 1^ 10^ 40^ 9^ 22000^ 2340
4^ 170^ 1^ 2^ 1^ 6974^ 1^ 10^ 37^ 12^ 31468^ 2008


The data are text data delimited by the symbol ^. 


Several methods could be used to read the data, but the simplest is to use 
import delimited. This is especially simple given the provided do-file. The 
mus202psid92m.do file contains the following information: 


. * Commands to read in data from PSID extract in a delimited file
. type mus202psid92m.do
* mus202psid92m.do
clear
#delimit ;
* PSID DATA CENTER *****************************************************
   JOBID            : 10654
   DATA_DOMAIN      : PSID
   USER_WHERE       : ER32000=1 and ER30736 ge 30 and ER
   FILE_TYPE        : All Individuals Data
   OUTPUT_DATA_TYPE : ASCII Data File
   STATEMENTS       : STATA Statements
   CODEBOOK_TYPE    : PDF
   N_OF_VARIABLES   : 12
   N_OF_OBSERVATIONS: 4290
   MAX_REC_LENGTH   : 56
   DATE & TIME      : November 3, 2003 @ 0:28:35
************************************************************************
;
import delimited
   er30001 er30002 er32000 er32022 er32049 er30733 er30734 er30735 er30736
   er30748 er30750 er30754
using mus202psid92m.txt, delim("^") clear
;
destring, replace ;
label variable er30001 "1968 INTERVIEW NUMBER" ;
label variable er30002 "PERSON NUMBER 68" ;
label variable er32000 "SEX OF INDIVIDUAL" ;
label variable er32022 "# LIVE BIRTHS TO THIS INDIVIDUAL" ;
label variable er32049 "LAST KNOWN MARITAL STATUS" ;
label variable er30733 "1992 INTERVIEW NUMBER" ;
label variable er30734 "SEQUENCE NUMBER 92" ;
label variable er30735 "RELATION TO HEAD 92" ;
label variable er30736 "AGE OF INDIVIDUAL 92" ;
label variable er30748 "COMPLETED EDUCATION 92" ;
label variable er30750 "TOT LABOR INCOME 92" ;
label variable er30754 "ANN WORK HRS 92" ;
#delimit cr; // Change delimiter to default cr


To read the data, we need only import delimited. The code separates 
commands using the delimiter ; rather than the default cr (the Enter key or 
carriage return) to enable comments and commands that span several lines. 


The destring command, unnecessary here, converts string data that contain
numbers into numeric data. When the characters to be removed, such as the
dollar sign and the comma, are listed in its ignore() option, $1,234 would
become 1234. The label variable command provides a longer description of the
data that will be reproduced by using describe.
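

The behavior of destring on strings that contain nonnumeric characters can be seen in a small made-up example. The variables income_str, income_try, and income are hypothetical and are not part of the PSID extract.

```stata
* Hypothetical example: a string variable holding numbers with commas
clear
set obs 2
generate str8 income_str = "1,234" in 1
replace income_str = "567" in 2

* Without ignore(), destring finds nonnumeric characters and generates nothing
destring income_str, generate(income_try)

* With ignore(","), the commas are removed and a numeric variable is created
destring income_str, generate(income) ignore(",")
list, clean
```

Using generate() rather than replace keeps the original string variable, so the conversion can be checked against it.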


Executing this code yields output that includes the following: 


(12 vars, 4290 obs) 
destring, replace ; 

er30001 already numeric; no replace 
(output omitted ) 

er30754 already numeric; no replace 


The statement already numeric is output for all variables because all the
data in mus202psid92m.txt are numeric.


The describe command provides a description of the data: 


. * Data description 
. describe 


Contains data
 Observations:         4,290
    Variables:            12

Variable      Storage   Display    Value
    name         type    format    label      Variable label
er30001         int     %8.0g                 1968 INTERVIEW NUMBER
er30002         int     %8.0g                 PERSON NUMBER 68
er32000         byte    %8.0g                 SEX OF INDIVIDUAL
er32022         byte    %8.0g                 # LIVE BIRTHS TO THIS INDIVIDUAL
er32049         byte    %8.0g                 LAST KNOWN MARITAL STATUS
er30733         int     %8.0g                 1992 INTERVIEW NUMBER
er30734         byte    %8.0g                 SEQUENCE NUMBER 92
er30735         byte    %8.0g                 RELATION TO HEAD 92
er30736         byte    %8.0g                 AGE OF INDIVIDUAL 92
er30748         byte    %8.0g                 COMPLETED EDUCATION 92
er30750         long    %12.0g                TOT LABOR INCOME 92
er30754         int     %8.0g                 ANN WORK HRS 92

Sorted by:
     Note: Dataset has changed since last saved.


The summarize command provides descriptive statistics: 


. * Data summary
. summarize

    Variable         Obs        Mean    Std. dev.       Min        Max
     er30001       4,290      4559.2    2850.509          4       9308
     er30002       4,290    60.66247    79.93979          1        227
     er32000       4,290           1           0          1          1
     er32022       4,290    21.35385    38.20765          1         99
     er32049       4,290    1.699534    1.391921          1          9
     er30733       4,290    4911.015      2804.8          1       9829
     er30734       4,290    3.179487     11.4933          1         81
     er30735       4,290    13.33147    12.44482         10         98
     er30736       4,290    38.37995    5.650311         30         50
     er30748       4,290    14.87249    15.07546          0         99
     er30750       4,290    27832.68    31927.35          0     999999
     er30754       4,290    1929.477    899.5496          0       5840


Satisfied that the original data have been read in carefully, we proceed 
with cleaning the data. 


2.4.2 Naming and labeling variables 


Just as the Data Editor can be used to input and manage data, the Variables 
Manager can be used to manage the properties of variables, such as their 
names and labels. We use Stata commands below to rename and label 
variables, but we could also have used the Variables Manager. 


The first step is to give more meaningful names to variables by using the
rename command. We do so just for the variables used in subsequent
analysis.


. * Rename variables
. rename er32000 sex
. rename er30736 age
. rename er30748 education
. rename er30750 earnings
. rename er30754 hours


The renamed variables retain the descriptions that they were originally
given. Some of these descriptions are unnecessarily long, so we use label
variable to shorten output from commands, such as describe, that give the
variable labels.


. * Relabel some of the variables 
. label variable age "Age of individual" 


. label variable education "Completed education" 
. label variable earnings "Total labor income" 


. label variable hours "Annual work hours" 


For categorical variables, it can be useful to explain the meanings of the
values. For example, from the codebook discussed in section 2.4.4, the
er32000 variable takes on the value 1 if male and 2 if female. We may prefer
that the output of variable values uses a label in place of the number. These
labels are provided by using label define together with label values.


. * Define the label gender for the values taken by variable sex
. label define gender 1 male 2 female


. label values sex gender 
. list sex in 1/2, clean 


sex 
1. male 
2. male 


After renaming, we obtain 


. * Data summary of key variables after renaming
. summarize sex age education earnings hours

    Variable         Obs        Mean    Std. dev.       Min        Max
         sex       4,290           1           0          1          1
         age       4,290    38.37995    5.650311         30         50
   education       4,290    14.87249    15.07546          0         99
    earnings       4,290    27832.68    31927.35          0     999999
       hours       4,290    1929.477    899.5496          0       5840


Data exist for these variables for all 4,290 sample observations. The data
have 30 ≤ age ≤ 50 and sex = 1 (male) for all observations, as expected.
The maximum value for earnings is $999,999, an unusual value that most
likely indicates top-coding. The maximum value of hours is quite high and
may also indicate top-coding (365 × 16 = 5,840). The maximum value of 99
for education is clearly erroneous; the most likely explanation is that this
is a missing-value code because numbers such as 99 or -99 are often used to
denote a missing value.


2.4.3 Viewing data 


The standard commands for viewing data are summarize, list, and 
tabulate. 


We have already illustrated the summarize command. Additional 
statistics, including key percentiles and the five largest and smallest 
observations, can be obtained by using the detail option; see section 3.2.4. 


The list command can list every observation, too many in practice. But 
you could list just a few observations: 


. * List first 2 observations of two of the variables
. list age hours in 1/2, clean 


age hours 
1. 40 2340 
2. 37 2008 


The list command with no variable list provided will list all the variables.
The clean option eliminates dividers and separators. 


The format command sets the display format of one or more variables. 
For example, 


. * Change display format of variable hours 
. format %415.2f hours 


. list age hours in 1/2, clean 


age hours 
1. 40 2340.00 
2. 37 2008.00 


The count command counts the number of observations satisfying a 
given criterion. For example, to count the number of persons with less than 
12 years of education, we type 


. * Count number of observations satisfying a given condition
. count if education < 12
  934


The tabulate command lists each distinct value of the data and the
number of times it occurs. It is useful for data that do not have too many
distinct values. For education, we have


. * Tabulate all values taken by a single variable 
. tabulate education 


  Completed
  education       Freq.     Percent        Cum.
          0          82        1.91        1.91
          1           7        0.16        2.07
          2          20        0.47        2.54
          3          32        0.75        3.29
          4          26        0.61        3.89
          5          30        0.70        4.59
          6         123        2.87        7.46
          7          35        0.82        8.28
          8          78        1.82       10.09
          9         117        2.73       12.82
         10         167        3.89       16.71
         11         217        5.06       21.77
         12       1,510       35.20       56.97
         13         263        6.13       63.10
         14         432       10.07       73.17
         15         172        4.01       77.18
         16         535       12.47       89.65
         17         317        7.39       97.04
         99         127        2.96      100.00

      Total       4,290      100.00


Note that the variable label rather than the variable name is used as a header. 
The values are generally plausible, with 35% of the sample having a highest 
grade completed of exactly 12 years (high school graduate). The 7% of 
observations with 17 years most likely indicates a postgraduate degree (a 
college degree is only 16 years). The value 99 for 3% of the sample most 
likely is a missing-data code. Surprisingly, 2% appear to have completed no 
years of schooling. As we explain next, these are also observations with 
missing data. 


2.4.4 Using original documentation 


At this stage, it is really necessary to go to the original documentation. 


The mus202psid92mcb.pdf file, generated as part of the data extraction 
from the PSID website, states that for the er30748 variable a value of 0 means 
“inappropriate” for various reasons given in the codebook; the values 1-16
are the highest grade or year of school completed; 17 is at least some 
graduate work; and 99 denotes not applicable or did not know. 


Clearly, the education values of both 0 and 99 denote missing values. 
Without using the codebook, we may have misinterpreted the value of 0 as 
meaning 0 years of schooling. 


2.4.5 Missing values 


It is best at this stage to flag missing values and to keep all observations 
rather than to immediately drop observations with missing data. In later 
analysis, only those observations with data missing on variables essential to 
the analysis need to be dropped. The characteristics of individuals with 
missing data can be compared with those having complete data. Data with a 
missing value are recoded with a missing-value code. 


For education, the missing-data values 0 or 99 are replaced by . (a 
period), which is the default Stata missing-value code. Rather than create a 
new variable, we modify the current variable by using replace, as follows: 


. * Replace missing values with missing-data code 
. replace education = . if education == 0 | education == 99 
(209 real changes made, 209 to missing) 


Using the double equality and the symbol | for the logical operator or is 
detailed in section 1.3.6. As an example of the results, we list observations 
46-48: 


. * Listing of variable including missing value 
. list education in 46/48, clean 


educat~n 
46. 12 
47. . 
48. 16 


Evidently, the original data on education for the 47th observation equaled 0 
or 99. This has been changed to missing. 


Subsequent commands using the education variable will drop 
observations with missing values. For example, 


. * Example of data analysis with some missing values
. summarize education age

    Variable         Obs        Mean    Std. dev.       Min        Max
   education       4,081     12.5533    2.963696          1         17
         age       4,290    38.37995    5.650311         30         50


For education, only the 4,081 nonmissing values are used, whereas for age, 
all 4,290 of the original observations are available. 


If desired, you can use more than one missing-value code. This can be
useful if you want to keep track of reasons why a variable is missing. The
extended missing codes are .a, .b, ..., .z. For example, we could instead
have typed


. * Assign more than one missing code
. replace education = .a if education == 0
. replace education = .b if education == 99


When we want to apply multiple missing codes to a variable, it is more
convenient to use the mvdecode command, which changes variable values or
ranges of values into missing-value codes and is similar to the recode
command (discussed in section 2.4.7). The reverse command, mvencode,
changes missing values to numeric values.
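

For instance, the two replace commands for education could be written as a single mvdecode command. The sketch below assumes it is applied to education while it still holds the original codes 0 and 99, that is, before any recoding to missing.

```stata
* Recode 0 and 99 to distinct extended missing codes in one step
mvdecode education, mv(0=.a \ 99=.b)

* The reverse operation restores the original numeric codes
mvencode education, mv(.a=0 \ .b=99)
```

Keeping separate codes makes it possible to tabulate later how many observations are missing for each reason, for example, with tabulate education, missing.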


Care is needed when missing values are used. In particular, missing 
values are treated as large numbers, higher than any other number. The 
ordering is that all numbers are less than ., which is less than .a, and so on. 
The command 


. * This command will include missing values
. list education in 40/60 if education > 16, clean


educat~n 
45. 17 
47. . 
60. 17 


lists the missing value for observation 47 in addition to the two values of 17. 
If this is not desired, we should instead use 


. * This command will not include missing values 
. list education in 40/60 if education > 16 & !missing(education), clean 


educat~n
45. 17 
60. 17 


Now observation 47 with the missing value has been excluded.


The issue of missing values also arises for earnings and hours. From 
the codebook, we see that a zero value may mean missing for various 
reasons, or it may be a true zero if the person did not work. True zeros are 
indicated by er30749=0 or 2, but we did not extract this variable. Thus, it is 
not unusual to have to extract data several times. Rather than extract this 
additional variable, as a shortcut we note that earnings and hours are 
missing for the same reasons that education is missing. Thus, 


. * Replace missing values with missing-data code 
. replace earnings = . if missing(education) 
(209 real changes made, 209 to missing) 


. replace hours = . if missing(education) 
(209 real changes made, 209 to missing) 


2.4.6 Imputing missing data 


The standard approach in microeconometrics is to drop observations with 
missing values, called listwise deletion. The loss of observations generally 
leads to less precise estimation and inference. More importantly, it may lead 
to sample-selection bias in regression if the retained observations have 
unrepresentative values of the dependent variable conditional on regressors; 
see section 19.10. 


An alternative to dropping observations is to impute missing values. The 
norm in microeconometrics studies is to use only the original data. The 
ipolate command uses linear interpolation. The community-contributed 
hotdeck command (Mander and Clayton 1999) implements hotdeck 
imputation. A more promising approach, though one more advanced, is 
multiple imputation. This produces M different imputed datasets (for 
example, M = 20), fits the model M times, and performs inference that 
allows for the uncertainty in both estimation and data imputation. Multiple
imputation using the mi impute command is presented in section 30.5.


2.4.7 Transforming data (generate, replace, egen, recode) 


After handling missing values, we have the following for the key variables: 


. * Summarize cleaned-up data
. summarize sex age education earnings

    Variable         Obs        Mean    Std. dev.       Min        Max
         sex       4,290           1           0          1          1
         age       4,290    38.37995    5.650311         30         50
   education       4,081     12.5533    2.963696          1         17
    earnings       4,081    28706.65    32279.12          0     999999


We now turn to recoding existing variables and creating new variables. 
The basic commands are generate and replace. It can be more convenient, 
however, to use the additional commands recode, egen, and tabulate. 
These are often used in conjunction with the if qualifier and the by prefix. 
We present many examples throughout the book. 


The generate and replace commands


The generate command is used to create new variables, often using
standard mathematical functions. The syntax of the command is


generate [type] newvar[:lblname] =exp [if] [in] [, before(varname) | after(varname)]


where for numeric data the default type is float, but this can be changed, 
for example, to double. 


It is good practice to assign a unique identifier to each observation if one 
does not already exist. A natural choice is to use the current observation 
number stored as the system variable _n. 


. * Create identifier using generate command 
. generate id = _n 


We use this identifier for simplicity, though for these data the er30001 and 
er30002 variables when combined provide a unique PSID identifier. 
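

Whether a set of variables uniquely identifies the observations can be checked directly rather than assumed. The following sketch uses the original PSID variable names; psid_id is a hypothetical name for a combined identifier.

```stata
* Check that the two PSID variables jointly identify observations
isid er30001 er30002               // stops with an error if not unique
duplicates report er30001 er30002  // tabulates any repeated pairs

* Optionally build a single numeric identifier from the pair
egen long psid_id = group(er30001 er30002)
```

An identifier taken from the source data has the advantage over _n that it does not depend on the current sort order and can be used to merge in variables from later extracts.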


The following command creates a new variable for the natural logarithm 
of earnings: 


. * Create new variable using generate command
. generate lnearns = ln(earnings)
(498 missing values generated)


Missing values for ln(earnings) are generated whenever earnings data are
missing. Additionally, missing values arise when earnings ≤ 0 because it
is then not possible to take the logarithm. Here the 498 missing values
comprise the 209 observations with missing earnings and the 289 observations
with zero earnings.


The replace command is used to replace some or all values of an 
existing variable. We already illustrated this when we created missing-values 
codes. 


The egen command 


The egen command is an extension to generate that enables creation of 
variables that would be difficult to create using generate. For example, 
suppose we want to create a variable that for each observation equals sample 
average earnings provided that sample earnings are nonmissing. The 
command 


. * Create new variable using egen command
. egen aveearnings = mean(earnings) if !missing(earnings)
(209 missing values generated) 


creates a variable equal to the average of earnings for those observations not 
missing data on earnings. 


The egen command supports many functions; see help egen. It is often 
used in conjunction with the by prefix. 


The recode command 


The recode command is an extension to replace that recodes categorical
variables and generates a new variable if the generate() option is used. The
command


. * Replace existing data using the recode command
. recode education (1/11=1) (12=2) (13/15=3) (16/17=4), generate(edcat) 
(4074 differences between education and edcat) 


creates a new variable, edcat, that takes on a value of 1, 2, 3, or 4 
corresponding to, respectively, less than high school graduate, high school 
graduate, some college, and college graduate or higher. The edcat variable 
is set to missing if education does not lie in any of the ranges given in the 
recode command. 
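

The recode command can also attach value labels to the new categories when the generate() option is used. The following sketch is a variation on the command above; the variable name edcat_lbl and the label text are our own choices for illustration.

```stata
* Recode education and label the resulting categories in one command
recode education (1/11=1 "Less than high school")    ///
                 (12=2 "High school graduate")       ///
                 (13/15=3 "Some college")            ///
                 (16/17=4 "College graduate or more"), generate(edcat_lbl)

* Check the result, including any values left as missing
tabulate edcat_lbl, missing
```

Tabulating with the missing option shows at a glance whether any education values fell outside the stated ranges.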


The decode command 


The decode command converts numeric data to string data, where the string 
values are those given in a preexisting label. 


The numeric variable sex has label gender. We create a string variable 
named str_sex as follows: 


. * Convert numeric variable to string using decode and preexisting label
. list sex in 1, nolabel clean // List the numeric value

       sex
  1.     1

. list sex in 1, clean // List the associated label

        sex
  1.   male

. decode sex, generate(str_sex) // Make a new string variable

. list str_sex in 1, clean // List the new string variable

       str_sex
  1.      male


The first two list commands show that the first observation of the numeric
variable sex had value 1 and label value male. The decode command then 
created a string variable whose values are determined by whatever label has 
previously been assigned to sex. This yields a variable whose first 
observation is male. 


The encode command 


The encode command converts string data to numeric data and creates an 
associated label whose values are those of the original string variable. 


We convert the just created string variable str_sex to a numeric 
variable. 


. * Convert string variable to numeric with label using encode
. encode str_sex, generate(num_sex) // Make a new numeric variable

. list num_sex in 1, nolabel clean // List the numeric value

       num_sex
  1.         1

. list num_sex in 1, clean // List the associated label

       num_sex
  1.      male


The value label for the newly created numeric variable num_sex is also given
the name num_sex.


The by prefix 


The by varlist: prefix repeats a command for each group of observations for 
which the variables in varlist are the same. The data must first be sorted by 
varlist. This can be done by using the sort command, which orders the 
observations in ascending order according to the variables given in the 
command. 


The sort command and the by prefix are more compactly combined into 
the bysort prefix. For example, suppose we want to create for each 
individual a variable that equals the sample average earnings for all persons 
with that individual’s years of education. Then we type 


. * Create new variable using bysort prefix
. bysort education: egen aveearnsbyed = mean(earnings)
(209 missing values generated)


. sort id


The final command, which restores the original ordering of the
observations, is not required. But it could make a difference in
subsequent analysis if, for example, we were to work with a subsample of 
the first 1,000 observations. 


Indicator variables 


Consider creating a variable indicating whether earnings are positive. While 
there are several ways to proceed, we describe only our recommended 
method. 


The most direct way is to use generate with logical operators: 


. * Create indicator variable using generate command with logical operators
. generate d1 = earnings > 0 if !missing(earnings)
(209 missing values generated)


The expression d1 = earnings > 0 creates an indicator variable equal to 1
if the condition holds and 0 otherwise. Because missing values are treated as
large numbers, we add the condition if !missing(earnings) so that in
those cases d1 is set equal to missing.
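

To see why the if qualifier matters, consider what happens when it is omitted. The sketch below creates a deliberately incorrect indicator, d1_wrong (a hypothetical name), and compares it with d1.

```stata
* Without the if qualifier, missing earnings are treated as larger than 0,
* so the indicator is wrongly set to 1 for those observations
generate d1_wrong = earnings > 0

* Cross-tabulate the two indicators, showing missing values of d1
tabulate d1_wrong d1, missing

* Number of observations wrongly coded as having positive earnings
count if d1_wrong == 1 & missing(earnings)
```

Given the 209 missing values of earnings reported above, the final count should equal 209.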


Using summarize, we obtain 


. summarize d1

    Variable         Obs        Mean    Std. dev.       Min        Max
          d1       4,081     .929184    .2565486          0          1


We can see that about 93% of the individuals in this sample had some
earnings in 1992. We can also see that we have 0.929184 × 4,081 = 3,792
observations with a value of 1, 289 observations with a value of 0, and 209
missing observations.


Set of indicator variables 


A complete set of mutually exclusive indicator (dummy) variables
can be created in several ways.


For example, suppose we want to create mutually exclusive indicator
variables for less than high school graduate, high school graduate, some
college, and college graduate or more. The starting point is the edcat
variable, created earlier, which takes on the values 1-4.


We can use tabulate with the generate() option.


. * Create a set of indicator variables using tabulate with generate() option
. quietly tabulate edcat, generate(eddummy)

. summarize eddummy*

    Variable         Obs        Mean    Std. dev.       Min        Max
    eddummy1       4,081    .2087724    .4064812          0          1
    eddummy2       4,081    .3700074    .4828655          0          1
    eddummy3       4,081    .2124479    .4090902          0          1
    eddummy4       4,081    .2087724    .4064812          0          1


The four means sum to one, as expected for four mutually exclusive
categories. Note that if edcat had taken on values 4, 5, 7, and 9, rather than
1-4, it would still generate variables numbered eddummy1-eddummy4.


It is usually not necessary to actually create a set of indicator variables. 
Instead, we can include factor variables in the variable list; see section 1.3.4. 
For example, 


. * Set of indicator variables using factor variables - no category is omitted
. summarize i.edcat

    Variable         Obs        Mean    Std. dev.       Min        Max
       edcat
          1        4,081    .2087724    .4064812          0          1
          2        4,081    .3700074    .4828655          0          1
          3        4,081    .2124479    .4090902          0          1
          4        4,081    .2087724    .4064812          0          1


Almost all commands with a variable list permit use of factor variables 
in the variable list. Exceptions include a few estimation commands such as 
the exlogistic command. In such cases, the older xi prefix can be used 
instead. For details, type help xi. 
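

In an estimation command, factor variables omit a base category automatically. The following sketch is our own illustration using variables created earlier in this section; the choice of regressors is arbitrary.

```stata
* Factor variables in a regression: the base category (edcat = 1) is omitted
regress lnearns i.edcat age

* Choose a different base category, here high school graduate (edcat = 2)
regress lnearns ib2.edcat age
```

The ib2. operator changes only which category serves as the reference group; the fitted values are the same in the two regressions.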


Interactions 


Interactive variables can be created in the obvious manner. For example, to
create an interaction between the binary earnings indicator d1 and the
continuous variable education, type


. * Create interactive variable using generate commands
. generate d1education = d1*education
(209 missing values generated)


Rather than create the interactive variables, we can use factor-variable 
operators. For example, 


. * Set of interactions using factor-variable operators
. summarize i.edcat#c.earnings

    Variable         Obs        Mean    Std. dev.       Min        Max
      edcat#
  c.earnings
          1        4,081    3146.368    8286.325          0      80000
          2        4,081    8757.823    15710.76          0     215000
          3        4,081    6419.347    16453.14          0     270000
          4        4,081    10383.11    32316.32          0     999999


Here the # operator is used to create interactions, the i. operator is applied 
to a categorical variable, and the c. operator is for a continuous variable. 


We can similarly use factor-variable operators to create interactions 
between categorical variables and to create interactions between continuous 
variables; see section 1.3.4. 


Factor-variable operators enable one to obtain marginal effects in 
regression models with interactions using the margins command. 
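

As a sketch of this use, consider a regression in which the effect of age is allowed to differ across education categories. The model specification is our own illustration and is not taken from a later chapter.

```stata
* The ## operator includes main effects and the interaction
regress lnearns i.edcat##c.age

* Average marginal effect of age over the estimation sample
margins, dydx(age)

* Marginal effect of age evaluated separately for each education category
margins edcat, dydx(age)
```

Because the interaction is declared with factor-variable operators, margins knows that age enters the model in several terms. With an interaction variable created by generate, that link would be lost.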


Demeaning 


Suppose we want to include a quadratic in age as a regressor. The marginal
effect of age is much easier to interpret if we use the demeaned variables
$(\text{age} - \overline{\text{age}})$ and $(\text{age} - \overline{\text{age}})^2$ as regressors.


. * Create demeaned variables
. egen double aveage = mean(age)
. generate double agedemean = age - aveage
. generate double agesqdemean = agedemean^2

. summarize agedemean agesqdemean

    Variable         Obs        Mean    Std. dev.       Min        Max
   agedemean       4,290    2.32e-15    5.650311  -8.379953   11.62005
 agesqdemean       4,290    31.91857    32.53392   .1443646   135.0255


We expect the agedemean variable to have an average of zero. We
specified double to obtain additional precision in the floating-point
calculations. In the case at hand, the mean of agedemean is on the order of
$10^{-15}$ instead of $10^{-6}$, which is what single-precision calculations
would yield.
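

To illustrate the interpretation, the sketch below fits a regression on the demeaned variables and then fits the equivalent quadratic in age. The specification is our own, and the local macro agebar is a hypothetical helper.

```stata
* With demeaned regressors, the coefficient of agedemean is the marginal
* effect of age evaluated at the sample mean of age
regress lnearns agedemean agesqdemean

* Equivalent model in age itself, with the marginal effect evaluated
* at the same mean age used to construct agedemean
summarize age, meanonly
local agebar = r(mean)
regress lnearns c.age##c.age
margins, dydx(age) at(age = `agebar')
```

The marginal effect reported by margins should closely match the coefficient of agedemean in the first regression, because the derivative of the quadratic with respect to age reduces to that coefficient when age equals its mean.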


2.4.8 Saving and exporting data 


At this stage, the dataset may be ready for saving. For illustrative purposes, 
we will name the dataset cleaneddata.dta. Apart from a few variable 
labels, it is identical to mus202psid92m.dta, which is used in the book from 
section 2.5 on. 


The save command creates a Stata data file. For example, 


. * Save as Stata data file (also used in semiparametric regression chapter) 
. save cleaneddata, replace 

(file cleaneddata.dta not found) 

file cleaneddata.dta saved 


The replace option means that an existing dataset with the same name, if it 
exists, will be overwritten. The .dta extension is unnecessary because it is 
the default extension. 


Stata version 17 datasets can be read using version 16 or version 15. 
They can also be read by version 14, provided there are fewer than 32,768 
variables, but cannot be read by earlier versions of Stata. The saveold 
command saves a data file that can be read by even earlier versions of Stata. 
The saveold command can save for versions 11-13. For example,


. * Save as Stata data file readable by versions 12 and above
. saveold cleaneddata, version(12) replace
(saving in Stata 12 format, which can be read by Stata 11 or 12) 
file cleaneddata.dta saved 


The output indicates that this dataset can actually be read by versions 11 and 
above. 


The data can also be saved in another format that can be read by 
programs other than Stata; see section 2.3.10 for a list of formats. 


The export delimited command allows saving data as a delimited text 
file in a spreadsheet format. The default is a comma-separated file. 


. * Save as comma-separated values spreadsheet
. export delimited age education eddummy* earnings d1 hours
> using cleaneddata.csv, replace
(file cleaneddata.csv not found)
file cleaneddata.csv saved


The wildcard * in eddummy* expands to eddummy1-eddummy4 per the
rules for wildcards, given in section 1.3.5. The first two lines in
cleaneddata.csv are


age,education,eddummy1,eddummy2,eddummy3,eddummy4,earnings,d1,hours
40,9,1,0,0,0,22000,1,2340


A space-delimited formatted text file can also be created by using the 
out file command: 


. * Save as formatted text (ascii) file 
. outfile age education eddummy* earnings di hours using cleaneddata.asc, replace 
(file cleaneddata.asc not found) 


The first line in cleaneddata.asc is then 


40 9 1 0 0 0 22000 1 2340


This file will take up a lot of space; less space is taken if the comma option is 
used. The dictionary option writes the file in Stata’s dictionary format, 
which includes at the start of the file a description of the formatting. 


2.4.9 Selecting the sample 


Most commands will automatically drop missing values in implementing a 
given command. We may want to drop additional observations, for example, 
to restrict analysis to a particular age group. 


This can be done by adding an appropriate if qualifier after the 
command. For example, if we want to summarize data for only those 
individuals 35-44 years old, then we type


. * Select the sample used in a single command using the if qualifier
. summarize earnings lnearns if age >= 35 & age <= 44


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    earnings |      2,114    30131.05    37660.11          0     999999
     lnearns |      1,983    10.04658    .9001594   4.787492   13.81551


Different samples are being used here for the two variables because for the
131 observations with 0 earnings, we have data on earnings but not on
lnearns. The if qualifier uses logical operators, defined in section 1.3.6.


However, for most purposes, we would want to use a consistent sample. 
For example, if separate earnings regressions were run in levels and in logs, 
we would usually want to use the same sample in the two regressions. 


The drop and keep commands allow sample selection for the rest of the 
analysis. The keep command explicitly selects the subsample to be retained. 
Alternatively, we can use the drop command, in which case the subsample 
retained is the portion not dropped. The sample dropped or kept can be 
determined by using an if qualifier or a variable list or by defining a range 
of observations. 


For the current example, we use 


. * Select the sample using command keep
. keep if (!missing(lnearns)) & (age >= 35 & age <= 44)
(2,307 observations deleted)

. summarize earnings lnearns

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    earnings |      1,983    32121.55    38053.31        120     999999
     lnearns |      1,983    10.04658    .9001594   4.787492   13.81551


This command keeps the observations for which lnearns is nonmissing and
35 ≤ age ≤ 44. Note that now earnings and lnearns are summarized for
the same 1,983 observations.


As a second example, the commands 


. * Select the sample using keep and drop commands 
. use cleaneddata, clear 


. keep lnearns age 


. drop in 1/1000 
(1,000 observations deleted) 


will lead to a sample that contains data on all but the first 1,000 observations 
for just the two variables lnearns and age. The use cleaneddata command
is added because the previous example had already dropped some of the 
data. 


If we want to speed up analysis of large datasets, especially exploratory 
analysis, it can be useful to work with a random subset of the data. The 
command keep if runiform()<0.1, for example, will keep a subsample of 
approximately 10% of the observations because the runiform() function
provides pseudorandom draws of the uniform distribution on (0, 1). 
Alternatively, the command sample 10, for example, will draw a sample of 
exactly 10% of the observations; the option count instead enables drawing a 
sample with a specified number of observations. In all cases, one should first 
use the set seed command to ensure reproducibility. 
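
A minimal sketch of these subsampling commands follows. It assumes Stata's shipped auto.dta (74 observations) as a stand-in for a large dataset, and the seed value 10101 is arbitrary; neither comes from the data used in this chapter.

```stata
* Random subsamples (auto.dta is a stand-in; the seed value is arbitrary)
sysuse auto, clear
set seed 10101

* Keep each observation with probability 0.1, so the subsample size is random
preserve
keep if runiform() < 0.1
count
restore

* Keep a 10% sample of the observations
preserve
sample 10
count
restore

* Keep exactly 20 observations
preserve
sample 20, count
count
restore
```

Each draw is wrapped in preserve and restore so that the full dataset is available again afterward.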


2.4.10 Time-series data 


For time-series data that include a time variable, the tsset command 
declares the data to be a time series, allowing the use of Stata’s time-series 
operators. 


For example, suppose data are monthly and the variable month is an 
integer that increases by one for each month. Then the command tsset 
month declares the data series to be a time series with time variable month. 


It is then possible to use time-series operators, such as lags, differences,
and leads. For example, the command generate ylag = l.y creates a
variable that equals variable y in the preceding month. Similarly, l12.y
denotes the 12th lag of y, or the value in the same month in the previous
year, and d.y denotes the first difference or monthly change in variable y. If
data are unavailable, then a missing value is generated. For example, for the
first observation, it is not possible to create l.y.


A major complication is defining a time variable in a format that Stata
recognizes because time-series datasets often store the date variable as a
string variable. For example, suppose we have monthly data with variable
month stored as “February 1, 1960” or as “2/1/1960”. This can be converted
to a number using the date() function. In this example, we use generate
date2 = date(month, "MDY") because the date string variable was ordered
month, day, year. Stata normalizes dates as starting at 1/1/1960, so, for
example, 2/1/1960 becomes 31. Because we have monthly data, we convert
this to the number of months since 1960 using command generate date3 =
mofd(date2). Variable date3 can be used immediately in a tsset command,
but for proper dates to appear on graphs, we should give a date format. Here
date3 is months since 1960, so we give command format %tm date3
because %tm is the format for monthly data.


The particulars in the preceding example will change according to 
whether data are daily, weekly, monthly, quarterly, yearly, ... and the exact 
way that dates appear in the original data. For details, see [M-5] date() and
[D] Datetime. 
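
The date conversion and time-series operators described above can be sketched as follows. The three-observation dataset is hypothetical and is typed in with input only so that the example is self-contained; the variable names month, date2, date3, ylag, and dy follow or extend those used in the text.

```stata
* Convert string dates to a monthly time variable (hypothetical data)
clear
input str10 month y
"1/1/1960"  10
"2/1/1960"  12
"3/1/1960"  15
end

* Days since 1/1/1960: 0, 31, and 60 (1960 is a leap year)
generate date2 = date(month, "MDY")

* Months since January 1960: 0, 1, and 2
generate date3 = mofd(date2)
format %tm date3

* Declare the time variable, then use the lag and difference operators
tsset date3
generate ylag = l.y
generate dy = d.y
list, clean
```

The first observation has missing values for ylag and dy because no earlier month is available.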


2.5 Manipulating datasets 


Useful manipulations of datasets include reordering observations or 
variables, temporarily changing the dataset but then returning to the original 
dataset, breaking one observation into several observations (and vice versa), 
and combining more than one dataset. We use mus202psid92m.dta obtained 
from PSID data on men aged 30-50 years in 1992. The dataset is identical to 
file cleaneddata.dta created by the preceding code, but with additional 
edits to variable labels. 


2.5.1 Ordering observations and variables 


Some commands, such as those using the by prefix, require sorted 
observations. The sort command orders observations in ascending order 
according to the variables in the command. The sort, stable command 
ensures that the previous ordering of tied values of the sort variables is 
maintained. The gsort command allows ordering to be in descending order. 


You can also reorder the variables by using the order command. This 
can be useful if, for example, you want to distribute a dataset to others with 
the most important variables appearing as the first variables in the dataset. 
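
The ordering commands above can be sketched as follows. This is a minimal illustration that assumes Stata's shipped auto.dta as a stand-in dataset; the variable names price, mpg, and foreign belong to that dataset and not to the data used in this chapter.

```stata
* Ordering observations and variables (auto.dta is a stand-in dataset)
sysuse auto, clear

* Ascending sort on foreign, then mpg; stable keeps the prior order of ties
sort foreign mpg, stable

* Descending sort on price
gsort -price

* Move price and mpg to the front of the variable list
order price mpg
describe, simple
```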


2.5.2 Preserving and restoring a dataset 


In some cases, it is desirable to temporarily change the dataset, perform 
some calculation, and then return the dataset to its original form. An 
example involving the computation of marginal effects is presented in 
section 13.5.4. The preserve command preserves the data, and the restore
command restores the data to the form they had immediately before
preserve.


. * Commands preserve and restore illustrated
. use mus202psid92m, clear

. list age in 1/1, noheader clean
  1.     40

. preserve

. replace age = age + 1000
variable age was byte now int
(4,290 real changes made)

. list age in 1/1, noheader clean
  1.   1040

. restore

. list age in 1/1, noheader clean
  1.     40


As desired, the data have been returned to original values.


2.5.3 Data frames


The preserve and restore commands essentially allow use of two datasets 
rather than one. The frame commands, introduced in Stata 16, enable 
switching between multiple datasets. 


One way to create a new data frame is to make a copy of an existing data
frame. The following code first determines that the current data in memory
have frame title default, renames this as first, and copies first to
second.


. * Create a new data frame
. frame 
(current frame is default) 


. frame rename default first 


. frame copy first second 


Next, we change the current frame to second and change variable age.
One feature of data frames is the frame prefix, which enables one to execute
a command on data in a specified frame. In this case, while in frame second,
we list an observation in frame first.


. * Change to the new data frame and manipulate
. frame change second

. replace age = age + 1000
variable age was byte now int
(4,290 real changes made)

. list age in 1/1, noheader clean
  1.   1040

. frame first: list age in 1/1, noheader clean
  1.     40


We then return to the original dataset. 


. * Revert back to the original data frame 
. frame change first 


. list age in 1/1, noheader clean 
1. 40 


For the examples in this book, it is sufficient to use the preserve and 
restore commands, so we do not make use of the frame commands. 
Nonetheless, the ability to switch between data frames is a great 
enhancement to Stata’s data management capabilities that will be especially 
useful in preparing data ahead of implementation of the statistical methods 
that are the focus of this book. 


2.5.4 Collapsing and expanding datasets 


The collapse command collapses the dataset to a dataset of summary 
statistics. 


For example, we can create a dataset of the median values of earnings 
and age for each of the four education categories. 


. * Collapse to dataset of medians of earnings and age for each value of edcat 
. preserve 


. keep if !missing(earnings) & !missing(age) 
(209 observations deleted) 


. collapse (median) earnings age, by(edcat) 
. list, clean 


       edcat   earnings   age
  1.       1      14000    38
  2.       2      22000    37
  3.       3      28000    38
  4.       4    41392.5    39

. restore 


If the missing observations had not been dropped, then a fifth observation 
would have been created because of the missing values of education. 


The expand command creates duplicate observations. The number of 
duplicates can be a fixed number, such as 3, or be determined by a variable. 
For example, we expand the immediately preceding dataset of four 
observations, with duplicates determined by the value of edcat, provided 
edcat is less than four. 


. * Expand dataset with number of duplicate observations = edcat if edcat < 4 
. preserve 


. qui keep if !missing(earnings) & !missing(age) 
. qui collapse (median) earnings age, by(edcat) 
. expand edcat if edcat < 4
(3 observations created)

. list, clean

       edcat   earnings   age
  1.       1      14000    38
  2.       2      22000    37
  3.       3      28000    38
  4.       4    41392.5    39
  5.       2      22000    37
  6.       3      28000    38
  7.       3      28000    38

. restore 


The expandcl command duplicates clusters. 


2.5.5 Wide and long forms for a dataset 


Some datasets may combine several observations into a single observation. 
For example, a single household observation may contain data for several 
household members, or a single individual observation may have data for 
each of several years. This format for data is called wide form. If instead 
these data are broken out so that an observation is for a distinct household 
member, or for a distinct individual-year pair, the data are said to be in long
form. 


The reshape command is detailed in section 8.10. It converts data from 
wide form to long form and vice versa. This is necessary if an estimation 
command requires data to be in long form, say, but the original dataset is in 
wide form. The distinction is important especially for analysis of panel data 
and multinomial data. 
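
As a brief illustration ahead of the fuller treatment in section 8.10, the following sketch converts a small dataset from wide form to long form and back. The data and the variable names id, earn1990, earn1991, and earn1992 are hypothetical.

```stata
* Wide form: one observation per individual, one earnings variable per year
clear
input id earn1990 earn1991 earn1992
1 20000 21000 23000
2 35000 36000 34000
end

* Long form: one observation per individual-year pair (6 observations)
reshape long earn, i(id) j(year)
list, clean

* Back to wide form (2 observations)
reshape wide earn, i(id) j(year)
list, clean
```

Here i() names the variable that identifies the individual, and j() names the variable that holds the numeric suffix stripped from the earn stub.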


2.5.6 Merging datasets 


The merge command combines two datasets to create a wider dataset; that is, 
new variables from the second dataset are added to existing variables of the 
first dataset. Common examples are data on the same individuals obtained 
from two separate sources that then need to be combined and data on 
supplementary variables or additional years of data. 


Merging two datasets involves adding information from a dataset on disk 
to a dataset in memory. The dataset in memory is known as the master 
dataset. 


Merging two datasets is straightforward if the datasets have the same 
number of observations and the merge is a line-to-line merge. Then line 10, 
for example, of one dataset is combined with line 10 of the other dataset to 
create a longer line 10. The merge 1:1 _n command performs this merge. 


We consider instead a match-merge, where observations in the two
datasets are combined if they have the same values for one or more
identifying variables that are used to determine the match. In either case,
when a match is made and a variable appears in both datasets, the default is
for the master dataset value to be retained, even if it is missing; the update
option, described later in this section, instead replaces a missing master
value with the value in the second dataset. If a variable exists only in the
second dataset, then it is added as a variable to the master dataset.


To demonstrate a match-merge, we create two datasets from the dataset 
used in this chapter. The first dataset comprises every third observation with 
data on id, education, and earnings: 


. * Create first dataset with every third observation 
. use mus202psid92m, clear 


. keep if mod(_n,3) == 0 
(2,860 observations deleted) 


. keep id education earnings 
. list in 1/4, clean 


       educat~n   earnings   id
  1.         16      38708    3
  2.         12       3265    6
  3.         11      19426    9
  4.         11      30000   12


. qui save merge1, replace


The keep if mod(_n,3) == 0 command keeps an observation if the 
observation number (_n) is exactly divisible by 3, so every third observation 
is kept. Because id = _n for these data, by saving every third observation,
we are saving observations with id equal to 3, 6, 9, .... 


The second dataset comprises every second observation with data on id,
education, and hours:


. * Create second dataset with every second observation 
. use mus202psid92m, clear 


. keep if mod(_n,2) == 0 
(2,145 observations deleted) 


. keep id education hours

. list in 1/4, clean

       educat~n     hours   id
  1.         12   2008.00    2
  2.         12   2200.00    4
  3.         12    552.00    6
  4.         17   3750.00    8


. qui save merge2, replace 


Now, we are saving observations with id equal to 2, 4, 6, .... 


Now, we merge the two datasets by using the merge 1:1 command. This 
requires that the identifying variable or variables for the match uniquely 
identify each observation in each dataset. 


In our case, the datasets differ in both the observations included and the 
variables included, though there is considerable overlap. We perform a 1:1 
match-merge on id to obtain 


. * Merge 1:1 two datasets with some observations and variables different
. clear

. use merge1

. sort id

. merge 1:1 id using merge2

    Result                      Number of obs
    -----------------------------------------
    Not matched                         2,145
        from master                       715  (_merge==1)
        from using                      1,430  (_merge==2)

    Matched                               715  (_merge==3)
    -----------------------------------------

. sort id

. list in 1/4, clean

       educat~n   earnings   id     hours            _merge
  1.         12          .    2   2008.00    Using only (2)
  2.         16      38708    3         .   Master only (1)
  3.         12          .    4   2200.00    Using only (2)
  4.         12       3265    6    552.00       Matched (3)


Recall that observations from the master dataset have id equal to 3, 6, 9,
... and observations from the second dataset have id equal to 2, 4, 6, ....
Data for education are always available because education is in both
datasets. Observations for earnings come from the master dataset; they are
available when id is 3, 6, 9, ... and are missing otherwise. Similarly,
observations for hours come from the second dataset; they are available
when id is 2, 4, 6, ... and are missing otherwise.


The same result is obtained if the roles of merge1.dta and merge2.dta
are reversed. The sort commands are unnecessary; they simply ensure that
observations in the merged dataset are ordered by id. For simplicity, we
matched on just one variable, id, but one can match on several variables.


The merge command creates a variable, _merge, that takes on a value of
1 if the variables for an observation all come from the master dataset, a value
of 2 if they all come from only the second dataset, and a value of 3 if for an
observation some variables come from the master and some from the second
dataset. After using merge, you should check that the number of observations
for each value of _merge matches your expectations.


There are several options when using merge. The update option varies
the action merge takes when an observation is matched. By default, the
master dataset is held inviolate. If update is specified, values from the
master dataset are still retained when the same variables are found in both
datasets, except that the values from the merging dataset are used in cases
where the variable is missing in the master dataset. The replace option,
allowed only with the update option, specifies that even if the master dataset
contains nonmissing values, they are to be replaced with corresponding
values from the merging dataset when corresponding values are not equal. A
nonmissing value, however, will never be replaced with a missing value.
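
The effect of these options can be seen in a small sketch. The two three-observation datasets below are hypothetical and are held in temporary files; the macro names masterdata and seconddata are ours. The variable x appears in both datasets, with one missing value in each.

```stata
* Default merge versus the update and update replace options (hypothetical data)
clear
input id x
1 10
2 .
3 30
end
tempfile masterdata
quietly save "`masterdata'"

clear
input id x
1 11
2 22
3 .
end
tempfile seconddata
quietly save "`seconddata'"

* Default: master values of x are kept, so x is 10, ., 30
use "`masterdata'", clear
merge 1:1 id using "`seconddata'"
list id x _merge, clean

* update: only the missing master value is filled in, so x is 10, 22, 30
use "`masterdata'", clear
merge 1:1 id using "`seconddata'", update
list id x _merge, clean

* update replace: unequal nonmissing values are also replaced, so x is 11, 22, 30
use "`masterdata'", clear
merge 1:1 id using "`seconddata'", update replace
list id x _merge, clean
```

In the last merge, the value 30 for id = 3 is kept because the corresponding value in the second dataset is missing.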


A common type of merge that arises is when one dataset has one 
observation per id, where id is the key variable for the match, and the other 
dataset can have multiple observations per id. This merge can be done using 
either an appropriate one-to-many match or an appropriate many-to-one 
match. 


Suppose mergea.dta has observations uniquely identified by id, while
mergeb.dta can have multiple observations per id. Then we can give 
commands 


. use mergea 
. merge 1:m id using mergeb 


where id needs to uniquely identify observations in the master file. The
same merged dataset can be obtained using the commands 


. use mergeb 
. merge m:1 id using mergea 


where now id needs to uniquely identify observations in the using file.
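
A self-contained sketch of the two equivalent merges follows. The datasets are hypothetical stand-ins for mergea.dta and mergeb.dta, held in temporary files whose macro names persons and personyears are ours.

```stata
* One-to-many and many-to-one merges of the same two datasets (hypothetical data)
clear
input id age
1 40
2 35
end
tempfile persons
quietly save "`persons'"

clear
input id year earnings
1 1991 20000
1 1992 22000
2 1991 30000
end
tempfile personyears
quietly save "`personyears'"

* Master has one observation per id; using has several
use "`persons'", clear
merge 1:m id using "`personyears'"
list, clean

* Roles reversed: master has several observations per id
use "`personyears'", clear
merge m:1 id using "`persons'"
list, clean
```

Both merges yield three matched observations containing id, age, year, and earnings; only the order of the variables differs.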


2.5.7 Appending datasets 


The append command creates a longer dataset, with the observations from 
the second dataset appended after all the observations from the first dataset. 
If the same variable has different names in the two datasets, the variable 
name in one of the datasets should be changed by using the rename 
command so that the names match. 


. * Append two datasets with some observations and variables different 
. clear 


. use merge1

. append using merge2

. sort id

. list in 1/5, clean

       educat~n   earnings   id     hours
  1.         12          .    2   2008.00
  2.         16      38708    3         .
  3.         12          .    4   2200.00
  4.         12          .    6    552.00
  5.         12       3265    6         .


Now, merge2.dta is appended to the end of merge1.dta. The combined
dataset has observations 3, 6, 9, ..., 4290 followed by observations
2, 4, 6, ..., 4290. We then sort on id. Now both every second and every third
observation is included, so after sorting, we have observations
2, 3, 4, 6, 6, 8, 9, .... Note, however, that no attempt has been made to merge
the datasets. In particular, for the observation with id = 6, the earnings
variable is missing in observation 4, and the hours variable is missing in
observation 5. This is because the earnings variable is missing from the
using dataset and the hours variable is missing from the master dataset.


In this example, to take full advantage of the data, we would need to 
merge the two datasets using the first dataset as the master, merge the two 
datasets using the second dataset as the master, and then append the two 
datasets. 


2.6 Graphical display of data 


Graphs visually demonstrate important features of the data. Different types 
of data require distinct graph formats to bring out these features. We 
emphasize methods for numerical data taking many values, particularly, 
nonparametric methods. 


2.6.1 Stata graph commands 


The Stata graph commands begin with the word graph (in some cases, this is 
optional) followed by the graph plottype, usually twoway. We cover several 
leading examples but ignore the plottypes bar and pie for categorical data. 


Example graph commands 


The basic graph commands are very short and simple to use. For example, 


. use mus202psid92m, clear 


. twoway scatter lnearns hours 


produces a scatterplot of lnearns on hours, shown in figure 2.1. Most graph
commands support the if and in qualifiers, and some support weights.


Figure 2.1. A basic scatterplot of log earnings on hours


In practice, however, customizing is often desirable. For example, we 
may want to display the relationship between lnearns and hours by
showing both the data scatterplot and the ordinary least-squares (OLS) fitted 
line on the same graph. Additionally, we may want to change the size of the 
scatterplot data points, change the width of the regression line, and provide a 
title for the graph. We type 


. * More advanced graphics command with two plots and with several options 
. graph twoway (scatter lnearns hours, msize(small))
>     (lfit lnearns hours, lwidth(medthick)),
>     title("Scatterplot and OLS fitted line")


The two separate components scatter and lfit are specified separately
within parentheses. Each of these commands is given with one option, after
the comma but within the relevant parentheses. The msize(small) option
makes the scatterplot dots smaller than the default, and the
lwidth(medthick) option makes the OLS fitted line thicker than the default.
The title() option for twoway appears after the last comma. The graph
produced is shown in figure 2.2.


Figure 2.2. A more elaborate scatterplot of log earnings on hours


We often use lengthy graph commands that span multiple lines to 
produce template graphs that are better looking than those produced with 
default settings. In particular, these commands add titles and rescale the 
points, lines, and axes to a suitable size because the graphs printed in this 
book are printed in a much smaller space than a full-page graph in landscape 
mode. These templates can be modified for other applications by changing 
variable names and title text. 


Saving and exporting graphs 


Once a graph is created, it can be saved. Stata uses the term save to mean 
saving the graph in Stata’s internal graph format, as a file with the .gph 
extension. This can be done by using the saving() option in a graph 
command or by typing graph save after the graph is created. When saved in 
this way, the graphs can be reaccessed and further manipulated at a later 
date. 


Two or more Stata graphs can be combined into a single figure by using 
the graph combine command. For example, we save the first graph as
graph1.gph, save the second graph as graph2.gph, and type the command


. * Combine graphs saved as graph1.gph and graph2.gph
. graph combine graph1.gph graph2.gph
(output omitted)


Section 3.2.8 provides an example.


The Stata internal graph format (.gph) is not recognized by other 
programs, such as word processors. To save a graph in an external format, 
you would use the graph export command. For example, 


. * Save graph as a scalable vector graphic 
. graph export mygraph.svg 


(output omitted ) 


Various formats are available, including PostScript (.ps), Encapsulated 
PostScript (.eps), scalable vector graphics (.svg), Windows Enhanced 
Metafile (.emf), GIF (.gif), JPEG (.jpg), PDF (.pdf), Portable Network
Graphics (.png), and TIFF (.tif). The best format to select depends in part
on what word processor is used; some trial and error may be needed. 


Learning how to use graph commands 


The Stata graph commands are extremely rich and provide an exceptional 
range of user control through a multitude of options. 


A good way to learn the possibilities is to create a graph interactively in 
Stata. For example, from the menus, select Graphics > Twoway graph 
(scatter, line, etc.). In the Plots tab of the resulting dialog box, select 
Create..., choose Scatter, provide a Y variable and an X variable, and then 
click on Marker properties. From the Symbol drop-down list, change the 
default to, say, Triangle. Similarly, cycle through the other options, and 
change the default settings to something else. 


Once an initial graph is created, the point-and-click Stata Graph Editor 
allows further customizing of the graph, such as adding text and arrows 
wherever desired. This is an exceptionally powerful tool that we do not 
pursue here; for a summary, see [G-1] Graph Editor. The Graph Recorder 
can even save sequences of changes to apply to similar graphs created from 
different samples. 


Even given familiarity with Stata’s graph commands, you may need to 
tweak a graph considerably to make it useful. For example, any graph that 
analyzes the earnings variable using all observations will run into problems 
because one observation has a large outlying value of $999,999. Possibilities 
in that case are to drop outliers, plot with the yscale(log) option, or use log
earnings instead. 


We find it easiest to work with lengthy template graph commands, such 
as that given at the beginning of this subsection, and modify these as needed. 


2.6.2 Box-and-whisker plot 


The graph box command produces a box-and-whisker plot that is a graphical 
way to display data on a single series. The boxes cover the interquartile 
range, from the lower quartile to the upper quartile. The whiskers, denoted 
by horizontal lines, extend to cover most of or all the range of the data. Stata 
places the upper whisker at the upper quartile plus 1.5 times the interquartile 
range, or at the maximum of the data if this is smaller. Similarly, the lower 
whisker is the lower quartile minus 1.5 times the interquartile range, or the 
minimum should this be larger. Any data values outside the whiskers are 
represented with dots. Box-and-whisker plots can be especially useful for 
identifying outliers. 


The essential command for a box-and-whisker plot of the hours variable
is


. * Simple box-and-whisker plot
. graph box hours 


(output omitted) 


We want to present separate box plots of hours for each of four 
education groups by using the over() option. To make the plot more
intelligible, we first provide labels for the four education categories as 
follows: 


. use mus202psid92m, clear 


. label define edtype 1 "< high school" 2 "High school" 3 "Some college" 
> 4 "College degree" 


. label values edcat edtype 


The scale(1.2) graph option is added for readability; it increases the size of
text, markers, and line widths (by a multiple of 1.2). The marker() option is
added to reduce the size of quantities within the box, the ytitle() option is
used to present the title, and the yscale(titlegap(*5)) option is added to
increase the gap between the y-axis title and the tick labels. We have


. * Box-and-whisker plot of single variable over several categories
. graph box hours, over(edcat) scale(1.2) marker(1, msize(vsmall))
>     ytitle("Annual hours worked by education") yscale(titlegap(*5))


The result is given in figure 2.3. The labels for edcat, rather than the 
values, are automatically given, making the graph much more readable. The 
filled-in boxes present the interquartile range, the intermediate line denotes 
the median, and data outside the whiskers appear as dots. For these data, 
annual hours are clearly lower for the lowest schooling group, and there are 
quite a few outliers. About 30 individuals appear to work in excess of 4,000 
hours per year.


Figure 2.3. Box-and-whisker plots of annual hours for four
categories of educational attainment


2.6.3 Histogram 


The probability mass function or density function can be estimated using a 
histogram produced by the histogram command. The command can be used 
with if and in qualifiers and with weights. The key options are width(#) to
set the bin width, bin(#) to set the number of bins, start(#) to set the
lower limit of the first bin, and discrete to indicate that the data are
discrete. The default number of bins is $\min\{\sqrt{N}, 10 \ln N / \ln 10\}$. Other
options overlay a fitted normal density (the normal option) or a kernel 
density estimate (the kdensity option). 
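
The default number of bins can be checked by hand. The following sketch assumes Stata's shipped auto.dta as a stand-in dataset with N = 74, for which the formula gives min(8.60, 18.69) = 8.60. We assume, rather than know from the text, that Stata then uses the integer part of this value.

```stata
* Default number of histogram bins implied by the formula (auto.dta stand-in)
sysuse auto, clear
quietly count if !missing(mpg)
local N = r(N)
local k = min(sqrt(`N'), 10*ln(`N')/ln(10))
display "N = `N', formula value = " %6.3f `k' ", integer part = " floor(`k')

* Compare with the number of bins that histogram reports in its output
histogram mpg, nodraw
```

The note that histogram prints, of the form (bin=#, start=#, width=#), can then be compared with the integer part displayed above.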


For discrete data taking relatively few values, there is usually no need to 
use the options. 


For continuous data or for discrete data taking many values, it can be 
necessary to use options because the Stata defaults set bin widths that are not 
nicely rounded numbers and the number of bins might also not be desirable. 
For example, the output from histogram lnearns states that there are 35
bins, a bin width of 0.268, and a start value of 4.43. A better choice may be 


. * Histogram with bin width and start value set 
. histogram lnearns, width(0.25) start(4.0) scale(1.2) 
(bin=40, start=4, width=.25) 


Figure 2.4. A histogram for log earnings


2.6.4 Kernel density plot 


For continuous data taking many values, a better alternative to the histogram 
is a kernel density plot. This provides a smoother version of the histogram in 
two ways: First, it directly connects the midpoints of the histogram rather 
than forming the histogram step function. Second, rather than giving each 
entry in a bin equal weight, it gives more weight to data that are closest to 
the point of evaluation. 


Let $f(x)$ denote the density. The kernel density estimate of $f(x)$ at
$x = x_0$ is

$$\hat{f}(x_0) = \frac{1}{Nh} \sum_{i=1}^{N} K\left(\frac{x_i - x_0}{h}\right) \tag{2.1}$$

where $K(\cdot)$ is a kernel function that places greater weight on points $x_i$ close
to $x_0$. More precisely, $K(z)$ is symmetric around zero, integrates to one, and
either $K(z) = 0$ if $|z| > z_0$ (for some $z_0$) or $K(z) \to 0$ as $z \to \infty$. A
histogram with a bin width of $2h$ evaluated at $x_0$ can be shown to be the
special case $K(z) = 1/2$ if $|z| < 1$, and $K(z) = 0$ otherwise.


A kernel density plot is obtained by choosing a kernel function, $K(\cdot)$;
choosing a width, $h$; evaluating $\hat{f}(x_0)$ at a range of values of $x_0$; and
plotting $\hat{f}(x_0)$ against these $x_0$ values.


The kdensity command produces a kernel density estimate. The
command can be used with if and in qualifiers and with weights. The
default window width or bandwidth is $h = 0.9m/n^{1/5}$, where
$m = \min(s_x, \mathrm{iqr}_x/1.349)$, $s_x$ is the standard deviation of $x$, and
$\mathrm{iqr}_x$ is the interquartile range of $x$. The bwidth(#) option
allows a different width ($h$) to be specified, with larger choices of $h$ leading
to smoother density plots. The n(#) option changes the number of evaluation
points, $x_0$, from the default of $\min(N, 50)$. Other options overlay a fitted
normal density (the normal option) or a fitted $t$ density (the student(#)
option).
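
The default bandwidth formula can be reproduced from summary statistics. The sketch below assumes Stata's shipped auto.dta as a stand-in dataset and assumes that the quartiles reported by summarize, detail are the ones entering the interquartile range; the local macro names m and h mirror the symbols in the formula.

```stata
* Rule-of-thumb bandwidth h = 0.9*m/n^(1/5) (auto.dta is a stand-in dataset)
sysuse auto, clear
quietly summarize mpg, detail
local m = min(r(sd), (r(p75) - r(p25))/1.349)
local h = 0.9*`m'/r(N)^(1/5)
display "Bandwidth from the formula: " %8.4f `h'

* Bandwidth that kdensity reports having used by default
quietly kdensity mpg, nograph
display "Bandwidth used by kdensity: " %8.4f r(bwidth)
```

If the assumptions hold, the two displayed bandwidths should agree up to rounding.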


The default bandwidth is based on theory for when the underlying 
distribution is normal and the Gaussian kernel is used. Note that if the same 
default bandwidth is used, then the different kernels lead to different degrees 
of smoothing: 1) the cosine kernel gives the least smooth plot; 2) the epan2, 
biweight, triangle, rectangle, and parzen kernels give similar amounts 
of smoothing; and 3) the default epanechnikov and the gaussian kernels 
provide the smoothest plots. 


The default kernel function is the Epanechnikov, which sets
$K(z) = (3/4)(1 - z^2/5)/\sqrt{5}$ if $|z| < \sqrt{5}$, and $K(z) = 0$ otherwise. The
kernel() option allows other kernels to be chosen. The kernel(epan2)
option sets $K(z) = (3/4)(1 - z^2)$ if $|z| < 1$, and $K(z) = 0$ otherwise. The
same results are obtained if the epan2 bandwidth is $\sqrt{5}$ times the
epanechnikov bandwidth. But if the same default bandwidth is used, then the
two lead to different results. If a smoother kernel density plot is desired, then
the default epanechnikov kernel with default bandwidth may be a good
starting point. If a rougher plot is desired, then the epan2 with the default
bandwidth may be a better starting point.
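
The equivalence of the two kernels under rescaled bandwidths can be checked numerically. The sketch below assumes Stata's shipped auto.dta as a stand-in dataset and an arbitrary epanechnikov bandwidth of 2; the new variable names f_epan, f_epan2, and gap are ours. Both densities are evaluated at the observed values of mpg by using the at() option so that they are directly comparable.

```stata
* epanechnikov with bandwidth h versus epan2 with bandwidth sqrt(5)*h
sysuse auto, clear
local h1 = 2
local h2 = `h1'*sqrt(5)
quietly kdensity mpg, kernel(epanechnikov) bwidth(`h1') at(mpg) ///
    generate(f_epan) nograph
quietly kdensity mpg, kernel(epan2) bwidth(`h2') at(mpg) ///
    generate(f_epan2) nograph

* The two estimates should coincide up to floating-point error
generate double gap = abs(f_epan - f_epan2)
summarize gap
```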


From output not given, results from the command kdensity lnearns
included a statement that the Epanechnikov kernel is used and the bandwidth
equals 0.1227. To instead manually specify a bandwidth of 0.12 and overlay
a fitted normal density, we type the command


. * Kernel density plot with bandwidth set and fitted normal density overlaid 
. kdensity lnearns, bwidth(0.12) normal n(4000) scale(1.2) 


which produces the graph in figure 2.5. This graph shows that the kernel 
density is more peaked than the normal and is somewhat left skewed.


Figure 2.5. The estimated density of log earnings


The following code instead presents a histogram overlaid by a kernel 
density estimate. The histogram bin width is set to 0.25, the kernel density 
bandwidth is set to 0.2 using the kdenopts() option, and the kernel density 
plot line thickness is increased using the lwidth(medthick) option. Other 
options used here were explained in section 2.6.2. We have 


. * Histogram and nonparametric kernel density estimate 
. histogram lnearns if lnearns > 0, width(0.25) kdensity 
>     kdenopts(bwidth(0.2) lwidth(medthick)) 
>     plotregion(style(none)) scale(1.2) 
>     title("Histogram and density for log earnings") 
>     xtitle("Log annual earnings", size(medlarge)) xscale(titlegap(*5)) 
>     ytitle("Histogram and density", size(medlarge)) yscale(titlegap(*5)) 
(bin=38, start=4.4308167, width=.25) 


The result is given in figure 2.6. Both the histogram and the kernel 
density estimate indicate that the natural logarithm of earnings has a density 
that is mildly left skewed. A similar figure for the level of earnings is very 
right skewed. 


Figure 2.6. Histogram and kernel density plot for log earnings 


2.6.5 Twoway scatterplots and fitted lines 


As we saw in figure 2.1, scatterplots provide a quick look at the relationship 
between two variables. 


For scatterplots with discrete data that take on few values, it can be 
necessary to use the jitter() option. This option adds random noise so that 
points are not plotted on top of one another; see section 17.5.4 for an 
example. 


It can be useful to additionally provide a fitted curve. Stata provides 
several possibilities for estimating a global relationship between y and x, 
where by “global” we mean that a single relationship is estimated for all 
observations, and then for plotting the fitted values of y against x. 


The twoway lfit command does so for a fitted OLS regression line, the 
twoway qfit command for a fitted quadratic regression curve, and the 
twoway fpfit command for a curve fit by fractional polynomial regression. 
The related twoway commands lfitci, qfitci, and fpfitci additionally 
provide confidence bands for predicting the conditional mean $E(y|x)$ (by 
using the stdp option) or for forecasting of the actual value of $y|x$ (by using 
the stdf option). 
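
The difference between the two band types can be seen from the standard errors that predict returns after regress. The following is a minimal sketch under stated assumptions: it uses the auto dataset shipped with Stata rather than the earnings data, and the variable names se_mean, se_fcast, and check are hypothetical.

```stata
* Sketch: standard error of the conditional mean (stdp) versus the forecast (stdf)
* Assumption: the auto dataset shipped with Stata stands in for the book's data
sysuse auto, clear
regress mpg weight

* stdp: standard error of the fitted conditional mean E(y|x)
predict double se_mean, stdp
* stdf: standard error of the forecast of an actual value y|x
predict double se_fcast, stdf

* The forecast variance adds the estimated error variance to the variance of the mean
generate double check = se_mean^2 + e(rmse)^2
summarize se_mean se_fcast

* The same two choices determine the width of the bands drawn by lfitci
twoway (lfitci mpg weight, stdf) (lfitci mpg weight, stdp) ///
    (scatter mpg weight, msize(small))
```

Because the forecast variance includes the error variance, the stdf bands are wider than the stdp bands at every value of x.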


For example, we may want to provide a scatterplot and fitted quadratic 
with confidence bands for the forecast value of y|x (the result is shown in 
figure 2.7): 


. * Two-way scatterplot and quadratic regression curve with 95% ci for y|x 
. twoway (qfitci lnearns hours, stdf) (scatter lnearns hours, msize(small)) 


Figure 2.7. Twoway scatterplot and fitted quadratic with 
confidence bands 


2.6.6 Twoway scatterplots and locally fitted lines 


An alternative curve-fitting approach to assuming linear or quadratic 
relationships between y and x is to use nonparametric methods. These fit a 
local relationship between y and x, at many values $x_0$ of x. By “local”, we 
mean that prediction of y at each evaluation point $x_0$ is based on weighted 
regression centered on data points in the neighborhood of $x_0$. 


There are several nonparametric methods. 


An easily understood example is a median-band plot. The range of x is 
broken into, say, 20 intervals; the medians of y and x in each interval are 
obtained; and the 20 medians of y are plotted against the 20 medians of x, 
with connecting lines between the points. The twoway mband command does 
this, and the related twoway mspline command uses a cubic spline to obtain 
a smoother version of the median-band plot. 


Local regression 


We consider the regression model $y = m(x) + u$, where x is a scalar and the 
conditional mean function $m(\cdot)$ is not specified. The goal is to estimate $m(\cdot)$. 


A local regression estimate of $m(x)$ at $x = x_0$ is a local weighted 
average of $y_i$, $i = 1, \ldots, N$, that places great weight on observations for 
which $x_i$ is close to $x_0$ and little or no weight on observations for which $x_i$ is 
far from $x_0$. Formally, 

$$\widehat{m}(x_0) = \sum_{i=1}^{N} w(x_i, x_0, h)\, y_i$$

where the weights $w(x_i, x_0, h)$ sum over $i$ to one and decrease as the 
distance between $x_i$ and $x_0$ increases. The weight depends on the closeness 
of $x_i$ to $x_0$ and on the parameter $h$, called a bandwidth parameter. Different 
methods use different weighting functions. 


A plot is obtained by choosing a weighting function, $w(x_i, x_0, h)$; 
choosing a bandwidth, $h$; evaluating $\widehat{m}(x_0)$ at a range of values of $x_0$; and 
plotting $\widehat{m}(x_0)$ against these $x_0$ values. For example, provided $N > 50$, the 
Stata default for the lpoly command presented below is to evaluate at 50 
evenly spaced points between the minimum and maximum values of x. 
It may seem that there may be too few data points between each value of $x_0$ 
to obtain a decent fit. This is avoided by using weights $w(x_i, x_0, h)$ that 
average over much more than a small fraction of the data. Essentially, rolling 
windows are used to evaluate $\widehat{m}(x_0)$ at a range of values of $x_0$. 


Local constant and local linear kernel-weighted plots 


Kernel regression at $x = x_0$ uses the weight 

$$w(x_i, x_0, h) = \frac{K\{(x_i - x_0)/h\}}{\sum_{j=1}^{N} K\{(x_j - x_0)/h\}}$$

where $K(\cdot)$ is a kernel function defined after (2.1). For example, the 
kernel(epan2) option sets $K(z) = (3/4)(1 - z^2)$ if $|z| < 1$, and $K(z) = 0$ 
otherwise. Estimates depend crucially on the bandwidth chosen, with smaller 
$h$ leading to more variable estimates of $m(x_0)$. 


The kernel regression estimate at $x = x_0$ can equivalently be obtained by 
minimizing 

$$\sum_{i=1}^{N} w(x_i, x_0, h)\,(y_i - a_0)^2$$

which is weighted regression on a constant where the kernel weights are 
largest for observations with $x_i$ close to $x_0$. Then $\widehat{m}(x_0) = \widehat{a}_0$. This 
estimator is also called the (kernel-weighted) local constant estimator. 


The (kernel-weighted) local linear estimator of $m(\cdot)$ additionally 
includes a slope coefficient and at $x = x_0$ minimizes 

$$\sum_{i=1}^{N} w(x_i, x_0, h) \times \{y_i - a_0 - b_0 (x_i - x_0)\}^2 \qquad (2.2)$$

Again, $\widehat{m}(x_0) = \widehat{a}_0$. This estimator has the advantage of better estimation of 
$m(x_0)$ at values of $x_0$ near the endpoints of the range of x because it allows 
for any trends near the endpoints. 


More generally, the local polynomial estimator of degree $p$ uses a 
polynomial of degree $p$ in $(x_i - x_0)$ in (2.2). 
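
The local constant and local linear estimates at a single point can be reproduced with ordinary weighted commands. The following is a minimal sketch under stated assumptions: the auto dataset shipped with Stata stands in for the earnings data, the bandwidth of 500 and the evaluation point (the first observation's weight) are arbitrary, and the names m0, m1, dx, z, k, m0_manual, and m1_manual are hypothetical. It uses the lpoly command with the epan2 kernel.

```stata
* Sketch: reproduce lpoly's local constant and local linear fits at one point
* Assumptions: auto dataset, bandwidth h = 500, evaluation point x0 = weight[1]
sysuse auto, clear
local h = 500
local x0 = weight[1]

* lpoly estimates evaluated at every observed value of weight
lpoly mpg weight, degree(0) kernel(epan2) bwidth(`h') at(weight) generate(m0) nograph
lpoly mpg weight, degree(1) kernel(epan2) bwidth(`h') at(weight) generate(m1) nograph

* epan2 kernel weights K{(x_i - x0)/h}; zero outside |z| < 1
generate double dx = weight - `x0'
generate double z = dx/`h'
generate double k = cond(abs(z) < 1, 0.75*(1 - z^2), 0)

* Local constant: kernel-weighted average of y
summarize mpg [aweight = k] if k > 0, meanonly
scalar m0_manual = r(mean)

* Local linear: kernel-weighted regression of y on (x - x0); the intercept is the fit
regress mpg dx [aweight = k] if k > 0
scalar m1_manual = _b[_cons]

display "local constant: manual = " scalar(m0_manual) "  lpoly = " m0[1]
display "local linear:   manual = " scalar(m1_manual) "  lpoly = " m1[1]
```

Only the relative sizes of the weights matter, so the normalization of the weights to sum to one does not need to be imposed explicitly.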


The lpoly command 


The lpoly command implements local polynomial estimation. The 
command has syntax 


lpoly yvar xvar [if] [in] [weight] [, options] 


The degree(#) option specifies the degree $p$, with local constant the 
default ($p = 0$) and local linear ($p = 1$) the most commonly used degrees. 
The kernel() option specifies the kernel (with default epanechnikov); the 
bwidth(#) option specifies the kernel bandwidth $h$; and the generate() 
option saves the evaluation points $x_0$ and estimates $\widehat{m}(x_0)$. The default is to 
evaluate $\widehat{m}(x_0)$ at $\min(N, 50)$ equally spaced values of $x_0$ between the 
minimum and maximum values of x. The at(varname) option instead 
evaluates at the distinct values of varname, usually at(x), where x is the 
regressor. 


The default bandwidth is a plugin estimator of the optimal bandwidth 
assuming a constant bandwidth; see [R] lpoly. This default bandwidth is by 
no means perfect and in practice can undersmooth or oversmooth the data. 
In that case, the bandwidth needs to be set by using the bwidth() option. 
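
One way to work with the default bandwidth is to read it from the stored result r(bwidth) and then rescale it. The following is a minimal sketch, assuming the auto dataset shipped with Stata; the doubling factor and the names x_d, m_d, x_w, m_w, h_default, and h_used are illustrative only.

```stata
* Sketch: inspect the default lpoly bandwidth, then override it
* Assumption: the auto dataset shipped with Stata stands in for the book's data
sysuse auto, clear

* Default (plugin) bandwidth for a local linear fit; lpoly returns it in r(bwidth)
lpoly mpg weight, degree(1) generate(x_d m_d) nograph
scalar h_default = r(bwidth)
display "default bandwidth = " scalar(h_default)

* Refit with twice the default bandwidth to obtain a smoother curve
local h_wide = 2*scalar(h_default)
lpoly mpg weight, degree(1) bwidth(`h_wide') generate(x_w m_w) nograph
scalar h_used = r(bwidth)
display "bandwidth used    = " scalar(h_used)
```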


The following example illustrates the relationship between log earnings 
and hours worked. We present a local constant curve, using the 
kernel(epan2) option, and with 95% confidence bands added using the ci 
option. Additional options, explained below, modify the appearance of the 
graph. 


. * Local constant with epan2 kernel and 95% confidence bands 
. use mus202psid92m, clear 

. lpoly lnearns hours, kernel(epan2) ci msize(tiny) lwidth(medthick) 
>     plotregion(style(none)) xtitle("Annual hours", size(medlarge)) 
>     title("Local constant smooth") scale(1.1) 
>     ytitle("Natural log of annual earnings", size(medlarge)) 
>     legend(pos(4) ring(0) col(1)) legend(size(small)) 
>     legend(label(1 "CI") label(2 "Actual data") label(3 "Local constant")) 


Figure 2.8. Local constant plot of log earnings against hours 


The resulting graph is presented in figure 2.8. The confidence bands are 
much narrower in regions where there are many more observations on hours 
because then more observations are used in calculating the local average. For 
hours in excess of 4,500, there are generally too few observations to 
compute the local constant estimator because only observations within 84.17 
hours (the bandwidth used) of the 50 equally spaced evaluation points are 
used in computation. 


The lowess command 


The locally weighted scatterplot smoothing estimator (lowess) is a variation 
of the local linear estimator that uses a variable bandwidth and tricubic 
kernel and downweights observations with large residuals (using a method 
that greatly increases the computational burden). 


This estimator is obtained by using the lowess command. The bandwidth 
gives the fraction of the observations used to calculate $\widehat{m}(x_0)$ in the middle 
of the data, with a smaller fraction used toward the endpoints. The default 
value of 0.8 can be changed by using the bwidth(#) option, so as with the 
local polynomial methods, a smoother plot is obtained by increasing the 
bandwidth. 
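
The effect of the lowess bandwidth can be seen by saving two fits and overlaying them. The following is a minimal sketch, assuming the auto dataset shipped with Stata; the narrower bandwidth of 0.3 and the names low_default and low_narrow are chosen only for illustration.

```stata
* Sketch: lowess fits with the default bandwidth (0.8) and a narrower one (0.3)
* Assumption: the auto dataset shipped with Stata stands in for the book's data
sysuse auto, clear
lowess mpg weight, generate(low_default) nograph
lowess mpg weight, bwidth(0.3) generate(low_narrow) nograph

* Overlay both smooths on the scatterplot; the narrower bandwidth tracks the data more closely
twoway (scatter mpg weight, msize(small))            ///
    (line low_default weight, sort)                  ///
    (line low_narrow weight, sort lpattern(dash)),   ///
    legend(label(1 "Actual data") label(2 "bwidth 0.8") label(3 "bwidth 0.3"))
```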


We create a scatterplot (scatter) with a fitted lowess curve (lowess), 
along with a local linear curve (lpoly). The command is lengthy because of 
the detailed formatting commands used to produce a nicely labeled and 
formatted graph. 


The msize(tiny) option is used to decrease the size of the dots in the 
scatterplot. The lwidth(medthick) option is used to increase the thickness 
of lines, and the clstyle(p1) option changes the style of the line for 
lowess. The title() option provides the overall title for the graph. The 
xtitle() and ytitle() options provide titles for the x axis and y axis, and 
the size(medlarge) option defines the size of the text for these titles. The 
legend() options place the graph legend at four o’clock (pos(4)) with text 
size small and provide the legend labels. We have 


. * Scatterplot with lowess and local linear nonparametric regression 
. graph twoway (scatter lnearns hours, msize(tiny)) 
>     (lpoly lnearns hours, kernel(epan2) degree(1) clstyle(p1) lwidth(thick) 
>     bwidth(500)) (lowess lnearns hours, clstyle(p2) lwidth(thick)), 
>     plotregion(style(none)) title("Local linear and lowess fits") 
>     xtitle("Annual hours", size(medlarge)) scale(1.1) 
>     ytitle("Natural log of annual earnings", size(medlarge)) 
>     legend(pos(4) ring(0) col(1)) legend(size(small)) 
>     legend(label(1 "Actual data") label(2 "Local linear") label(3 "Lowess")) 


Figure 2.9. Local linear and lowess plots of log earnings against 
hours 


The resulting graph is presented in figure 2.9. This command used the 
default bandwidth setting for lowess and greatly increased the lpoly 
bandwidth from its automatically selected value of 84.17 to 500. Even so, 
the local linear curve is too variable at high hours where the data are sparse. 
At low hours, however, the lowess estimator overpredicts, while the local 
linear estimator does not. 


Figures 2.8 and 2.9 both indicate that log earnings increase with hours 
until about 2,500 hours and that a quadratic relationship may be appropriate. 


In summary, the lpoly and lowess commands provide local regression 
curves that are especially useful when overlain on a twoway scatterplot as in 
figure 2.9. For both commands, a larger bandwidth leads to a smoother 
curve. Both commands have a default plugin formula for determining the 
bandwidth, one that is printed on the resulting graph and stored as the scalar 
r(bwidth). The user may want to change this value using the bwidth() 
option to obtain curves that are more or less smooth. For the lpoly 
command, different kernel functions can be specified using the kernel() 
option. The choice of kernel is of secondary importance, but note that a 
given bandwidth size, such as bwidth(50), corresponds to different degrees 
of smoothness for different kernel functions, as already discussed for kernel 
density estimation in section 2.6.4. 


The local constant and local linear plots from the lpoly command can 
also be obtained using the npgraph command following the npregress 
kernel command. This is illustrated in section 14.6. The npregress kernel 
command is a richer command that is much more than a graphics command. 
It extends local constant and local linear regression to the case of multiple 
regressors and can be used, for example, to compute marginal effects for 
individual regressors in a model more flexible than the linear regression 
model; see chapter 27. 


2.6.7 Multiple scatterplots 


The graph matrix command provides separate bivariate scatterplots between 
several variables. Here we produce bivariate scatterplots (shown in 
figure 2.10) of lnearns, hours, and age for each of the four education 
categories: 


. * Multiple scatterplots 
. label variable age "Age" 


. label variable lnearns "Log earnings" 
. label variable hours "Annual hours" 


. graph matrix lnearns hours age, by(edcat) msize(tiny) 


[Figure: four scatterplot-matrix panels, numbered 1 to 4 (one per education category), each plotting Log earnings, Annual hours, and Age against one another. Graphs by Recode of education (completed education).] 


Figure 2.10. Multiple scatterplots of several variables for each 
level of education 


Stata does not provide three-dimensional graphs, such as that for a 
nonparametric bivariate density estimate or for nonparametric regression of 
one variable on two other variables. 


2.7 Additional resources 


The key data management references are [U] Stata User's Guide and [D] 
Stata Data Management Reference Manual. Useful online help categories 
include 1) double, string, and format for data types; 2) clear, use, 
import, infile, and outsheet for data input; 3) summarize, list, label, 
tabulate, generate, egen, keep, drop, recode, by, sort, merge, append, 
and collapse for data management; and 4) graph, graph box, histogram, 
kdensity, twoway, lpoly, lowess, and graph matrix for graphical analysis. 


The Stata graphics commands are quite flexible and Stata provides both 
an interactive Graph Editor and a Graph Recorder; see [G-1] Graph Editor. 
A Visual Guide to Stata Graphics by Mitchell (2022) provides many 
hundreds of template graphs with the underlying Stata code and an 
explanation for each. 


2.8 Exercises 


1. Type the command display %10.5f 123.321. Compare the results with 
those you obtain when you change the format %10.5f to, respectively, 
%10.5e, %10.5g, %-10.5f, and %10,5f and when you do not specify a 
format. 

2. Consider the example of section 2.3 except with the variables 
reordered. Specifically, the variables are in the order age, name, 
income, and female. The three observations are 29 "Barry" 40.990 
0; 30 "Carrie" 37.000 1; and 31 "Gary" 48.000 0. Use input to 
read these data, along with names, into Stata, and list the results. Use a 
text editor to create a comma-separated values file that includes 
variable names in the first line, read this file into Stata by using import 
delimited, and list the results. Then, drop the first line in the text file, 
read in the data by using import delimited with variable names 
assigned, and list the results. Finally, replace the commas in the text 
file with blanks, read the data in by using infix, and list the results. 

3. Consider the dataset in section 2.4. The er32049 variable is the last- 
known marital status. Rename this variable as marstatus, give the 
variable the label “marital status”, and tabulate marstatus. From the 
codebook, marital status is married (1), never married (2), widowed 
(3), divorced or annulment (4), separated (5), not answered or do not 
know (8), and no marital history collected (9). Set marstatus to 
missing where appropriate. Use label define and label values to 
provide descriptions for the remaining categories, and tabulate 
marstatus. Create a binary indicator variable equal to 1 if the last- 
known marital status is married and equal to 0 otherwise, with 
appropriate handling of any missing data. Provide a summary of 
earnings by marital status. Create a set of indicator variables for 
marital status based on marstatus. Create a set of variables that 
interact these marital status indicators with earnings. 

4. Consider the dataset in section 2.6. Create a box-and-whisker plot of 
earnings (in levels) for all the data and for each year of educational 
attainment (use variable education). Create a histogram of earnings 
(in levels) using 100 bins and a kernel density estimate. Do earnings in 
levels appear to be right skewed? Create a scatterplot of earnings 
against education. Provide a single figure that uses scatterplot, 
lfit, and lowess of earnings against education. Add titles for the 
axes and graph heading. 

5. Consider the dataset in section 2.6. Create kernel density plots for 
lnearns using the kernel(epan2) option with kernel 
$K(z) = (3/4)(1 - z^2)$ for $|z| < 1$ and using the 
kernel(rectangle) option with kernel $K(z) = 1/2$ for $|z| < 1$. 
Repeat with the bandwidth increased from the default to 0.3. What 
makes a bigger difference, choice of kernel or choice of bandwidth? 
The comparison is easier if the four graphs are saved using the 
saving() option and then combined using the graph combine 
command. 

6. Consider the dataset in section 2.6. For each of the available kernels 
that can be used with the kdensity command, obtain a kernel density 
plot for lnearns using the default bandwidth, and save the graph using 
the saving() option. Then, combine all graphs on one page using the 
graph combine command and options such as rows(4) ysize(8) 
xsize(5). Comment on the relative smoothness of the various graphs. 

7. Consider the dataset in section 2.6. Perform lowess regression of 
lnearns on hours using the default bandwidth and using bandwidth of 
0.01. Does the bandwidth make a difference? A moving average of y 
after data are sorted by x is a simple case of nonparametric regression 
of y on x. Sort the data by hours. Create a centered 25-period moving 
average of lnearns with ith observation $yma_i = (1/25) \sum_{j=-12}^{12} y_{i+j}$. 
This is easiest using forvalues. Plot this moving average against 
hours using the twoway connected graph command. Compare with the 
lowess plot. 


Chapter 3 
Linear regression basics 


3.1 Introduction 


Linear regression analysis is often the starting point of an empirical 
investigation. Because of its relative simplicity, it is useful for illustrating 
the different steps of a typical modeling cycle that involves an initial 
specification of the model followed by estimation, diagnostic checks, and 
model respecification. The purpose of such a linear regression analysis may 
be to summarize the data, generate conditional predictions, or test and 
evaluate the role of specific regressors. We will illustrate these aspects 
using a specific data example. 


This chapter is limited to basic linear regression analysis on cross- 
sectional data of a continuous dependent variable. The setup is for a single 
equation and exogenous regressors. Some standard complications of linear 
regression, such as misspecification of the conditional mean and model 
errors that are heteroskedastic, will be considered. In particular, we model 
the natural logarithm of medical expenditures instead of the level. We will 
ignore other various aspects of the data that can lead to more sophisticated 
nonlinear models presented in later chapters. 


3.2 Data and data summary 


The first step is to decide what dataset will be used. In turn, this decision 
depends on the population of interest and the research question itself. We 
discussed how to convert a raw dataset to a form amenable to regression 
analysis in section 2.4. In this section, we present ways to summarize and 
gain some understanding of the data, a necessary step before any regression 
analysis. 


3.2.1 Data description 


We analyze medical expenditures in 2003 of individuals 65 years and older 
who qualify for healthcare under the U.S. Medicare program. The original 
data source is the Medical Expenditure Panel Survey. 


Medicare does not cover all medical expenses. For example, copayments 
for medical services and expenses of prescribed pharmaceutical drugs were 
not covered for the time period studied here. About half of eligible 
individuals therefore purchase supplementary insurance in the private market 
that provides insurance coverage against various out-of-pocket expenses. 


In this chapter, we consider the impact of this supplementary insurance 
on total annual medical expenditures of an individual, measured in dollars. A 
formal investigation must control for the influence of other factors that also 
determine individual medical expenditure, notably, sociodemographic 
factors such as age, gender, education and income, geographical location, 
and health-status measures such as self-assessed health and presence of 
chronic or limiting conditions. In this chapter, as in other chapters, we 
instead deliberately use a short list of regressors. This permits shorter output 
and simpler discussion of the results, an advantage because our intention is 
to simply explain the methods and tools available in Stata. 


3.2.2 Variable description 


Given the Stata dataset for analysis, we begin by using the describe 
command to list various features of the variables to be used in the linear 
regression. The command without a variable list describes all the variables in 
the dataset. Here we restrict attention to the variables used in this chapter. 


. * Variable description for medical expenditure dataset 
. use mus203mepsmedexp 
(A.C.Cameron & P.K.Trivedi (2022): Microeconometrics Using Stata, 2e) 


. describe totexp ltotexp posexp suppins phylim actlim totchr age female income 


Variable      Storage   Display    Value 
    name         type    format    label      Variable label 
totexp          double  %12.0g                Total medical expenditure 
ltotexp         float   %9.0g                 ln(totexp) if totexp > 0 
posexp          float   %9.0g      posexp     Total expenditure > 0 
suppins         float   %9.0g      suppins    Has supp priv insurance 
phylim          double  %12.0g     phylim     Has functional limitation 
actlim          double  %12.0g     actlim     Has activity limitation 
totchr          double  %12.0g                # of chronic problems 
age             double  %12.0g                Age 
female          double  %12.0g     female     Female 
income          double  %12.0g                Annual household income/1000 


The variable types and format columns indicate that all the data are numeric. 
In this case, some variables are stored in single precision (float) and some 
in double precision (double). From the variable labels, we expect totexp to 
be nonnegative; ltotexp to be missing if totexp equals 0; posexp, suppins, 
phylim, actlim, and female to be 0 or 1; totchr to be a nonnegative 
integer; age to be positive; and income to be nonnegative or positive. Note 
that the integer variables could have been stored much more compactly as 
integer or byte. The variable labels provide a short description that is 
helpful but may not fully describe the variable. For example, the key 
regressor suppins was created by aggregating across several types of private 
supplementary insurance. 
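

Storage types can be reduced automatically with the compress command. The following is a minimal sketch, assuming the auto dataset shipped with Stata rather than the expenditure data; the variables flag and reps are hypothetical copies created only to show the change in storage type.

```stata
* Sketch: compress stores integer-valued doubles in the smallest adequate type
* Assumption: the auto dataset shipped with Stata stands in for the book's data
sysuse auto, clear
generate double flag = foreign        // 0/1 values held as an 8-byte double
generate double reps = rep78          // small integers, some missing
describe flag reps

* compress demotes each variable without changing any stored value
compress flag reps
describe flag reps
```

After compress, describe should report both variables with storage type byte, because all their nonmissing values are small integers.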


3.2.3 Summary statistics 


It is essential in any data analysis to first check the data by using the 
summarize command. 


. * Summary statistics for medical expenditure dataset 
. summarize totexp ltotexp posexp suppins phylim actlim totchr age female income 


    Variable         Obs        Mean    Std. dev.        Min        Max 
      totexp       3,064    7030.889    11852.75           0     125610 
     ltotexp       2,955    8.059866    1.367592    1.098612   11.74094 
      posexp       3,064    .9644256    .1852568           0          1 
     suppins       3,064    .5812663    .4934321           0          1 
      phylim       3,064    .4255875    .4945125           0          1 
      actlim       3,064    .2836162    .4508263           0          1 
      totchr       3,064    1.754243    1.307197           0          7 
         age       3,064    74.17167    6.372938          65         90 
      female       3,064    .5796345    .4936982           0          1 
      income       3,064    22.47472    22.53491          -1     312.46 


On average, 96% of individuals incur medical expenditures during a 
year; 58% have supplementary insurance; 43% have functional limitations; 
28% have activity limitations; and 58% are female. The elderly 
population is disproportionately female because of the greater longevity of 
women. The only variable to have missing data is ltotexp, the natural 
logarithm of totexp, which is missing for the 3,064 - 2,955 = 109 
observations with totexp = 0. 
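

The link between zero values and missing logarithms can be illustrated on a toy variable. The following is a minimal sketch on five made-up observations, not the expenditure data; the names spend and lnspend are hypothetical.

```stata
* Sketch: the log of a zero value is missing, so Obs is smaller for the logged variable
* Assumption: a five-observation toy variable, not the book's dataset
clear
set obs 5
generate double spend = (_n - 1)*100          // 0, 100, 200, 300, 400
generate double lnspend = ln(spend) if spend > 0

* summarize reports 5 observations for spend but only 4 for lnspend
summarize spend lnspend
count if missing(lnspend)
```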


All variables have the expected range, except that the minimum of income 
is negative. To see how many observations on income are negative, we use 
the tabulate command, restricting attention to nonpositive observations to 
limit output. 


. * Tabulate variable 
. tabulate income if income <= 0 

     Annual 
  household 
income/1000        Freq.     Percent        Cum. 
         -1            1        1.14        1.14 
          0           87       98.86      100.00 
      Total           88      100.00 


Only one observation is negative, and negative income is possible for 
income from self-employment or investment. We include the observation in 
the analysis here, though checking the original data source may be 
warranted. 


Much of the subsequent regression analysis will drop the 109 
observations with 0 medical expenditures, so in a research article, it would 
be best to report summary statistics without these observations. 


3.2.4 More detailed summary statistics 


Additional descriptive analysis of key variables, especially the dependent 
variable, is useful. For totexp, the level of medical expenditures, 
summarize, detail yields 


. * Detailed summary statistics of a single variable 
. summarize totexp, detail 

                  Total medical expenditure 

      Percentiles      Smallest 
 1%            0              0 
 5%          112              0 
10%          393              0       Obs               3,064 
25%         1271              0       Sum of wgt.       3,064 

50%       3134.5                      Mean           7030.889 
                        Largest       Std. dev.      11852.75 
75%         7151         104823 
90%        17050         108256       Variance       1.40e+08 
95%        27367         123611       Skewness       4.165058 
99%        62346         125610       Kurtosis       26.26796 

Medical expenditures vary greatly across individuals, with a standard 
deviation of 11,853, which is almost twice the mean. The median of 3,135 is 
much smaller than the mean of 7,031, reflecting the skewness of the data. 
For variable x, the skewness statistic is a scale-free measure of skewness 
that estimates $E[\{(x - \mu)/\sigma\}^3] = E\{(x - \mu)^3\}/\sigma^3$, the third central 
moment standardized by the cube of the standard deviation. The skewness is 
zero for symmetrically distributed data. The value here of 4.17 indicates 
considerable right skewness. The kurtosis statistic is an estimate of 
$E[\{(x - \mu)/\sigma\}^4] = E\{(x - \mu)^4\}/\sigma^4$, the fourth central moment 
standardized by the fourth power of the standard deviation. The reference 
value is 3, the value for normally distributed data. The much higher value 
here of 26.27 indicates that the tails are much thicker than those of a normal 
distribution. You can obtain additional summary statistics by using the 
centile command to obtain other percentiles and by using the table 
command, which is explained in section 3.2.6. 
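

The skewness and kurtosis statistics can be reproduced from sample central moments. The following is a minimal sketch, assuming the auto dataset shipped with Stata in place of the expenditure data and assuming that the sample moments use divisor N; the names dev, dev2 to dev4, mom2 to mom4, and the scalars are hypothetical.

```stata
* Sketch: reproduce the skewness and kurtosis reported by summarize, detail
* Assumption: the auto dataset shipped with Stata stands in for the book's data
sysuse auto, clear
quietly summarize price, detail
scalar skew_stata = r(skewness)
scalar kurt_stata = r(kurtosis)
scalar xbar = r(mean)

* Central moments m_r = (1/N) * sum of (x - xbar)^r for r = 2, 3, 4
generate double dev = price - scalar(xbar)
forvalues r = 2/4 {
    generate double dev`r' = dev^`r'
    quietly summarize dev`r', meanonly
    scalar mom`r' = r(mean)
}

* Standardize the third and fourth moments by powers of the second moment
scalar skew_manual = scalar(mom3)/scalar(mom2)^1.5
scalar kurt_manual = scalar(mom4)/scalar(mom2)^2
display "skewness: manual = " scalar(skew_manual) "  summarize = " scalar(skew_stata)
display "kurtosis: manual = " scalar(kurt_manual) "  summarize = " scalar(kurt_stata)
```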


We conclude that the distribution of the dependent variable is 
considerably skewed and has thick tails. These complications often arise for 
commonly studied individual-level economic variables such as expenditures, 
income, earnings, wages, and house prices. It is possible that including 
regressors will eliminate the skewness, but in practice, much of the variation 
in the data will be left unexplained ($R^2 < 0.3$ is common for 
individual-level data), and skewness and excess kurtosis will remain. 


Such skewed, thick-tailed data suggest a model with multiplicative errors 
instead of additive errors. A standard solution is to transform the dependent 
variable by taking the natural logarithm. Here this is complicated by the 
presence of 109 zero-valued observations. We take the expedient approach of 
dropping the zero observations from analysis in either logs or levels. This 
should make little difference here because only 3.6% of the sample is then 
dropped. A better approach, using two-part or selection models, is covered in 
sections 19.5-19.7. 


The output for tabstat in section 3.2.6 reveals that taking the natural 
logarithm for these data essentially eliminates the skewness and excess 
kurtosis. 
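

The effect of the log transformation on skewness and kurtosis can be illustrated with simulated data. The following is a minimal sketch using made-up lognormal draws, not the expenditure data; the seed, sample size, and distribution parameters are arbitrary, and the names lny, y, and the scalars are hypothetical.

```stata
* Sketch: a log transformation removes skewness from simulated lognormal data
* Assumption: simulated data stand in for the expenditure variable
clear
set seed 10101
set obs 10000
generate double lny = rnormal(8, 1.4)     // normally distributed in logs
generate double y = exp(lny)              // right skewed in levels

quietly summarize y, detail
scalar skew_level = r(skewness)
scalar kurt_level = r(kurtosis)
quietly summarize lny, detail
scalar skew_log = r(skewness)
scalar kurt_log = r(kurtosis)

display "levels: skewness = " scalar(skew_level) "  kurtosis = " scalar(kurt_level)
display "logs:   skewness = " scalar(skew_log) "  kurtosis = " scalar(kurt_log)
```

For data that are exactly lognormal, the logged variable has skewness near 0 and kurtosis near 3; real expenditure data will match this only approximately.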


The community-contributed fsum command (Wolfe 2002) is an 
enhancement of summarize that enables formatting the output and including 
additional information such as percentiles and variable labels. The 
community-contributed outsum command (Papps 2006) produces a text file 
of means and standard deviations for one or more subsets of the data, for 
example, one column for the full sample, one for a male subsample, and one 
for a female subsample. 


3.2.5 Tables of frequencies 


One-way tables can be created by using the tabulate command, presented 
in section 3.2.3, the table command, and the tabstat command. Two-way 
tables can also be created by using these commands. 


For two-way tables of frequencies, only table produces clean output. 
For example, 


. * Two-way table of frequencies 
. table female totchr 

                      # of chronic problems 
             0     1     2     3     4     5     6     7   Total 
Female 
  No       239   415   323   201    82    23     4     1   1,288 
  Yes      313   466   493   305   140    46    11     2   1,776 
  Total    552   881   816   506   222    69    15     3   3,064 


provides frequencies for a two-way tabulation of gender against the number 
of chronic conditions. The option stat(percent) provides percentages 
rather than frequencies. 


The tabulate command can provide both row and column percentages. 
For example, 


. * Two-way table with row and column percentages and Pearson chi-squared 
. tabulate female suppins, row col chi2 

Key 
  frequency 
  row percentage 
  column percentage 

               Has supp priv 
                 insurance 
    Female         No        Yes      Total 
        No        488        800      1,288 
                37.89      62.11     100.00 
                38.04      44.92      42.04 
       Yes        795        981      1,776 
                44.76      55.24     100.00 
                61.96      55.08      57.96 
     Total      1,283      1,781      3,064 
                41.87      58.13     100.00 
               100.00     100.00     100.00 

          Pearson chi2(1) =  14.4991   Pr = 0.000 


Comparing the row percentages for this sample, we see that while a woman 
is more likely to have supplemental insurance than not, the probability that a 
woman in this sample has purchased supplemental insurance is lower than 
the probability that a man in this sample has purchased supplemental 
insurance. Although we do not have the information to draw these inferences 
for the population, the results for Pearson’s chi-squared test soundly reject 
the null hypothesis that these variables are independent. Other tests of 
association are available. The related command tab2 will produce all 
possible two-way tables that can be obtained from a list of several variables. 
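

The Pearson statistic reported above can be reproduced from the four cell counts alone. The following is a minimal sketch: the counts are copied from the tabulate output, the immediate command tabi is used so that no dataset is needed, and the scalar names are hypothetical.

```stata
* Sketch: Pearson chi-squared for the 2 x 2 table of female by suppins
* Cell counts are copied from the tabulate output; tabi works from counts alone
tabi 488 800 \ 795 981, chi2
scalar chi2_tabi = r(chi2)

* Same statistic from the 2 x 2 shortcut N(ad - bc)^2/(r1*r2*c1*c2)
scalar chi2_manual = 3064*(488*981 - 800*795)^2/(1288*1776*1283*1781)
display "chi2 from tabi    = " scalar(chi2_tabi)
display "chi2 from formula = " scalar(chi2_manual)

* p-value from the chi-squared distribution with 1 degree of freedom
display "p-value = " chi2tail(1, scalar(chi2_manual))
```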


For multiway tables, it is best to use table. For the example at hand, we 
have 


. * Three-way table of frequencies 
. table female suppins totchr, nototals 

                                      # of chronic problems 
                                0     1     2     3     4     5     6     7 
Female 
  No 
    Has supp priv insurance 
      No                      102   165   121    68    25     6     1 
      Yes                     137   250   202   133    57    17     3     1 
  Yes 
    Has supp priv insurance 
      No                      135   212   233   134    56    22     1     2 
      Yes                     178   254   260   171    84    24    10 

An alternative is to use tabulate with the by prefix, but the results are not as 
neat as those from table. 


3.2.6 Tables of summary statistics 


The preceding tabulations will produce voluminous output if one of the 
variables being tabulated takes on many values. Then it is much better to use 
command table with the statistics() option to present tables that give 
key summary statistics for that variable, such as the mean and standard 
deviation. Note that the statistics() option, abbreviated stat(), was 
introduced in Stata 17 and replaces the contents() option available in 
earlier versions of Stata. Such tabulations can be useful even when variables 
take on few values. For example, when summarizing the number of chronic 
problems by gender, table yields 


. * One-way table of summary statistics 
. table (result) female, stat(count totchr) stat(mean totchr) stat(sd totchr) 
> stat(p50 totchr) 


                                          Female
                                    No        Yes      Total
Number of nonmissing values      1,288      1,776      3,064
Mean                          1.659938   1.822635   1.754243
Standard deviation            1.261175   1.335776   1.307197
50th percentile                      1          2          2


Women on average have more chronic problems (1.82 versus 1.66 for men). 
The option stat() can produce many other statistics, including the 
minimum, maximum, and key percentiles. 


The table command with the stat() options can additionally produce 
two-way and multiway tables of summary statistics. As an example, 


. * Two-way table of summary statistics 
. table female suppins, stat(count totchr) stat(mean totchr) nototals 


                                  Has supp priv insurance
                                        No        Yes
Female
  No
    Number of nonmissing values        488        800
    Mean                          1.530738    1.73875
  Yes
    Number of nonmissing values        795        981
    Mean                          1.803774    1.83792


shows that those with supplementary insurance on average have more 
chronic problems. This is especially so for males (1.74 versus 1.53). 


The tabulate, summarize() command can be used to produce one-way 
and two-way tables with means, standard deviations, and frequencies. This is 
a small subset of the statistics that can be produced using table, so we might 
as well use table. 


The tabstat command provides a table of summary statistics that 
permits more flexibility than summarize. The following output presents 
summary statistics on medical expenditures and the natural logarithm of 
expenditures that are useful in determining skewness and kurtosis. 


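. * Summary statistics obtained using command tabstat 
. tabstat totexp ltotexp, statistics(count mean p50 sd skew kurt) 
> columns(statistics) 


    Variable |         N      Mean       p50        SD  Skewness  Kurtosis
-------------+------------------------------------------------------------
      totexp |      3064  7030.889    3134.5  11852.75  4.165058  26.26796
     ltotexp |      2955  8.059866  8.111928  1.367592 -.3857887  3.842263
--------------------------------------------------------------------------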


This reproduces information given in section 3.2.4 and shows that taking the 
natural logarithm eliminates most skewness and kurtosis: skewness falls from 
4.17 to -0.39, and kurtosis falls from 26.27 to 3.84. The 
columns(statistics) option presents the results with summary statistics 
being given in the columns and each variable being given in a separate row. 
Without this option, we would have summary statistics in rows and variables 
in the columns. A two-way table of summary statistics can be obtained by 
using the by() option. 
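

As a minimal sketch of the by() option, assuming the chapter's data are in memory, the following reports the same statistics separately for men and women. The shorter list of statistics is our choice, made to keep the output compact.

```stata
* Assumes the chapter's data are in memory
* Summary statistics of each variable, reported separately by gender
tabstat totexp ltotexp, statistics(count mean p50 sd) by(female) columns(statistics)
```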


The collect command, introduced in Stata 17, provides great flexibility 
in creating production-quality tables. The command is illustrated in 
section 3.5.7. 


3.2.7 Hypothesis tests on the population mean 


The ttest command can be used to test hypotheses about the population 
mean of a single variable ($H_0: \mu = \mu^*$ for specified value $\mu^*$) and to test the 
equality of means ($H_0: \mu_1 = \mu_2$). For more general analysis of variance and 
analysis of covariance, the oneway and anova commands can be used, and 
several other tests exist for more specialized examples such as testing the 
equality of proportions. 


These commands are rarely used in microeconometrics because they can 
be recast as a special case of regression with an intercept and appropriate 
indicator variables. Furthermore, regression has the advantage of reliance on 
less restrictive distributional assumptions, provided samples are large 
enough for asymptotic theory to provide a good approximation. 
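

The equivalence can be sketched as follows, assuming the chapter's data are in memory. The choice of totchr as the outcome is ours and is made only for illustration.

```stata
* Assumes the chapter's data are in memory
* Two-sample t test of equal mean number of chronic problems by gender
ttest totchr, by(female)

* Same test from OLS on an intercept and the female indicator; with default
* standard errors the t statistic on female matches in absolute value
regress totchr female

* Version that does not impose equal variances across the two groups
regress totchr female, vce(robust)
```

The sign of the t statistic differs between the first two commands because ttest reports the mean for men minus the mean for women, whereas the coefficient on female is the mean for women minus the mean for men.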


For examples of the ttest command and comparison with tests based on 
OLS estimation, see section 3.5.12. 


3.2.8 Data plots 


It is useful to plot a smoothed histogram or a density estimate of the 
dependent variable. Here we use the kdensity command, which provides a 
kernel estimate of the density. 


The data are highly skewed, with a 97th percentile of approximately 
$40,000 and a maximum of $125,000. The kdensity totexp command will 
therefore bunch 97% of the density in the first 30% of the x axis. One 
possibility is to type kdensity totexp if totexp < 40000, but this produces 
a kernel density estimate assuming the data are truncated at $40,000. Instead, 
we use command kdensity totexp, we save the evaluation points in kx1 and 
the kernel density estimates in kd1, and then we line-plot kd1 against kx1. 


We do this for both the level and the natural logarithm of medical 
expenditures, and we use graph combine to produce a figure that includes 
both density graphs (shown in figure 3.1). We have 


. * Kernel density plots with adjustment for highly skewed data 
. kdensity totexp if posexp==1, generate(kx1 kd1) n(500) 


. graph twoway (line kd1 kx1) if kx1 < 40000, name(levels, replace) 

. label variable ltotexp "Natural logarithm of expenditure" 

. kdensity ltotexp if posexp==1, generate(kx2 kd2) n(500) 

. graph twoway (line kd2 kx2) if kx2 < ln(40000), name(logs, replace) 
. graph combine levels logs, iscale(1.2) ysize(2.5) xsize(6.0) 


Figure 3.1. Comparison of densities of level and natural logarithm 
of medical expenditures 


Only positive expenditures are considered, and for graph readability, the 
very long right tail of totexp has been truncated at $40,000. In figure 3.1, 
the distribution of totexp is very right skewed, whereas that of ltotexp is 
fairly symmetric. 


3.3 Transformation of data before regression 


When one specifies a linear regression model, the presumption is that the 
specified relationship between the variable of interest y and the regressors x 
is linear, which means that the marginal response of y to a unit change in x 
is constant. 


The preferred model linking y and the regressors, however, may not be 
linear. For example, the relation between total production costs and output 
is usually specified to be nonlinear. In such cases, it is usual to interpret the 
regression as linear after transformation from the original units. 
Transformations to the linear form may involve both y and x, or just one of 
those components. 


The purpose of the transformation is to “straighten out” a relationship. 
Consider some leading examples. Suppose that the relationship takes the 
form $y = \exp(\beta_1 + \beta_2 x + u)$, where $x$ denotes the regressor and $u$ is the 
error term. Then the transformation $\ln y = \beta_1 + \beta_2 x + u$ is a “semilog” or 
“log-linear” regression that relates $\ln y$ to $x$. After transformation, $\beta_2$ 
measures $\partial E(\ln y)/\partial x = (1/y) \times \partial y/\partial x$, which varies inversely with $y$. 


Now consider the multiplicative relationship $y = e^{\beta_1} x^{\beta_2} u$. Taking logs 
on both sides of the equality yields $\ln y = \beta_1 + \beta_2 \ln x + \ln u$, a linear-in-logs 
or log-log regression. In this case, the coefficient $\beta_2$ measures 
$\partial E\{\ln(y)\}/\partial \ln x$, that is, the elasticity of $y$ with respect to $x$ based on the 
constant elasticity model. 


In both the preceding examples, while the dependent variable and, in the 
second example, the regressor have been transformed, the transformed 
models are linear in the parameters. So the transformed models can be fit 
using OLS regression, the subject of this chapter. 


More generally, we may consider a regression such as 
$g(y) = f(x, \beta) + u$, where $g(\cdot)$ and $f(\cdot)$ denote some linearizing 
transformation whose specific form will depend upon the context. Choosing 
the functions that provide the best approximations to the data-generating 
process (DGP) is a part of model specification. Having chosen one, one 
relies on statistical tests to check whether the functional form is such that 
the remaining unexplained variation is roughly random. 


Although the least-squares estimator of the linear regression requires 
only the data, or the error on the regression, to have quite weak 
distributional properties, transformations are often motivated by a 
preference for some particular features. For example, some outcomes such 
as income and expenditure often display a highly skewed distribution. A log 
transformation will typically make the distribution more symmetric and less 
nonnormal. 


Another motivation for transformation is to make the error variance less 
heteroskedastic. For example, in its original form, a regression may display 
dependence between (say) the location parameter $E(y|x)$ and scale 
parameter $\mathrm{Var}(y|x)$; a transformation may get rid of such dependence by 
reducing the heteroskedasticity of the error term. A family of power 
transformations, known as Box–Cox transformations, that replaces $y$ by $y^p$ 
is motivated by a similar consideration. A special case is $p = 1/2$, the 
square-root transformation. In a third example, suppose $y$ is positive and we 
want to ensure that fitted values of $y$ remain positive. A log transformation 
ensures this. In the final example, suppose $y$ is a proportion, that is, 
$0 < y < 1$, and again we want the fitted values from the regression to 
preserve this property, whereas the linear regression of $y$ on $x$ will not. The 
logit transformation uses the transformed dependent variable 
$\log\{y/(1 - y)\}$, which satisfies this requirement. This transformation also 
changes the range of values of the dependent variable, producing greater 
symmetry and spread in the tails of the distribution. In some cases, such 
changes make the least-squares estimator more robust. 
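

The transformations just described can be sketched in Stata as follows, assuming the chapter's data are in memory. The new variable names sqrtexp and lnexp are ours, and because the data contain no proportion variable, the logit transformation is shown only for a single illustrative value.

```stata
* Assumes the chapter's data are in memory; new variable names are illustrative
generate double sqrtexp = sqrt(totexp)      // power transformation with p = 1/2
generate double lnexp   = ln(totexp)        // missing when totexp is 0
summarize totexp sqrtexp lnexp, detail      // compare skewness and kurtosis

* Logit transformation of a proportion, shown for the single value 0.25
display logit(0.25)                         // ln{0.25/(1 - 0.25)}
display invlogit(logit(0.25))               // maps back into (0, 1)
```

Because the inverse logit always lies strictly between 0 and 1, fitted values that are mapped back from the transformed scale automatically respect the bounds of a proportion.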


Transformations generally affect the interpretations of regression 
coefficients, and transformations involving the dependent variables will 
also affect measures of goodness of fit such as regression $R^2$. This means 
that regression statistics such as $R^2$ with $g(y)$ as a dependent variable 
cannot be directly compared with those with $y$ as the dependent variable. 
This complicates the comparison of regressions with different 
transformations of the dependent variable. A substantial literature exists on 
the topic of comparison of linear and linear-in-logs regressions; see Godfrey 
and Wickens (1981) and references cited there. 


Finally, even if one chooses to regress $g(y)$ on $h(x)$, one may want to 
interpret the results in terms of the original units of $y$ and $x$. This involves a 
thorny problem of retransformation that is discussed in section 4.2.3. In 
some cases, retransformation can be avoided by directly modeling $y$ using 
methods more advanced than OLS regression. In particular, we can use 
Poisson regression (poisson command) in place of the log-linear model and 
use logit regression (logit command) for proportions data. 


Economic theory rarely suggests a specific parametric form of a 
regression model, thereby leaving room for empirical explorations. 
Nonparametric regressions (see section 14.6) are less restrictive in this 
respect. 


3.4 Linear regression 


We present the linear regression model, first in levels and then for a 
transformed dependent variable, here in logs. 


3.4.1 Basic regression theory 


We begin by introducing terminology used throughout the rest of this book. 
Let $\boldsymbol{\theta}$ denote the vector of parameters to be estimated, and let $\widehat{\boldsymbol{\theta}}$ denote an 
estimator of $\boldsymbol{\theta}$. Ideally, the distribution of $\widehat{\boldsymbol{\theta}}$ is centered on $\boldsymbol{\theta}$ with small 
variance, for precision, and a known distribution, to permit statistical 
inference. We restrict analysis to estimators that are consistent for $\boldsymbol{\theta}$, 
meaning that in infinitely large samples, $\widehat{\boldsymbol{\theta}}$ equals $\boldsymbol{\theta}$ aside from negligible 
random variation. This is denoted by $\widehat{\boldsymbol{\theta}} \xrightarrow{p} \boldsymbol{\theta}$ or, more formally, by $\widehat{\boldsymbol{\theta}} \xrightarrow{p} \boldsymbol{\theta}_0$, 
where $\boldsymbol{\theta}_0$ denotes the unknown “true” parameter value. A necessary 
condition for consistency is correct model specification or, in some leading 
cases, correct specification of key components of the model, most notably 
the conditional mean. 


Under additional assumptions, most of the estimators considered in this 
book are asymptotically normally distributed, meaning that their 
distribution is well approximated by the multivariate normal in large 
samples. This is denoted by 


$$\widehat{\boldsymbol{\theta}} \stackrel{a}{\sim} N\{\boldsymbol{\theta}, \mathrm{Var}(\widehat{\boldsymbol{\theta}})\}$$


where $\mathrm{Var}(\widehat{\boldsymbol{\theta}})$ denotes the (asymptotic) variance–covariance matrix of the 
estimator (VCE). More efficient estimators have smaller VCEs. The VCE 
depends on unknown parameters, so we use an estimate of the VCE, denoted 
by $\widehat{V}(\widehat{\boldsymbol{\theta}})$. Standard errors of the parameter estimates are obtained as the 
square root of diagonal entries in $\widehat{V}(\widehat{\boldsymbol{\theta}})$. Different assumptions about the 
DGP, such as heteroskedasticity, can lead to different estimates of the VCE. 


Test statistics based on asymptotic normal results lead to the use of the 
standard normal distribution and chi-squared distribution to compute 
critical values and p-values. For some estimators, notably, the OLS estimator, 
tests are instead based on the t distribution and the F distribution. This 
makes essentially no difference in large samples with, say, degrees of 
freedom greater than 100, but in practice it provides a better approximation 
especially for cluster–robust inference with few clusters; see section 3.4.6. 


3.4.2 OLS regression and matrix algebra 


The goal of linear regression is to estimate the parameters of the linear 
conditional mean 


$$E(y|\mathbf{x}) = \mathbf{x}'\boldsymbol{\beta} = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K \tag{3.1}$$


where usually an intercept is included so that $x_1 = 1$. Here $\mathbf{x}$ is a $K \times 1$ 
column vector with the $j$th entry being the $j$th regressor $x_j$, and $\boldsymbol{\beta}$ is a $K \times 1$ 
column vector with the $j$th entry $\beta_j$. 


Sometimes, $E(y|\mathbf{x})$ is of direct interest for prediction. More often, 
however, econometrics studies are interested in one or more of the 
associated marginal effects (MEs), 


$$\frac{\partial E(y|\mathbf{x})}{\partial x_j} = \beta_j$$


for the $j$th regressor. For example, we are interested in the MEs of 
supplementary private health insurance on medical expenditures. An 
attraction of the linear model is that estimated MEs are given directly by 
estimates of the slope coefficients. 


The linear regression model specifies an additive (often specified to be 
independent and identically distributed) error so that, for the typical $i$th 
observation, 


$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i, \quad i = 1, \ldots, N$$


The OLS estimator minimizes the sum of squared errors, $\sum_{i=1}^{N} (y_i - \mathbf{x}_i'\boldsymbol{\beta})^2$. 


Matrix notation provides a compact way to represent the estimator and 
variance matrix formulas that involve sums of products and cross products. 
We define the $N \times 1$ column vector $\mathbf{y}$ to have the $i$th entry $y_i$, and we 
define the $N \times K$ regressor matrix $\mathbf{X}$ to have the $i$th row $\mathbf{x}_i'$. Then the OLS 
estimator can be written in several ways, including 


$$\widehat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \left(\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\sum_{i=1}^{N}\mathbf{x}_i y_i$$


We define all vectors as column vectors, with a transpose if row vectors 
are desired. By contrast, Stata commands and Mata commands define 
vectors as row vectors, so in parts of Stata and Mata code, we need to take a 
transpose to conform to the notation in the book. 


3.4.3 Properties of the OLS estimator 


The properties of any estimator vary with the assumptions made about the 
DGP. For the linear regression model, this reduces to assumptions about the 
regression error $u_i$. 


The starting point for analysis is to assume that $u_i$ satisfies the 
following classical conditions: 


1. $E(u_i|\mathbf{x}_i) = 0$ (exogeneity of regressors) 
2. $E(u_i^2|\mathbf{x}_i) = \sigma^2$ (conditional homoskedasticity) 
3. $E(u_i u_j|\mathbf{x}_i, \mathbf{x}_j) = 0$, $i \neq j$ (conditionally uncorrelated observations) 


Assumption 1 is essential for consistent estimation of $\boldsymbol{\beta}$ and implies that 
the conditional mean given in (3.1) is correctly specified. This means that 
the conditional mean is linear and that all relevant variables have been 
included in the regression. Assumption 1 is relaxed in chapter 7. 
Assumptions 2 and 3 determine the form of the VCE of $\widehat{\boldsymbol{\beta}}$. 


3.4.4 Default standard errors 


Assumptions 1–3 lead to $\widehat{\boldsymbol{\beta}}$ being asymptotically normally distributed with 
the default estimator of the VCE 


$$\widehat{V}_{\text{default}}(\widehat{\boldsymbol{\beta}}) = s^2(\mathbf{X}'\mathbf{X})^{-1}$$


where 


$$s^2 = (N-K)^{-1}\sum_{i=1}^{N}\widehat{u}_i^2 \tag{3.2}$$


and $\widehat{u}_i = y_i - \mathbf{x}_i'\widehat{\boldsymbol{\beta}}$. Under assumptions 1–3, the OLS estimator is fully 
efficient. If, additionally, $u_i$ is normally distributed, then “t statistics” are 
exactly t distributed. This fourth assumption is not made, but it is common 
to continue to use the t distribution in the hope that it provides a better 
approximation than the standard normal in finite samples. 
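

The matrix formulas for the OLS estimator and its default VCE can be checked directly in Mata. The following is a minimal sketch, assuming the chapter's data are in memory; the short regressor list and the Mata names y, X, b, u, s2, and V are our own choices for illustration.

```stata
* Assumes the chapter's data are in memory; regressor list chosen for illustration
preserve
keep if !missing(ltotexp, suppins, totchr, age)
mata:
    y  = st_data(., "ltotexp")
    X  = st_data(., ("suppins", "totchr", "age")), J(st_nobs(), 1, 1)
    N  = rows(X)
    K  = cols(X)
    b  = invsym(cross(X, X)) * cross(X, y)    // (X'X)^{-1} X'y
    u  = y - X*b                              // OLS residuals
    s2 = cross(u, u) / (N - K)                // s^2 as defined in (3.2)
    V  = s2 * invsym(cross(X, X))             // default estimate of the VCE
    (b, sqrt(diagonal(V)))                    // coefficients and standard errors
end
restore

* The same coefficients and default standard errors from regress
regress ltotexp suppins totchr age
```

Here b is a column vector, in line with the notation of the book, whereas regress stores the coefficients in e(b) as a row vector. The intercept column is placed last in X so that the ordering matches the regress output.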


When assumptions 2 and 3 are relaxed, OLS is no longer fully efficient. 
In chapter 6, we present examples of more efficient, feasible generalized 
least-squares estimation. In the current chapter, we continue to use the OLS 
estimator, as is often done in practice, but we use alternative estimates of 
the VCE that are valid when assumption 2, assumption 3, or both are relaxed, 
provided the sample size is sufficiently large for the relevant asymptotic 
theory to provide a good approximation. 


3.4.5 Heteroskedasticity-robust standard errors 


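Given assumptions 1 and 3, but not 2, we have heteroskedastic uncorrelated 
errors. Then a robust estimator, or more precisely a heteroskedasticity-robust 
estimator, of the VCE of the OLS estimator is 


$$\widehat{V}_{\text{robust}}(\widehat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-1}\left(\frac{N}{N-K}\sum_{i=1}^{N}\widehat{u}_i^2\,\mathbf{x}_i\mathbf{x}_i'\right)(\mathbf{X}'\mathbf{X})^{-1} \tag{3.3}$$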


For cross-sectional data that are independent, this estimator, introduced by 
White (1980), has supplanted the default variance matrix estimate in most 
applied work because heteroskedasticity is the norm, and in that case, the 
default estimate of the VCE is incorrect. 


In Stata, a robust estimate of the VCE is obtained by using the 
vce(robust) option of the regress command, as illustrated in 
section 3.5.2. Related options are vce(hc2) and vce(hc3), which may 
provide better heteroskedasticity-robust estimates of the VCE when the 
sample size is small; see [R] regress. The robust estimator of the VCE has 
been extended to other estimators and models, and a feature of Stata is the 
vce(robust) option, which is applicable for many estimation commands. 
Some community-contributed commands use robust in place of 
vce(robust). 
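

As a minimal sketch, assuming the chapter's data are in memory, the following compares default, robust, and HC3 standard errors side by side. The short regressor list and the stored-estimate names are our own choices for illustration.

```stata
* Assumes the chapter's data are in memory; names ols_* are illustrative
quietly regress ltotexp suppins totchr age female
estimates store ols_default

quietly regress ltotexp suppins totchr age female, vce(robust)
estimates store ols_robust

quietly regress ltotexp suppins totchr age female, vce(hc3)
estimates store ols_hc3

* Coefficients are identical across columns; only the standard errors differ
estimates table ols_default ols_robust ols_hc3, b(%9.4f) se(%9.4f)
```

With a sample as large as the one used in this chapter, we would expect the robust and HC3 standard errors to be close to each other; the distinction matters more in small samples.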


3.4.6 Cluster–robust standard errors 


When errors for different observations are correlated, assumption 3 is 
violated. Then both default and heteroskedasticity-robust estimates of the VCE 
are invalid, and different ways in which error correlation may arise lead to 
different robust estimates of the VCE. Various robust estimates of the VCE are 
presented in section 13.4. 


For cross-sectional data, the most common violation of assumption 3 is 
that errors are clustered. Clustered or grouped errors are errors that are 
correlated within a cluster or group and are uncorrelated across clusters. A 
simple example of clustering arises when sampling is of independent units 
but errors for individuals within the unit are correlated. For example, 100 
independent villages may be sampled, with several people from each village 
surveyed. Then, if a regression model overpredicts y for one village 
member, it is likely to overpredict for other members of the same village, 
indicating positive correlation. Similar comments apply when sampling is 
of households with several individuals in each household. Another leading 
example is panel data with independence over individuals but with 
correlation over time for a given individual. 


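Given assumption 1, but not 2 or 3, a cluster–robust estimator of the VCE 
of the OLS estimator is 


$$\widehat{V}_{\text{cluster}}(\widehat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-1}\left(\frac{G}{G-1}\,\frac{N-1}{N-K}\sum_{g=1}^{G}\mathbf{X}_g'\widehat{\mathbf{u}}_g\widehat{\mathbf{u}}_g'\mathbf{X}_g\right)(\mathbf{X}'\mathbf{X})^{-1}$$


where $g = 1, \ldots, G$ denotes the cluster (such as village), $\widehat{\mathbf{u}}_g$ is the vector of 
residuals for the observations in the $g$th cluster, and $\mathbf{X}_g$ is a matrix of the 
regressors for the observations in the $g$th cluster, with one row per 
observation. The key assumptions made 
are error independence across clusters and that the number of clusters 
$G \to \infty$. 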


Cluster–robust standard errors can be computed by using the 
vce(cluster clustvar) option in Stata, where clusters are defined by the 
different values taken by the clustvar variable. The estimate of the VCE is in 
fact heteroskedasticity-robust and cluster–robust because there is no 
restriction on $\mathrm{Cov}(u_{gi}, u_{gj})$. The cluster VCE estimate can be applied to 
many estimators and models; see section 13.4.6. 
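

The chapter's cross-sectional data have no natural cluster variable, so the following sketch instead uses the nlswork example dataset that can be loaded with webuse, a panel in which the same women (identified by idcode) are observed over several years. The model is chosen only to illustrate the option and is not one analyzed in the book.

```stata
* Runnable sketch on Stata's example panel dataset, not the chapter's data
webuse nlswork, clear

* Heteroskedasticity-robust standard errors ignore the repeated observations per woman
regress ln_wage age tenure, vce(robust)

* Cluster-robust standard errors allow arbitrary error correlation within idcode
regress ln_wage age tenure, vce(cluster idcode)
```

The coefficient estimates are the same in both regressions; only the standard errors, and hence the t statistics and confidence intervals, change.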


Cluster–robust standard errors must be used when data are clustered. 
For a scalar regressor $x$, a rule of thumb is that cluster–robust standard 
errors are 


$$\tau \simeq \sqrt{1 + \rho_x\rho_u(M - 1)} \tag{3.4}$$


times the incorrect default standard errors, where $\rho_x$ is the within-cluster 
correlation coefficient of the regressor, $\rho_u$ is the within-cluster correlation 
coefficient of the error, and $M$ is the average cluster size. This rule of 
thumb is a good guide in most settings, but when $x$ is an experimentally 
assigned treatment with values that vary across observations within the 
same cluster, one should use the more general rule of thumb that 
$\tau \simeq \sqrt{1 + \rho_{xu}(M - 1)}$, where $\rho_{xu}$ is the within-cluster correlation of $x_i u_i$. 
Cluster–robust standard errors can be much larger than default or 
heteroskedasticity-robust standard errors. 


It can be necessary to use cluster–robust standard errors even where it is 
not immediately obvious. This is particularly the case when a regressor is 
an aggregated or macrovariable because then $\rho_x = 1$. For example, suppose 
we use data from the U.S. Current Population Survey and regress individual 
earnings on individual characteristics and a state-level regressor that does 
not vary within a state. Then, if there are many individuals in each state so 
$M$ is large, even slight error correlation for individuals in the same state can 
lead to great downward bias in default standard errors and in 
heteroskedasticity-robust standard errors. Clustering can also be induced by 
the design of sample surveys. This topic is pursued in section 6.9. 
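

To see the order of magnitude implied by the rule of thumb (3.4), consider purely illustrative values that are not taken from any dataset: a state-level regressor, so that $\rho_x = 1$; a modest within-state error correlation of $\rho_u = 0.1$; and an average of $M = 81$ individuals per state.

```latex
\tau \simeq \sqrt{1 + \rho_x \rho_u (M - 1)}
     = \sqrt{1 + 1 \times 0.1 \times (81 - 1)}
     = \sqrt{9}
     = 3
```

Under these assumed values, the default standard errors would be roughly one-third of the cluster–robust standard errors, so t statistics based on default standard errors would be overstated by a factor of about 3.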


Statistical inference for OLS based on cluster–robust standard errors uses 
critical values and p-values based on the t distribution with $(G - 1)$ degrees 
of freedom, where $G$ is the number of clusters. When there are few clusters, 
this approximation can lead to considerable underestimation of standard 
errors and associated test p-values and to confidence intervals that are too 
narrow. Better inference for OLS with few clusters is pursued in 
section 6.4.6 and in section 12.6. In particular, see section 12.6 for the 
community-contributed boottest command (Roodman et al. 2019), which 
implements a wild cluster bootstrap that can lead to better finite cluster 
inference. 


Many microeconometric applications use clustered data. Then estimators 
other than OLS are often used, most notably fixed-effects and random-effects 
estimators. For linear models, these methods are presented in 
sections 6.5–6.7 and, for panel data, in chapter 8. For nonlinear models, see 
section 13.9 and, for panel data, see chapter 22. For the recently proposed 
design-based approach to inference, see section 24.4.7. 


3.4.7 Bootstrap standard errors 


An alternative way to compute heteroskedasticity-robust or 
cluster–robust standard errors is to use an appropriate bootstrap. This is a 
widely applicable method for obtaining standard errors and confidence 
intervals for parameters in cases where the asymptotic distribution is either 
not available or is available but is inconvenient to implement. 


Here we present simple bootstraps that yield standard errors that are 
asymptotically equivalent to those obtained using the vce(robust) and 
vce(cluster clustvar) options. A refined bootstrap procedure, if feasible, 
provides an improvement over the usual asymptotic distribution. These 
distinctions are further developed and used in section 12.5. 


The basic idea of the bootstrap is that the sample is used as a 
population, and we then obtain a number of samples from this “population” 
by repeatedly resampling observations with replacement. Such samples are 
referred to as bootstrap samples. This is a substitute for the ideal but 
impractical situation of having multiple independent samples. We then 
obtain the sampling distribution of the parameters of interest by fitting the 
same model to the many bootstrap samples. Moments of the distribution 
can then be computed from the collection of estimates. 


Resampling from a given sample is easiest to understand in the 
independent and identically distributed setting with sample $y_i$, $i = 1, \ldots, N$. 
Suppose that the target parameter is the population mean, denoted $\mu$, and 
the estimator $\widehat{\mu}$ is the sample mean $\bar{y}$. Then we can draw $B$ different 
samples of $N$ observations each by sampling with replacement. Each 
sample generates a sample mean, $\bar{y}_b$, $b = 1, \ldots, B$, so we have $B$ 
independent estimates. Moments of the distribution of $\widehat{\mu}$ can then be 
computed given the empirical distribution of these $B$ estimates. 
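

For the sample mean, this resampling scheme can be sketched with the bootstrap prefix, assuming the chapter's data are in memory. The number of replications and the seed are arbitrary values chosen for illustration.

```stata
* Assumes the chapter's data are in memory; reps() and seed() values are arbitrary
* Resample observations with replacement 400 times and recompute the mean each time
bootstrap mean=r(mean), reps(400) seed(10101): summarize totexp

* For comparison, the usual standard error of the sample mean
mean totexp
```

The bootstrap standard error is the standard deviation of the 400 resampled means, and it should be close to the conventional standard error reported by mean.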


Now consider the linear regression setting with data 
$(y_i, \mathbf{x}_i)$, $i = 1, \ldots, N$, and model errors that are independent but 
heteroskedastic. A bootstrap called a paired bootstrap or nonparametric 
bootstrap obtains bootstrap resamples by sampling $(y_i, \mathbf{x}_i)$ jointly and with 
replacement. Each bootstrap sample of $N$ observations generates an 
estimate of the regression parameters, denoted $\widehat{\boldsymbol{\beta}}_b$, $b = 1, \ldots, B$. 


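The bootstrap estimate of variance of an estimator is the usual formula 
for estimating a variance of (say) $\widehat{\beta}_j$, applied to the $B$ bootstrap replications 


$$s^2_{\widehat{\beta}_j} = \frac{1}{B-1}\sum_{b=1}^{B}\left(\widehat{\beta}_{j,b} - \bar{\widehat{\beta}}_j\right)^2$$


where $\widehat{\beta}_{j,b}$ is the $j$th entry of $\widehat{\boldsymbol{\beta}}_b$ and $\bar{\widehat{\beta}}_j = B^{-1}\sum_{b=1}^{B}\widehat{\beta}_{j,b}$. 
The bootstrap $100(1 - \alpha)$ percent confidence interval for $\beta_j$ is obtained by 
using asymptotic critical values from the standard normal 
distribution, 


$$\left(\widehat{\beta}_j - z_{\alpha/2} \times s_{\widehat{\beta}_j},\ \widehat{\beta}_j + z_{\alpha/2} \times s_{\widehat{\beta}_j}\right)$$


where $\Pr(Z > z_{\alpha/2}) = \alpha/2$. This bootstrap yields standard errors and 
confidence intervals that are asymptotically equivalent to those obtained 
using heteroskedasticity-robust standard errors. 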
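

A minimal sketch of the paired bootstrap for OLS, assuming the chapter's data are in memory, uses the vce(bootstrap) option of regress. The short regressor list, the number of replications, and the seed are our own choices for illustration.

```stata
* Assumes the chapter's data are in memory; reps() and seed() values are arbitrary
* Paired bootstrap: resample (y, x) jointly and refit OLS on each resample
regress ltotexp suppins totchr age female, vce(bootstrap, reps(400) seed(10101))

* Asymptotically equivalent analytical standard errors
regress ltotexp suppins totchr age female, vce(robust)
```

The two sets of standard errors should be similar but not identical, because the bootstrap standard errors depend on the particular resamples drawn.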


When regression model errors are instead clustered, the preceding 
method is adapted by resampling entire clusters with replacement. Each 
bootstrap sample of $G$ clusters generates an estimate of the regression 
parameters, denoted $\widehat{\boldsymbol{\beta}}_b$, $b = 1, \ldots, B$. Then $s^2_{\widehat{\beta}_j}$ and the associated 
confidence interval are computed using the preceding formulas. This cluster 
pairs bootstrap yields standard errors and confidence intervals that are 
equivalent as $G \to \infty$ to those obtained using cluster–robust standard 
errors. 
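

The cluster pairs bootstrap can be sketched by adding the cluster() suboption to vce(bootstrap). Because the chapter's data have no natural cluster variable, the sketch uses the nlswork example dataset loaded with webuse, with clusters defined by idcode; the model, number of replications, and seed are chosen only for illustration.

```stata
* Runnable sketch on Stata's example panel dataset, not the chapter's data
webuse nlswork, clear

* Cluster pairs bootstrap: whole idcode clusters are resampled with replacement
regress ln_wage age tenure, vce(bootstrap, cluster(idcode) reps(200) seed(10101))

* Analytical cluster-robust standard errors for comparison
regress ln_wage age tenure, vce(cluster idcode)
```

Omitting cluster(idcode) would resample individual observations instead, which yields standard errors comparable with heteroskedasticity-robust standard errors rather than with cluster–robust standard errors.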


3.4.8 Regression in logs 


The medical expenditure data are very right skewed. Then a linear model in 
levels can provide very poor predictions because it restricts the effects of 
regressors to be additive. For example, aging 10 years is assumed to 
increase medical expenditures by the same amount regardless of observed 
health status. Instead, it is more reasonable to assume that aging 10 years 
has a multiplicative effect. For example, it may increase medical 
expenditures by 20%. 


We begin with an exponential mean model for positive expenditures, 
with error that is also multiplicative, so $y_i = \exp(\mathbf{x}_i'\boldsymbol{\beta})\varepsilon_i$. Defining 
$\varepsilon_i = \exp(u_i)$, we have $y_i = \exp(\mathbf{x}_i'\boldsymbol{\beta} + u_i)$, and taking the natural 
logarithm, we fit the log-linear model 


$$\ln y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$$


by OLS regression of $\ln y$ on $\mathbf{x}$. The conditional mean of $\ln y$ is being 
modeled, rather than the conditional mean of $y$. In particular, 


$$E(\ln y|\mathbf{x}) = \mathbf{x}'\boldsymbol{\beta}$$


assuming $u_i$ has conditional mean zero. 


Parameter interpretation requires care. For regression of $\ln y$ on $\mathbf{x}$, the 
coefficient $\beta_j$ measures the effect of a change in regressor $x_j$ on $E(\ln y|\mathbf{x})$, 
but ultimate interest lies instead on the effect on $E(y|\mathbf{x})$. Some algebra 
shows that $\beta_j$ measures the proportionate change in $E(y|\mathbf{x})$ as $x_j$ changes, 
called a semielasticity, rather than the level of change in $E(y|\mathbf{x})$. For 
example, if $\beta_j = 0.02$, then a one-unit change in $x_j$ is associated with a 
proportionate increase of 0.02, or a 2% increase, in $E(y|\mathbf{x})$. 
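

The algebra behind the semielasticity interpretation can be sketched as follows. The sketch makes the simplifying assumption, not required elsewhere in this section, that $E\{\exp(u)|\mathbf{x}\}$ equals a constant $c$ that does not depend on $\mathbf{x}$, as holds when $u$ is independent of $\mathbf{x}$.

```latex
E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})\, E\{\exp(u)|\mathbf{x}\}
                = c \exp(\mathbf{x}'\boldsymbol{\beta})

\frac{\partial E(y|\mathbf{x})}{\partial x_j}
                = \beta_j\, c \exp(\mathbf{x}'\boldsymbol{\beta})
                = \beta_j\, E(y|\mathbf{x})

\frac{1}{E(y|\mathbf{x})}\,\frac{\partial E(y|\mathbf{x})}{\partial x_j} = \beta_j
```

The last line states that $\beta_j$ is the proportionate change in $E(y|\mathbf{x})$ for a small change in $x_j$. The constant $c$ cancels from the semielasticity, although it does not cancel from the level of $E(y|\mathbf{x})$ itself.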


Prediction of $E(y|\mathbf{x})$ is substantially more difficult because it can be 
shown that $E(y|\mathbf{x}) \neq \exp\{E(\ln y|\mathbf{x})\} = \exp(\mathbf{x}'\boldsymbol{\beta})$. This is pursued in section 4.2.3. Buntin 
and Zaslavsky (2004) compare several alternative regression models for 
medical expenditures. 


3.5 Basic regression analysis 


We use regress to run an OLS regression of the natural logarithm of medical 
expenditures, ltotexp, on suppins and several demographic and health-status 
measures. Using $\ln y$ rather than $y$ as the dependent variable leads to 
no change in the implementation of OLS but, as already noted, will change 
the interpretation of coefficients and predictions. 


Many of the details we provide in this section are applicable to all Stata 
estimation commands, not just to regress. 


3.5.1 Correlations 


Before regression, it can be useful to investigate pairwise correlations of the 
dependent variable and key regressor variables by using correlate. We 
have 


. * Pairwise correlations for dependent variable and regressor variables 
. correlate ltotexp suppins phylim actlim totchr age female income 
(obs=2,955) 


             |  ltotexp  suppins   phylim   actlim   totchr      age   female
-------------+---------------------------------------------------------------
     ltotexp |   1.0000
     suppins |   0.0941   1.0000
      phylim |   0.2924  -0.0243   1.0000
      actlim |   0.2888  -0.0675   0.5904   1.0000
      totchr |   0.4283   0.0124   0.3334   0.3260   1.0000
         age |   0.0858  -0.1226   0.2538   0.2394   0.0904   1.0000
      female |  -0.0058  -0.0796   0.0943   0.0499   0.0557   0.0774   1.0000
      income |   0.0023   0.1943  -0.1142  -0.1483  -0.0816  -0.1542  -0.1312

             |   income
-------------+---------
      income |   1.0000


Medical expenditures are highly correlated with the health-status measures 
phylim, actlim, and totchr. The regressors are only weakly correlated with 
each other, aside from the health-status measures. Note that correlate 
restricts analysis to the 2,955 observations where data are available for all 
variables in the variable list. The related command pwcorr, not 
demonstrated, with the sig option gives the statistical significance of the 
correlations. 
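

A minimal sketch of pwcorr, assuming the chapter's data are in memory, is the following. The shorter variable list is our choice, made to keep the output compact.

```stata
* Assumes the chapter's data are in memory
* Pairwise correlations with p-values (sig) and pairwise sample sizes (obs)
pwcorr ltotexp suppins totchr age, sig obs
```

Unlike correlate, pwcorr computes each correlation from all observations available for that pair of variables, so the sample size can differ from one entry to another.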


3.5.2 The regress command 


The regress command performs OLS regression and yields an analysis-of-variance 
table, goodness-of-fit statistics, coefficient estimates, standard 
errors, t statistics, p-values, and confidence intervals. The syntax of the 
command is 


regress depvar [indepvars] [if] [in] [weight] [, options] 


Other Stata estimation commands have similar syntaxes. The output 
from regress is similar to that from many linear regression packages. 


For independent cross-sectional data, the standard approach is to use the 
option vce(robust), which gives standard errors that are valid even if model 
errors are heteroskedastic; see section 3.4.5. 


When the vce(robust) option is used, the output from regress no 
longer includes the analysis-of-variance table presented in output for the 
section 1.6.2 example. The reason is that the F statistic for overall 
significance is calculated from sums of squares and degrees of freedom only 
under the special assumption of homoskedastic errors. 


We obtain 


. * OLS regression with heteroskedasticity-robust standard errors 
. regress ltotexp suppins phylim actlim totchr age female income, vce(robust) 

Linear regression                               Number of obs     =      2,955
                                                F(7, 2947)        =     126.97
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2289
                                                Root MSE          =     1.2023

------------------------------------------------------------------------------
             |               Robust
     ltotexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     suppins |   .2556428   .0465982     5.49   0.000     .1642744    .3470112
      phylim |   .3020598    .057705     5.23   0.000     .1889136     .415206
      actlim |   .3560054   .0634066     5.61   0.000     .2316797    .4803311
      totchr |   .3758201   .0187185    20.08   0.000     .3391175    .4125228
         age |   .0038016   .0037028     1.03   0.305    -.0034587     .011062
      female |  -.0843275    .045654    -1.85   0.065    -.1738444    .0051894
      income |   .0025498   .0010468     2.44   0.015     .0004973    .0046023
       _cons |   6.703737   .2825751    23.72   0.000     6.149673    7.257802
------------------------------------------------------------------------------


The regressors are jointly statistically significant because the overall F 
statistic of 126.97 has a p-value of 0.000. At the same time, much of the 
variation is unexplained with $R^2 = 0.2289$. The root MSE statistic reports $s$, 
the standard error of the regression, defined in (3.2). For a two-sided test at 
level 0.05, a regressor is individually statistically significant if p < 0.05. 
Thus age and female are statistically insignificant, while the other variables 
are statistically significant at level 0.05. The statistical insignificance of age 
may be due to sample restriction to elderly people and the inclusion of 
several health-status measures that capture well the health effect of age. 


Statistical significance of coefficients is easily established. More 
important is the economic significance of coefficients, meaning the 
measured impact of regressors on medical expenditures. This is 
straightforward for regression in levels because we can directly use the 
estimated coefficients. But here the regression is in logs. From section 3.4.8, 
in the log-linear model, parameters need to be interpreted as semielasticities. 
For example, the coefficient on suppins is 0.256. This means that private 
supplementary insurance is associated with a 0.256 proportionate rise, or a 
25.6% rise, in medical expenditures. Similarly, large effects are obtained for 
the health-status measures, whereas health expenditures for women are 8.4% 
lower than those for men after controlling for other characteristics. The 
income coefficient of 0.0025 suggests a very small effect, but this is 
misleading. The standard deviation of income is 23 (see section 3.2.3), so a 
1-standard-deviation increase in income is associated with a 0.058 
proportionate rise, or a 5.8% rise, in medical expenditures. 


MEs in nonlinear models are discussed in more detail in section 13.7. The 
preceding interpretations are based on calculus methods that consider very 
small changes in the regressor. For larger changes in the regressor, the 
finite-difference method is more appropriate. Then the interpretation in the 
log-linear model is similar to that for the exponential conditional mean model; 
see section 13.7.3. For example, the estimated effect of going from no 
supplementary insurance (suppins=0) to having supplementary insurance 
(suppins=1) is more precisely a $100 \times (e^{0.256} - 1)$, or 29.2%, rise. 
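
The calculations in the preceding two paragraphs can be reproduced from the stored coefficients. The following is a minimal sketch, assuming the chapter's dataset is in memory; nlcom is used here only to attach a delta-method standard error to the finite-difference effect.

```stata
* Sketch: percentage effects implied by the log-linear model
quietly regress ltotexp suppins phylim actlim totchr age female income, vce(robust)

* Calculus-based approximation for suppins: 100 x coefficient
display "Approximate effect of suppins (%): " 100*_b[suppins]

* Finite-difference effect of moving suppins from 0 to 1
display "Finite-difference effect (%):      " 100*(exp(_b[suppins]) - 1)

* Same quantity with a delta-method standard error and confidence interval
nlcom 100*(exp(_b[suppins]) - 1)

* Effect of a one-standard-deviation increase in income
quietly summarize income if e(sample)
display "Effect of 1 s.d. of income (%):    " 100*_b[income]*r(sd)
```

The finite-difference figure exceeds the calculus-based one whenever the coefficient is nonzero, and the gap grows with the size of the coefficient.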


The ereturn list command provides a list of what estimation results 
are stored in e(); see section 1.6.2 for details following the regress 
command. Stored results include regression coefficients in e(b) and the 
estimated VCE in e(V). 


Various postestimation commands, summarized next, enable prediction, 
computation of residuals, hypothesis testing, and model specification tests. 


To see what stored results and postestimation commands are available, 
give commands 


. * Display stored results and list available postestimation commands 
. ereturn list 


(output omitted ) 
. help regress postestimation 


(output omitted ) 
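
As an illustration of how the stored results can be used, the following sketch copies e(b) and e(V) into matrices and recomputes the t statistic for suppins by hand. It assumes the chapter's dataset is in memory; the matrix names b and V are arbitrary choices.

```stata
* Sketch: reuse stored estimation results after regress
quietly regress ltotexp suppins phylim actlim totchr age female income, vce(robust)

* Copy the coefficient vector and the estimated VCE
matrix b = e(b)
matrix V = e(V)

* suppins is the first regressor, so it occupies the first column
display "Coefficient on suppins:   " b[1,1]
display "Robust std. err.:         " sqrt(V[1,1])
display "t statistic by hand:      " b[1,1]/sqrt(V[1,1])

* The same quantity using the _b[] and _se[] shortcuts
display "t statistic via _b, _se:  " _b[suppins]/_se[suppins]
```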


3.5.3 Postestimation commands 


Standard postestimation commands available after most estimation 
commands are given in table 3.1. 


Table 3.1. Postestimation commands 

Topic                  Command 
--------------------------------------------------------------------
Results                estat summarize, estat vce, estat ic, etable 
Store results          estimates 
Prediction             predict, predictnl 
MEs                    margins, marginsplot 
Confidence intervals   lincom, nlcom 
Hypothesis tests       test, testnl, lrtest 
Specification tests    hausman, linktest 
--------------------------------------------------------------------


These postestimation commands are available not only following standard 
model commands such as logit and poisson but also following the nl 
and mlexp commands for user-defined models. The estat commands 
provide extended statistics. 


Other standard postestimation commands available following the 
regress command, and some other commands, are contrast for hypothesis 
tests, pwcompare for pairwise comparison of estimates, suest for seemingly 
unrelated regressions, and forecast for time-series regressions. 


Postestimation commands specific to the regress command are estat 
imtest and estat ovtest for specification tests; estat hettest and estat 
szroeter for heteroskedasticity tests; estat moran for spatial correlation 
tests; dfbeta for influence statistics; estat vif for variance inflation factors 
for regressors; and estat esize for effect sizes. Additional estat 
commands following regress provide diagnostic tests for time-series 
regression. Similarly, specific postestimation estat commands are available 
following many other estimation commands. 


Many of these postestimation commands are presented in this chapter 
and the subsequent chapter, while others are presented later in the book. 


3.5.4 Regression subject to constraints on the parameters 


A regression model can be fit subject to constraints on the parameters. 


For example, to obtain the least-squares estimates subject to 
$\beta_{\text{phylim}} = \beta_{\text{actlim}}$, we define the constraint using 
constraint define and then fit the model using cnsreg for constrained 
regression with the constraints() option. 


. constraint 1 phylim = actlim 


. cnsreg ltotexp suppins phylim actlim totchr age female income, 
> constraints(1) vce(robust) 


(output omitted ) 


See exercise 2 at the end of this chapter for an example. 
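
A quick way to confirm that the restriction was imposed is to compare the two constrained coefficients after estimation. This is a minimal sketch, assuming the chapter's dataset is in memory and that constraint number 1 is free to be (re)defined.

```stata
* Sketch: fit the constrained model and verify the restriction holds
constraint 1 phylim = actlim
quietly cnsreg ltotexp suppins phylim actlim totchr age female income, ///
    constraints(1) vce(robust)

* The two coefficients should coincide
display "phylim: " _b[phylim] "   actlim: " _b[actlim]

* List the constraints currently defined
constraint dir
```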


3.5.5 Hypothesis tests 


The test command performs hypothesis tests using the Wald test procedure 
that uses the fitted model coefficients and VCE. We present some leading 
examples here, with a more extensive discussion deferred to section 11.3. 
The F statistic version of the Wald test is used after regress, whereas for 
many other estimators, the chi-squared version is instead used. 


A common test is one of equality of coefficients. For example, consider 
testing that having a functional limitation has the same impact on medical 
expenditures as having an activity limitation. The test of 
$H_0$: $\beta_{\text{phylim}} = \beta_{\text{actlim}}$ against 
$H_a$: $\beta_{\text{phylim}} \neq \beta_{\text{actlim}}$ is implemented as 


. * Wald test of equality of coefficients 
. qui regress ltotexp suppins phylim actlim totchr age female income, vce(robust) 

. test phylim = actlim 

 ( 1)  phylim - actlim = 0 

       F(  1,  2947) =    0.27 
            Prob > F =    0.6054 


Because p = 0.61 > 0.05, we do not reject the null hypothesis at the 5% 
significance level. There is no statistically significant difference between the 
coefficients of the two variables. 
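
Because this test involves a single linear restriction, the same conclusion can be reached with lincom, which additionally reports the estimated difference and its standard error. The squared t statistic from lincom should equal the F statistic from test. A minimal sketch, assuming the chapter's dataset is in memory (the scalar names F_wald and t_sq are our own):

```stata
* Sketch: single-restriction Wald test via test and via lincom
quietly regress ltotexp suppins phylim actlim totchr age female income, vce(robust)

* F version of the Wald test
test phylim = actlim
scalar F_wald = r(F)

* Estimated difference, standard error, and confidence interval
lincom phylim - actlim
scalar t_sq = (r(estimate)/r(se))^2

display "F from test: " F_wald "   squared t from lincom: " t_sq
```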


Another common test is one of the joint statistical significance of a 
subset of the regressors. A test of the joint significance of the health-status 
measures is one of $H_0$: $\beta_{\text{phylim}} = 0$, $\beta_{\text{actlim}} = 0$, 
$\beta_{\text{totchr}} = 0$ against $H_a$: at least one is nonzero. This is 
implemented as 


. * Joint test of statistical significance of several variables 
. test phylim actlim totchr 

 ( 1)  phylim = 0 
 ( 2)  actlim = 0 
 ( 3)  totchr = 0 

       F(  3,  2947) =  272.36 
            Prob > F =    0.0000 


These three variables are jointly statistically significant at the 0.05 level 
because p = 0.00 < 0.05. 


3.5.6 Tables of output from several regressions 


It is very useful to be able to tabulate key results from multiple regressions 
for both one’s own analysis and final report writing. 


The estimates store command after regression associates the results in e() 
with a user-provided model name and preserves them even if 
subsequent models are fit. Given one or more such sets of stored estimates, 
estimates table presents a table of regression coefficients (the default) and, 
optionally, additional results. A related command is the estimates selected 
command, which enables results to be sorted based on the variable names or 
coefficient magnitudes. The estimates stats command lists the sample size 
and several likelihood-based statistics. 


We compare the original regression model with a variant that replaces 
income with educyr. The example uses several of the available options for 
estimates table. 


. * Store and then tabulate results from multiple regressions 
. qui regress ltotexp suppins phylim actlim totchr age female income, vce(robust) 

. estimates store REG1 

. qui regress ltotexp suppins phylim actlim totchr age female educyr, vce(robust) 

. estimates store REG2 

. estimates table REG1 REG2, b(%9.4f) se stats(N r2 F ll) 
>     stfmt(%9.2f) keep(suppins income educyr) 

----------------------------------------
    Variable |    REG1         REG2     
-------------+--------------------------
     suppins |    0.2556       0.2063   
             |    0.0466       0.0471   
      income |    0.0025                
             |    0.0010                
      educyr |                 0.0480   
             |                 0.0070   
-------------+--------------------------
           N |      2955         2955   
          r2 |      0.23         0.24   
           F |    126.97       132.53   
          ll |  -4733.45     -4710.96   
----------------------------------------
                            Legend: b/se


Legend: b/se 


This table presents coefficients (b) and standard errors (se); other available 
options include t statistics (t) and p-values (p). The statistics given are the 
sample size, the $R^2$, the overall F statistic (based on the robust estimate of 
the VCE), and the log likelihood (based on the strong assumption of normal 
homoskedastic errors). The keep() option, like the drop() option, provides 
a way to tabulate results for just the key regressors of interest. Here educyr 
is a much stronger predictor than income because it is more highly 
statistically significant and $R^2$ is higher, and there is considerable change in 
the coefficient of suppins. 
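
The other estimates subcommands mentioned above can be applied to the same stored results. The following is a minimal sketch, assuming REG1 and REG2 have been stored as in the preceding example; the significance thresholds passed to star() are an illustrative choice.

```stata
* Sketch: working with stored estimates REG1 and REG2

* List the estimation results currently stored
estimates dir

* Sample size, log likelihood, AIC, and BIC for each stored model
estimates stats REG1 REG2

* Coefficients with significance stars instead of standard errors
estimates table REG1 REG2, b(%9.4f) keep(suppins income educyr) star(.1 .05 .01)

* Make REG1 the active estimation results again for postestimation
estimates restore REG1
```

After estimates restore, postestimation commands such as test and margins operate on the restored model rather than on the most recently fit one.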


3.5.7 Even better tables of regression output 


The preceding table is very useful for model comparison but has several 
limitations. It would be more readable if the standard errors appeared in 
parentheses. It would be beneficial to be able to report a p-value for the 
overall F statistic. Also, some work may be needed to import the table into a 
table format in external software such as Excel, Word, or LaTeX. 


There are several ways to proceed. 


The etable command, introduced in an update to Stata 17, is designed 
specifically for tables of regression results. A richer table for the current 
example is the following: 


. etable, estimates(REG1 REG2) keep(suppins income educyr) 
>     cstat(_r_b, nformat(%8.4f)) cstat(_r_se, nformat(%8.4f)) 
>     mstat(N) mstat(r2) mstat(F) mstat(ll) column(estimates) 
>     stars(0.1 "*" 0.05 "**" 0.01 "***") showstars showstarsnote 

                                 REG1          REG2    
Has supp priv insurance         0.2556 ***    0.2063 ***
                               (0.0466)      (0.0471)  
Annual household income/1000    0.0025 **              
                               (0.0010)                
Years of education                            0.0480 ***
                                             (0.0070)  
Number of observations            2955          2955   
R-squared                         0.23          0.24   
F statistic                     126.97        132.53   
Log likelihood                -4733.45      -4710.96   

*** p<.01, ** p<.05, * p<.1 


To export this table to, for instance, a LaTeX file called mytable.tex, we 
could add the option export(mytable.tex). 
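
As a sketch of the export step, the command below writes a reduced version of the table to a LaTeX file. It assumes REG1 and REG2 are stored as above; the replace suboption, which overwrites an existing file, is our addition.

```stata
* Sketch: export an etable to LaTeX, overwriting any existing file
etable, estimates(REG1 REG2) keep(suppins income educyr) ///
    mstat(N) mstat(r2) showstars showstarsnote          ///
    export(mytable.tex, replace)

* Check that the file was written
confirm file mytable.tex
```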


Several community-contributed commands provide better output than the 
simplest Stata commands. The Stata add-on esttab of Jann (2007) is an 
extension of estimates table. Related add-ons of Jann (2005, 2007) are 
estout, a richer but more complicated version of esttab, and eststo, which 
extends estimates store. 


The community-contributed commands outreg (Gallup 2012a), 
frmttable (Gallup 2012b), and outreg2 (Wada 2005) also create tables and 
generate formatted Word and LaTeX files. 


Finally, one can use the more flexible version of the table command, 
introduced in Stata 17, that permits considerable formatting. For the current 
example, we have 


. * Tabulate results with some formatting using the table command 
. global xlist1 suppins phylim actlim totchr age female 

. table (colname[suppins income educyr] result) (command), 
>     command(_r_b _r_se: regress ltotexp $xlist1 income, vce(robust)) 
>     command(_r_b _r_se: regress ltotexp $xlist1 educyr, vce(robust)) 
>     nformat(%8.4f) sformat("(%s)" _r_se) style(table-reg3) 
>     stars(_r_p 0.01 "***" 0.05 "**" 0.1 "*", attach(_r_b)) 

Has supp priv insurance        0.2556***   0.2063*** 
                              (0.0466)    (0.0471) 
Annual household income/1000   0.0025** 
                              (0.0010) 
Years of education                         0.0480*** 
                                          (0.0070) 


The collect command, introduced in Stata 17, provides even more 
flexibility. Here we add measures of fit for each model and provide better 
labeling. 


. * Tabulate results with even better formatting using the collect command 
. capture collect clear 

. collect title "Regression with income or education" 
. collect notes "Heteroskedastic-robust standard errors" 

. qui: collect _r_b _r_se, tag(model[(1): Income]): 
>     regress ltotexp suppins $xlist1 income, vce(robust) 

. qui: collect _r_b _r_se, tag(model[(2): Education]): 
>     regress ltotexp suppins $xlist1 educyr, vce(robust) 

. collect label values colname suppins "Supplementary insurance", modify 
. collect stars _r_p 0.01 "***" 0.05 "**" 0.1 "*", attach(_r_b) 
. collect style cell, nformat(%9.4f) border(right, pattern(nil)) 
. collect style cell result[_r_se], sformat("(%s)") 
. collect style header result[N r2 F ll], level(label) 
. collect label levels result r2 "R-squared", modify 
. collect style header result, level(hide) 
. collect style column, extraspace(1) 
. collect style row stack, spacer delimiter(" x ") 

. qui: collect layout (colname[suppins income educyr]#result result[N r2 F ll]) 
>     (model) 

. collect preview 

Regression with income or education 

                               (1): Income   (2): Education 

Supplementary insurance          0.2556***       0.2063*** 
                                (0.0466)        (0.0471) 
Annual household income/1000     0.0025** 
                                (0.0010) 
Years of education                               0.0480*** 
                                                (0.0070) 
Number of observations        2955.0000       2955.0000 
R-squared                        0.2289          0.2406 
F statistic                    126.9723        132.5337 
Log likelihood                -4.73e+03       -4.71e+03 

Heteroskedastic-robust standard errors 


Heteroskedastic-robust standard errors 


The table can be written to a file that, for example, creates a table in 
Microsoft Word. 


. * Write tabulated results to a file in Microsoft Word format 
. collect export mytable.docx, replace 
(collection default exported to file mytable.docx) 


The collect command is sufficiently flexible to warrant its own manual, 
[TABLES] Stata Customizable Tables and Collected Results Reference 
Manual. 


3.5.8 Factor variables for categorical variables and interactions 


Suppose we wish to add as regressors to the regression model a set of 
indicator variables for family size and this set of indicators interacted with 
income. From sections 1.3.4 and 2.4.7, the factor variables i.famsze form a 
set of indicator variables based on the nonnegative, integer-valued 
categorical variable famsze, and the factor-variables operator 
c.income#i.famsze denotes the continuous variable income interacted with 
the set of indicators. 


. * Factor variables for sets of indicator variables and interactions 
. regress ltotexp suppins phylim actlim totchr age female c.income 
>     i.famsze c.income#i.famsze, vce(robust) noheader allbaselevels 
note: 8.famsze#c.income omitted because of collinearity. 
note: 13.famsze#c.income omitted because of collinearity. 

---------------------------------------------------------------------------------
                |               Robust
        ltotexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
----------------+----------------------------------------------------------------
        suppins |   .2393808   .0466804     5.13   0.000     .1478511    .3309104
         phylim |   .3053458   .0575971     5.30   0.000      .192411    .4182807
         actlim |   .3464812   .0631655     5.49   0.000     .2226279    .4703345
         totchr |   .3743755   .0187983    19.92   0.000     .3375162    .4112347
            age |     .00313   .0037607     0.83   0.405    -.0042438    .0105039
         female |  -.0725641   .0475022    -1.53   0.127    -.1657051    .0205769
         income |   .0028057   .0015684     1.79   0.074    -.0002695    .0058809
                |
         famsze |
             1  |          0  (base)
             2  |   .0759158   .0722829     1.05   0.294    -.0658145    .2176462
             3  |  -.2180488   .1310662    -1.66   0.096    -.4750399    .0389423
             4  |  -.2928383   .1983967    -1.48   0.140    -.6818493    .0961727
             5  |    .393989   .4501513     0.88   0.382    -.4886557    1.276634
             6  |  -.3438142   .4524585    -0.76   0.447    -1.230983    .5433545
             7  |  -1.101773   .5046005    -2.18   0.029     -2.09118   -.1123653
             8  |    .216274   .0625337     3.46   0.001     .0936596    .3388884
            10  |   1.482976   .2976336     4.98   0.000     .8993834    2.066568
            13  |  -1.874285   .0712566   -26.30   0.000    -2.014003   -1.734567
                |
famsze#c.income |
             1  |          0  (base)
             2  |  -.0012899   .0020704    -0.62   0.533    -.0053495    .0027697
             3  |    .004134   .0039464     1.05   0.295     -.003604    .0118719
             4  |   .0160613   .0083284     1.93   0.054    -.0002688    .0323915
             5  |  -.0251491    .017609    -1.43   0.153    -.0596764    .0093781
             6  |   .0280329   .0227835     1.23   0.219    -.0166403    .0727062
             7  |  -.0324118   .0279151    -1.16   0.246     -.087147    .0223234
             8  |          0  (omitted)
            10  |  -.1759027   .0169673   -10.37   0.000    -.2091717   -.1426337
            13  |          0  (omitted)
                |
          _cons |   6.748094   .3005551    22.45   0.000     6.158773    7.337414
---------------------------------------------------------------------------------

Here there are 10 possible indicator variables for family size (1-8, 10, and 
13), and the indicator for the lowest of these values (famsze = 1) is the 
base category that is omitted from the regression. In principle, there should 
be as many interactions with income included in the regression, but those 
corresponding to famsze equal to 8 and 13 are omitted because they are not 
identified for these data, where only one observation has famsze equal to 8 
and only one has famsze equal to 13. 
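
The sparse categories can be seen directly by tabulating famsze, and the long regressor list can be shortened with the ## operator, which adds main effects and interactions in one term. A minimal sketch, assuming the chapter's dataset is in memory (the scalar name r2_long is our own):

```stata
* Sketch: inspect category counts and use the ## shorthand

* Frequency distribution of family size
tabulate famsze

* Long form used in the text
quietly regress ltotexp suppins phylim actlim totchr age female c.income ///
    i.famsze c.income#i.famsze, vce(robust)
scalar r2_long = e(r2)

* Short form: c.income##i.famsze expands to income, i.famsze, and interactions
quietly regress ltotexp suppins phylim actlim totchr age female ///
    c.income##i.famsze, vce(robust)
display "R-squared, long form: " r2_long "   short form: " e(r2)
```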


We can test for joint significance of the sets of indicator variables, 
including their interaction with income, with the following command: 


. * Test joint significance of sets of indicator variables and interactions 
. testparm i.famsze c.income#i.famsze 

 ( 1)  2.famsze = 0 
 ( 2)  3.famsze = 0 
 ( 3)  4.famsze = 0 
 ( 4)  5.famsze = 0 
 ( 5)  6.famsze = 0 
 ( 6)  7.famsze = 0 
 ( 7)  8.famsze = 0 
 ( 8)  10.famsze = 0 
 ( 9)  13.famsze = 0 
 (10)  2.famsze#c.income = 0 
 (11)  3.famsze#c.income = 0 
 (12)  4.famsze#c.income = 0 
 (13)  5.famsze#c.income = 0 
 (14)  6.famsze#c.income = 0 
 (15)  7.famsze#c.income = 0 
 (16)  10.famsze#c.income = 0 

       F( 16,  2931) = 
            Prob > F =    0.0000 


The sets of indicator variables for famsze are jointly statistically significant 
at level 0.05 because p = 0.00 < 0.05. The number of degrees of freedom is 
16, so the additional two omitted variables (interaction with income when 
famsze equals 8 or 13) were correctly accounted for. 
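
The two sets of terms can also be tested separately, which helps to show whether family size matters through the intercept, the income slope, or both. A minimal sketch, assuming the factor-variable regression above is the active estimation result:

```stata
* Sketch: separate joint tests for main effects and for interactions

* Family-size indicators only (9 restrictions)
testparm i.famsze

* Income-by-family-size interactions only (7 identified restrictions)
testparm c.income#i.famsze
display "Restrictions tested: " r(df)
```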


Calculation of the MEs with respect to income or family size, which is 
more complicated in this model, is considered in the next section. 


3.5.9 Average marginal effects 


It becomes difficult to interpret regression results when interacted regressors 
are present, or more generally when models become nonlinear in the 
variables of interest. For example, suppose 
$E(y|x,z) = \beta_1 + \beta_2 x + \beta_3 z + \beta_4 x \times z$. Then changes 
in $x$ lead to changes in $E(y|x,z)$ through both the regressor $x$ and the 
interacted regressor $x \times z$. Using calculus methods, we obtain the ME of 
$x$ as $\partial E(y|x,z)/\partial x = \beta_2 + \beta_4 z$. Similarly, if a 
regressor enters quadratically, with 
$E(y|x) = \beta_1 + \beta_2 x + \beta_3 x^2$, then changes in $x$ lead to 
changes in $E(y|x)$ through both the regressor $x$ and its square $x^2$. Using 
calculus methods, we obtain the ME of $x$ as 
$\partial E(y|x)/\partial x = \beta_2 + 2\beta_3 x$. 


Despite the simplicity of the previous examples, there are many ways to 
compute such MEs. 


First, they vary according to what values of the regressors x and z the 
MEs are evaluated at. Common approaches are 1) to evaluate for each sample 
observation and then average (called an average ME or AME); 2) to evaluate at 
the sample mean of the regressors; 3) to evaluate at specified values of the 
regressors; and 4) to use a combination of these approaches with specified 
values given for only some of the regressors. 


Second, MEs can be computed using calculus methods or by instead 
using the finite-difference method that changes regressors by one unit, so 
$\Delta E(y|x)/\Delta x = E(y|x+1) - E(y|x)$. For linear models such as 
$E(y|x) = \beta_1 + \beta_2 x$ that do not have complications such as 
interactions or polynomials, we obtain the ME as $\beta_2$ using either 
calculus or finite-difference methods. But in nonlinear models such as the 
logit model with $E(y|x) = \Lambda(\beta_1 + \beta_2 x)$, calculus and 
finite-difference methods lead to different results. Calculus methods are used 
for continuous regressors, while finite-difference methods are often used for 
discrete regressors. 
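
The two methods already differ once a regressor enters quadratically. As a short worked illustration, using the quadratic conditional mean introduced earlier in this section, the one-unit finite difference and the derivative are

```latex
\begin{aligned}
E(y|x+1) - E(y|x)
  &= \beta_2\{(x+1) - x\} + \beta_3\{(x+1)^2 - x^2\}
   = \beta_2 + \beta_3(2x + 1), \\
\frac{\partial E(y|x)}{\partial x}
  &= \beta_2 + 2\beta_3 x .
\end{aligned}
```

The two expressions differ by exactly $\beta_3$, so they coincide only when the quadratic term is absent.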


In this section, we consider only average MEs; these provide a very useful 
summary of the impact of each underlying variable. AMEs can be computed 
using the margins postestimation command with the dydx() option. Such 
use of the margins command requires that factor-variable operators be used 
appropriately in defining the fitted model. For example, if x and z are 
continuous regressors, in the preceding motivating examples, we need to 
use, respectively, commands regress y c.x##c.z and regress y c.x##c.x. 
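
The following self-contained sketch illustrates the interaction case with simulated data. The variable names, sample size, and data-generating values are all illustrative assumptions; the point is only that the AME reported by margins equals the coefficient of x plus the interaction coefficient times the sample mean of z.

```stata
* Sketch with simulated data (all values below are illustrative assumptions)
clear
set seed 10101
set obs 1000
generate x = rnormal()
generate z = rnormal()
generate y = 1 + 2*x + 3*z + 0.5*x*z + rnormal()

* c.x##c.z expands to x, z, and the interaction c.x#c.z
regress y c.x##c.z

* AME of x computed by margins
margins, dydx(x)
matrix ame = r(b)

* AME of x by hand: beta_x + beta_xz times the sample mean of z
quietly summarize z if e(sample)
scalar ame_hand = _b[x] + _b[c.x#c.z]*r(mean)
display "margins: " ame[1,1] "   by hand: " ame_hand
```

Had the interaction been entered as a separately generated product variable, margins would treat it as an unrelated regressor and report only the coefficient of x.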


As an illustration, consider the regression model of section 3.2.4 with two 
complications. First, age enters quadratically. Because variable age is 
continuous, the regressor list will include c.age##c.age. Second, the 
continuous regressor income is interacted with the categorical regressor 
female, so the regressor list includes c.income##i.female. 


For completeness, we use factor-variable notation to additionally identify 
the remaining regressors as either categorical indicators or continuous. This 
is not necessary in this specific example, because the AME following OLS 
regression for the remaining regressors is simply the estimated coefficient 
using either finite difference or calculus methods. The first three regressors 
are binary indicators and enter with the i. prefix. The regressor totchr was 
treated as a continuous regressor and enters with the c. prefix; if instead we 
use i.totchr, then the model would include a set of indicator variables for 
the various values taken by variable totchr. We use the nofvlabel option to 
display factor-variable level values rather than value labels. We obtain 


. * Factor variables for model with interactions and quadratic 
. regress ltotexp i.suppins i.phylim i.actlim c.totchr c.age##c.age 
>     i.female##c.income, vce(robust) noheader nofvlabel 

---------------------------------------------------------------------------------
                |               Robust
        ltotexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
----------------+----------------------------------------------------------------
      1.suppins |    .259794   .0467268     5.56   0.000     .1681735    .3514145
       1.phylim |   .3036022   .0575284     5.28   0.000     .1908023    .4164022
       1.actlim |   .3741712   .0631516     5.92   0.000     .2503455    .4979969
         totchr |   .3722821   .0186565    19.95   0.000      .335701    .4088632
            age |   .2832178    .087228     3.25   0.001     .1121839    .4542518
                |
    c.age#c.age |  -.0018595   .0005795    -3.21   0.001    -.0029958   -.0007232
                |
       1.female |  -.1819534   .0666154    -2.73   0.006    -.3125709   -.0513358
         income |   .0009244   .0013742     0.67   0.501    -.0017701    .0036189
                |
female#c.income |
             1  |   .0045633   .0020141     2.27   0.024      .000614    .0085125
                |
          _cons |  -3.677626   3.267329    -1.13   0.260    -10.08411    2.728855
---------------------------------------------------------------------------------


Variable age appears quadratically as 
$0.283 \times age - 0.00186 \times age^2$ with 
ME $0.283 - 0.00372 \times age$, which varies with age. For example, the ME is 
0.097 at age = 50. And variables female and income appear interactively as 
$-0.182 \times female + 0.000924 \times income + 0.00456 \times female \times income$. 
The ME of female is then $-0.182 + 0.00456 \times income$, and the ME of 
income is $0.000924 + 0.00456 \times female$. These MEs clearly vary with the 
point of evaluation. 


The margins, dydx(*) command (see section 4.5) then yields the 
average MEs for all variables. 


. * Average MEs in model with interactions and quadratic 
. margins, dydx(*) nofvlabel 

Average marginal effects                                 Number of obs = 2,955
Model VCE: Robust

Expression: Linear prediction, predict()
dy/dx wrt:  1.suppins 1.phylim 1.actlim totchr age 1.female income

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   1.suppins |    .259794   .0467268     5.56   0.000     .1681735    .3514145
    1.phylim |   .3036022   .0575284     5.28   0.000     .1908023    .4164022
    1.actlim |   .3741712   .0631516     5.92   0.000     .2503455    .4979969
      totchr |   .3722821   .0186565    19.95   0.000      .335701    .4088632
         age |   .0070988    .003857     1.84   0.066    -.0004639    .0146614
    1.female |  -.0784425   .0456887    -1.72   0.086    -.1680276    .0111425
      income |   .0035898   .0010747     3.34   0.001     .0014826     .005697
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level. 


For the first four variables, the average MEs are simply the estimated 
regressor coefficients. For the remaining variables that enter quadratically or 
interactively, we have the following average MEs. Aging one year is 
associated with a 0.0071 increase in ltotexp, which corresponds to a 0.71% 
increase in the level of total medical expenditure. Being female is associated 
with a 0.0784 decrease in ltotexp, or a 7.84% decrease in total medical 
expenditure. And a $1,000 increase in annual household income is 
associated with a 0.0036 increase in ltotexp, or a 0.36% increase in total 
medical expenditure. The output additionally permits statistical inference. 
Using a two-sided test at level 0.05, we find that the effects of age and 
female are statistically insignificant and that the other variables are 
statistically significant. 
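
Because the model is linear in parameters, these average MEs can also be recovered by hand from the coefficients and sample means, which makes clear what margins is averaging. A minimal sketch, assuming the chapter's dataset is in memory; the ages used in the final at() option are illustrative choices.

```stata
* Sketch: average MEs by hand for the quadratic and interacted regressors
quietly regress ltotexp i.suppins i.phylim i.actlim c.totchr c.age##c.age ///
    i.female##c.income, vce(robust)

* AME of age: average of beta_age + 2*beta_age2*age over the sample
quietly summarize age if e(sample)
display "AME of age:    " _b[age] + 2*_b[c.age#c.age]*r(mean)

* AME of female: beta_female + beta_interaction * mean(income)
quietly summarize income if e(sample)
display "AME of female: " _b[1.female] + _b[1.female#c.income]*r(mean)

* AME of income: beta_income + beta_interaction * mean(female)
quietly summarize female if e(sample)
display "AME of income: " _b[income] + _b[1.female#c.income]*r(mean)

* MEs of age at selected ages rather than averaged over the sample
margins, dydx(age) at(age=(65 75 85))
```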


The margins command and the related marginsplot command are 
presented in much greater detail in sections 4.4-4.5. 


3.5.10 Cluster-robust standard errors 


As a purely illustrative example of clustering, we suppose that model errors 
are correlated for individuals who have the same age and are independent for 
those of different ages. For the current data, the variable age takes 26 unique 
values, so there are 26 clusters of varying sizes. 


Cluster-robust standard errors can be obtained using the 
vce(cluster clustvar) option of the regress command, where here we use 
vce(cluster age) because clusters are defined by the unique values of 
variable age. We obtain 


. * Cluster-robust standard errors 
. regress ltotexp suppins phylim actlim totchr age female income, vce(cluster age) 

Linear regression                               Number of obs     =      2,955
                                                F(7, 25)          =     178.77
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2289
                                                Root MSE          =     1.2023

                                   (Std. err. adjusted for 26 clusters in age)
------------------------------------------------------------------------------
             |               Robust
     ltotexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     suppins |   .2556428   .0546516     4.68   0.000     .1430856    .3681999
      phylim |   .3020598   .0634415     4.76   0.000     .1713995      .43272
      actlim |   .3560054     .06361     5.60   0.000     .2249981    .4870127
      totchr |   .3758201   .0153014    24.56   0.000     .3443064    .4073339
         age |   .0038016   .0063709     0.60   0.556    -.0093195    .0169228
      female |  -.0843275   .0469837    -1.79   0.085    -.1810923    .0124372
      income |   .0025498   .0013169     1.94   0.064    -.0001623    .0052619
       _cons |   6.703737   .4943524    13.56   0.000       5.6856    7.721875
------------------------------------------------------------------------------


For this somewhat contrived example, we expect little difference from the 
heteroskedastic-robust standard errors given earlier. In fact, for several 
variables, there is little difference; for several variables, the cluster-robust 
standard errors are larger; and for totchr, the cluster-robust standard error 
is smaller. 
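
The comparison is easier to read with the two sets of standard errors side by side. A minimal sketch, assuming the chapter's dataset is in memory; the stored names ROBUST and CLUSTER are our own.

```stata
* Sketch: compare heteroskedastic-robust and cluster-robust standard errors
quietly regress ltotexp suppins phylim actlim totchr age female income, vce(robust)
estimates store ROBUST

quietly regress ltotexp suppins phylim actlim totchr age female income, ///
    vce(cluster age)
estimates store CLUSTER

* Coefficients are identical across columns; only the standard errors differ
estimates table ROBUST CLUSTER, b(%9.4f) se(%9.4f)
```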


3.5.11 Bootstrap standard errors 


Bootstrap methods, introduced in section 3.4.7 and detailed in chapter 12, 
are resampling methods that are most often used to obtain standard errors 
when it is difficult to obtain standard errors by other methods. 


A key ingredient is the number of bootstrap resamples, denoted B, to use 
with a tradeoff between computational time and precision. Unless otherwise 
stated, we use B = 400 or B = 999; see section 12.3.4. To ensure that future 
execution of the same command leads to the same bootstrap samples, and 
hence the same bootstrap standard errors, we set the seed for the 
random-number generator used to determine the bootstrap resamples. 


There are many ways to bootstrap. We first consider a bootstrap to obtain 
heteroskedastic-robust standard errors. This can be obtained using the 
vce(bootstrap) option of the regress command. We obtain 


. * Bootstrap to give heteroskedastic-robust standard errors 
. regress ltotexp suppins phylim actlim totchr age female income, 
>     vce(bootstrap, reps(999) seed(10101)) 
(running regress on estimation sample) 

Bootstrap replications (999) 
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
  (replication dots omitted ) 

Linear regression                               Number of obs     =      2,955
                                                Replications      =        999
                                                Wald chi2(7)      =     958.51
                                                Prob > chi2       =     0.0000
                                                R-squared         =     0.2289
                                                Adj R-squared     =     0.2271
                                                Root MSE          =     1.2023

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     ltotexp | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     suppins |   .2556428   .0469263     5.45   0.000      .163669    .3476165
      phylim |   .3020598    .058277     5.18   0.000      .187839    .4162806
      actlim |   .3560054   .0653862     5.44   0.000     .2278508      .48416
      totchr |   .3758201   .0184238    20.40   0.000     .3397102    .4119301
         age |   .0038016    .003648     1.04   0.297    -.0033483    .0109516
      female |  -.0843275   .0457723    -1.84   0.065    -.1740396    .0053845
      income |   .0025498    .001072     2.38   0.017     .0004488    .0046508
       _cons |   6.703737   .2768602    24.21   0.000     6.161101    7.246374
------------------------------------------------------------------------------


These bootstrap heteroskedastic-robust standard errors are within 3% of the 
heteroskedastic-robust standard errors obtained using the vce(robust) 
option. The biggest relative difference is for actlim, which has a standard 
error of 0.06539 compared with 0.06361. 


The preceding bootstrap standard errors are asymptotically equivalent to 
the preceding heteroskedastic-robust standard errors. Part of the observed 
difference is due to using only 999 bootstrap replications, a difference that 
diminishes as the number of bootstrap replications increases. And part of the 
difference is due to finite degrees-of-freedom corrections. 


We next obtain cluster-robust bootstrap standard errors using the 
vce(bootstrap, cluster()) option of the regress command. We obtain 


. * Bootstrap to give cluster-robust standard errors 
. regress ltotexp suppins phylim actlim totchr age female income, 
>     vce(bootstrap, cluster(age) nodots reps(999) seed(10101)) 

Linear regression                               Number of obs     =      2,955
                                                Replications      =        999
                                                Wald chi2(7)      =    1110.58
                                                Prob > chi2       =     0.0000
                                                R-squared         =     0.2289
                                                Adj R-squared     =     0.2271
                                                Root MSE          =     1.2023

                                    (Replications based on 26 clusters in age)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     ltotexp | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     suppins |   .2556428   .0544001     4.70   0.000     .1490206    .3622649
      phylim |   .3020598   .0613503     4.92   0.000     .1818154    .4223042
      actlim |   .3560054   .0646722     5.50   0.000     .2292502    .4827606
      totchr |   .3758201   .0159496    23.56   0.000     .3445595    .4070808
         age |   .0038016   .0066821     0.57   0.569     -.009295    .0168983
      female |  -.0843275   .0475312    -1.77   0.076     -.177487    .0088319
      income |   .0025498    .001266     2.01   0.044     .0000685    .0050311
       _cons |   6.703737   .5185035    12.93   0.000     5.687489    7.719986
------------------------------------------------------------------------------


These bootstrap cluster-robust standard errors are within 4% of the 
cluster-robust standard errors obtained earlier. 


In this example, we have used the nodots option to suppress the output 
of a dot for each bootstrap replication. More generally, the command set 
dots off will suppress dots in all commands that by default produce dots. 


Sometimes, one is interested in specific nonlinear functions of estimated 
coefficients (and data) that may have economic meaning and significance, 
for example, the ratio of two coefficients or the elasticity of the dependent 
variable with respect to a particular regressor. Exact confidence intervals for 
such a function are generally difficult to compute, and one often settles for 
an approximation based on the delta method. In such instances, the bootstrap 
method provides an alternative. 


To illustrate this, we consider inference on the ratio of the slope 
parameters for suppins and phylim. We can no longer use the 
vce(bootstrap) option of the regress command and instead need to use the 
bootstrap prefix. Here the quantity being bootstrapped is 
_b[suppins]/_b[phylim]. We obtain 


. * Bootstrap to compute confidence interval for a ratio of two coefficients 
. bootstrap _b[suppins]/_b[phylim], reps(999) seed(10101) nodots: 
>     regress ltotexp suppins phylim actlim totchr age female income 

Linear regression                               Number of obs     =      2,955
                                                Replications      =        999

      Command: regress ltotexp suppins phylim actlim totchr age female income
        _bs_1: _b[suppins]/_b[phylim]

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _bs_1 |   .8463317    .262695     3.22   0.001     .3314588    1.361204
------------------------------------------------------------------------------


Although the sample estimate of the ratio is close to 1, the confidence 
interval is quite wide. A similar approach can be used to construct bootstrap 
confidence intervals for other nonlinear functions of parameters. However, 
care should be exercised to ensure that the target function is differentiable 
and well behaved. For example, if the function is a ratio, then the 
denominator needs to be bounded away from 0 so that the moments exist. 
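
The comparison between the bootstrap and the delta method can be sketched as follows. This example is illustrative only: it uses the auto dataset shipped with Stata rather than the medical expenditure data, the regressors are arbitrary, and the name ratio is a label introduced here. It is written for a do-file because of the /// line continuation.

```stata
* Sketch on Stata's shipped auto dataset (an assumption; not the book's data)
sysuse auto, clear
regress price weight length

* Delta-method standard error and confidence interval for the ratio
nlcom (ratio: _b[weight]/_b[length])

* Bootstrap the same ratio; the reps() and seed() values are arbitrary choices
bootstrap ratio=(_b[weight]/_b[length]), reps(999) seed(10101) nodots: ///
    regress price weight length

* Percentile and bias-corrected intervals do not rely on the normal approximation
estat bootstrap, percentile bc
```

If the normal-based bootstrap interval, the percentile interval, and the nlcom interval differ noticeably, that is a sign that the ratio is poorly approximated by a normal distribution, typically because the denominator is imprecisely estimated.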


3.5.12 OLS regression to test mean and difference in mean 


The methods presented in an introductory statistics course for inference on
the mean of a single variable, using a one-sample t test, or on the difference
in mean, using a two-sample t test, are implemented in Stata using the ttest
command.


These tests can also be implemented using OLS regression. OLS regression 
has many advantages over the ttest command. If errors are clustered, for 
example, we can perform valid inference on the mean by using the command
regress y, vce(cluster id_clu). We can add control variables, as already
done in the preceding analysis. With many categories of health insurance, we 
can simply regress on indicator variables for the various categories (with one 
category omitted). And extension to difference in differences is 
straightforward; see section 4.8. 


ttest command 


For a one-sample t test on the mean $\mu$ of a single variable, the command
syntax is


ttest varname == # [if] [in] [, level(#)]


Inference is based on assuming that


$$ t = \frac{\bar{y} - \mu}{s/\sqrt{N}} \sim t(N-1) $$


For a two-sample t test of the difference in means $\mu_1 - \mu_0$ of two
variables $y_1$ and $y_0$, the command syntax is


ttest varname [if] [in], by(groupvar) [options]


Inference for the two-sample test varies with whether the variances $\sigma_1^2$ and
$\sigma_0^2$ are assumed to be equal, and the norm in economics is to assume unequal
variances. Then the ttest command with option unequal bases inference on
assuming that


$$ t = \frac{(\bar{y}_1 - \bar{y}_0) - (\mu_1 - \mu_0)}{\sqrt{s_1^2/N_1 + s_0^2/N_0}} \sim t(\nu) $$


where there are several different formulas for $\nu$, including $\nu = N_1 + N_0 - 2$.


For completeness, we note that the ttest command can also be used to
test equality of means of two variables $y_1$ and $y_2$, where the variables may be
paired (with $y_{1i}$ and $y_{2i}$ data for the same individual) or unpaired.


One-sample t test 


We first consider inference on the mean $\mu$ of a single variable, using the
mean command, which gives a confidence interval for $\mu$, in addition to the ttest
command. For the level of health expenditure, and a test that $\mu = 7500$, we
obtain


. * Test of mean using ttest
. ttest totexp = 7500

One-sample t test

Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
  totexp |   3,064    7030.889    214.1287    11852.75    6611.039     7450.74

    mean = mean(totexp)                                          t =  -2.1908
H0: mean = 7500                                 Degrees of freedom =     3063

   Ha: mean < 7500            Ha: mean != 7500              Ha: mean > 7500
 Pr(T < t) = 0.0143      Pr(|T| > |t|) = 0.0285          Pr(T > t) = 0.9857


Identical results are obtained by OLS regression of totexp on just an 
intercept with default standard errors (or if option vce(robust) is used). We
have 


. * Test of mean using regress
. regress totexp, noheader

      totexp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
       _cons |   7030.889   214.1287    32.83   0.000     6611.039     7450.74

. test _cons=7500

 ( 1)  _cons = 7500

       F(  1,  3063) =    4.80
            Prob > F =    0.0285


The test postestimation command reports an $F(1, N-1)$ statistic that is
the square of a $t(N-1)$ statistic. Here $(-2.1908)^2 = 4.80$ and both tests
yield $p = 0.0285$.
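
The equivalence of the two approaches can be checked on any dataset. The following sketch uses the auto dataset shipped with Stata, and the hypothesized value 6000 is an arbitrary choice made for illustration.

```stata
* Sketch on Stata's shipped auto dataset (an assumption; not the book's data)
sysuse auto, clear

* One-sample t test that the mean price equals 6000 (arbitrary value)
ttest price == 6000

* The same test from a regression on an intercept only
regress price, noheader
test _cons = 6000

* test leaves the F statistic and its p-value in r(); F is the squared t statistic
display "F = " %8.4f r(F) "   p = " %6.4f r(p)
```

The two-sided p-value displayed at the end should match the one reported by ttest for the alternative mean != 6000.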


Two-sample t test 


Next consider the difference in mean health expenditure by whether the 
person has supplemental health insurance. The ttest command yields 


. * Test of difference in means using ttest
. ttest totexp, by(suppins) unequal

Two-sample t test with unequal variances

   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
      No |   1,283    6420.058    312.6458    11198.66    5806.704    7033.411
     Yes |   1,781    7470.921    291.1413    12286.72    6899.907    8041.936
Combined |   3,064    7030.889    214.1287    11852.75    6611.039     7450.74
    diff |           -1050.864    427.2127               -1888.535   -213.1925

    diff = mean(No) - mean(Yes)                                   t =  -2.4598
H0: diff = 0                     Satterthwaite's degrees of freedom =  2899.24

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0070         Pr(|T| > |t|) = 0.0140          Pr(T > t) = 0.9930


OLS regression of totexp on suppins with heteroskedastic-robust
standard errors yields almost identical results.


. * Test of difference in means using regress
. regress totexp suppins, vce(robust)

Linear regression                               Number of obs     =      3,064
                                                F(1, 3062)        =       6.05
                                                Prob > F          =     0.0140
                                                R-squared         =     0.0019
                                                Root MSE          =      11843

             |               Robust
      totexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
     suppins |   1050.864   427.2072     2.46   0.014     213.2218    1888.506
       _cons |   6420.058    312.626    20.54   0.000      5807.08    7033.036


We obtain the same point estimate of 1050.864, apart from the sign, because
ttest reports mean(No) - mean(Yes). The standard error differs
slightly (427.2072 compared with 427.2127) because of slightly different
ways that degrees-of-freedom adjustments are made. The ttest command
uses the $t(2899.24)$ distribution, using Satterthwaite's formula, whereas
regress uses $\nu = N_1 + N_0 - 2 = 3062$. These degrees-of-freedom
differences make very little difference here; with few observations, the
differences are larger.
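
The role of the degrees-of-freedom adjustment is easier to see in a small sample. The following sketch uses the auto dataset shipped with Stata (74 observations), which is an assumption made for illustration. The vce(hc2) line is an addition not used in the text: with a single binary regressor, the HC2 leverage correction should reproduce the unequal-variance standard error reported by ttest.

```stata
* Sketch on Stata's shipped auto dataset (an assumption; not the book's data)
sysuse auto, clear

* Unequal-variance t test with Satterthwaite's degrees of freedom (the default)
ttest price, by(foreign) unequal

* Same test with Welch's degrees-of-freedom formula
ttest price, by(foreign) unequal welch

* OLS on the group indicator with heteroskedastic-robust standard errors
regress price foreign, vce(robust)

* HC2 divides each squared residual by (1 - h_i); with one binary regressor
* this should give the same standard error as ttest with option unequal
regress price foreign, vce(hc2)
```

With only 74 observations, the confidence intervals from ttest and regress can be expected to differ more visibly than in the example above.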


3.6 Specification analysis 


The fitted model in section 3.5.2 has $R^2 = 0.23$, which is reasonable for
cross-sectional data, and most regressors are highly statistically significant 
with the expected coefficient signs. Therefore, it is tempting to begin 
interpreting the results. 


However, before we do so, it is useful to subject this regression to some 
additional scrutiny because a badly misspecified model may lead to 
erroneous inferences. Stata also presents the user with an impressive and 
bewildering menu of choices of diagnostic checks for the currently fitted 
regression; see [R] regress postestimation. Some are specific to OLS 
regression, whereas others apply to most regression models. Some are visual 
aids such as plots of residuals against fitted values. Some are diagnostic 
statistics such as influence statistics that indicate the relative importance of 
individual observations. And some are formal tests that test for the failure of 
one or more assumptions of the model. 


In this section, we focus on detecting and controlling for influential or 
outlying observations using model diagnostics, data transformation, and 
estimators other than OLS. In the subsequent section, we present several 
specification tests, with the notable exception of testing for regressor 
exogeneity, which is deferred to section 7.4.6. 


3.6.1 Robust regression 


The topic of robust regression (as distinct from robust variance estimation of 
regression coefficients) deals with ways to mitigate the limitations of 
standard regression analysis of data with “outliers”, or influential 
observations, that can potentially distort statistical inference. In essence, 
outliers are observations that are generated by a chance mechanism that is 
different from that which is assumed to generate most of the sample. That is, 
the available sample is a mixture of draws from more than one population, 
with extreme observations or outliers contributing a relatively small 
proportion of the sample. 


Despite this fact, these few observations may create significant 
distortions. This provides motivation for detection or elimination, or both, of 
such observations. This task is challenging because outliers and extreme 
observations may be hard to differentiate in a small sample. Extreme 
observations, by definition, are low-probability events, but they are also 
draws from the same distribution that is assumed to generate the remaining 
observations in the sample. Hence, the case for eliminating such 
observations is weaker. 


In the simple regression $y = \mathbf{x}'\boldsymbol{\beta} + u$, the y-outliers can be generated by
either x-outliers (the signal component) or u-outliers (the noise component),
or both. Closely related to the notion of u-outliers is the idea of fat-tailed or
heavy-tailed distributions that could generate samples with a higher
proportion of extreme y-observations, relative to some benchmark
distribution such as the Gaussian.


Outliers may result from measurement errors due to equipment 
malfunction, data coding errors, or data generated under unique 
circumstances. Because outliers are likely to distort regression results more 
severely when the sample is small, detection and elimination are more 
feasible in such cases. Distinguishing between outliers and heavy-tailed 
distributions may be difficult even in a large sample. Often, the objective is 
to make inference robust to the presence of such observations; see Huber and 
Ronchetti (2009). 


Categorizing an observation or a group of observations as outliers, often 
on the basis of regression residuals, depends on the assumed benchmark 
error distribution or the functional form of the regression. Specifying that 
functional form involves the choice of transformation. Hence, the topic of 
robust regression is related to choice of variable transformations in 
regression. 


Transformations involve rewriting the regression model in terms of 
rescaled variables, which in turn changes the scale parameter of the error 
distribution. The use of a “wrong” functional form may frequently lead to 
residuals that suggest outliers, but the same conclusion may not apply to the 
regression on transformed variables. Hence, outliers are to be considered in
the context of the maintained functional form; therefore, the selection of
functional form is a step that should precede outlier detection.


3.6.2 Residual diagnostic plots 


Diagnostic plots are used less in microeconometrics than in some other 
branches of statistics for several reasons. First, economic theory and 
previous research provide a lot of guidance as to the likely key regressors 
and functional form for a model. Studies rely on this and shy away from 
excessive data mining; see chapter 28 for data-mining methods. Second, 
microeconometric studies typically use large datasets and regressions with 
many variables. Many variables potentially lead to many diagnostic plots, 
and many observations make it less likely that any single observation will be 
very influential, unless data for that observation are seriously miscoded. 


We consider various residual plots that can aid in outlier detection, where 
an outlier is an observation poorly predicted by the model. One way to do 
this is to plot actual values against fitted values of the dependent variable. 
The postestimation command rvfplot gives a transformation of this,
plotting the residuals $\widehat{u}_i = y_i - \widehat{y}_i$ against the fitted values $\widehat{y}_i = \mathbf{x}_i'\widehat{\boldsymbol{\beta}}$.


We have 


. * Plot of residuals against fitted values 
. qui regress ltotexp suppins phylim actlim totchr age female income, vce(robust) 


. rvfplot, msize(tiny) scale(1.2) 


[Figure: scatterplot of residuals (vertical axis) against fitted values (horizontal axis)]


Figure 3.2. Residuals plotted against fitted values after OLS regression


Figure 3.2 does not indicate any extreme outliers, though the three
observations with a residual less than $-5$ may be worth investigating. To do
so, we need to generate $\widehat{u}$ by using the predict command, detailed in
section 4.2, and we need to list some details on those observations with
$\widehat{u}_i < -5$. We have


. * Details on the outlier residuals 
. predict uhat, residual 
(109 missing values generated) 


. predict yhat, xb 
. list totexp ltotexp yhat uhat if uhat < -5, clean 


totexp ltotexp yhat uhat 
1. 3 1.098612 7.254341 -6.155728 
2. 6 1.791759 7.513358 -5.721598 
3. 9 2.197225 7.631211 -5.433987 


The three outlying residuals are for three observations with the very smallest 
total annual medical expenditures of, respectively, $3, $6, and $9. The model 
evidently greatly overpredicts for these observations, with the predicted
logarithm of total expenditures (yhat) much greater than ltotexp.


Stata provides several other residual plots. The rvpplot postestimation 
command plots residuals against an individual regressor. The avplot 
command provides an added-variable plot, or partial regression plot, that is a 
useful visual aid to outlier detection. Other commands give component-plus- 
residual plots that aid detection of nonlinearities and leverage plots. For 
details and additional references, see [R] regress postestimation diagnostic 
plots. 
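
The plots mentioned above can be produced as in the following sketch. It uses the auto dataset shipped with Stata and an arbitrary specification, both of which are assumptions made for illustration. The graph names are labels introduced here.

```stata
* Sketch on Stata's shipped auto dataset (an assumption; not the book's data)
sysuse auto, clear
quietly regress price weight length foreign

* Residuals against fitted values, and against one regressor
rvfplot, yline(0) name(rvf, replace)
rvpplot weight, yline(0) name(rvp, replace)

* Added-variable (partial regression) plot for weight
avplot weight, name(av, replace)

* Component-plus-residual plot, useful for spotting nonlinearity in weight
cprplot weight, name(cpr, replace)

* Leverage against normalized squared residuals
lvr2plot, name(lvr, replace)
```

Each command must follow regress with default (nonrobust or robust) OLS estimates in memory; naming the graphs keeps all five windows available for comparison.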


Another useful plot is a Q-Q plot, or quantile-quantile plot, that plots the
quantiles of one variable against those of another variable. If the variables 
have the same distribution aside from centering and scaling, then we expect 
a linear relationship between the quantiles. 


The qqplot command plots the quantiles of one variable against the 
quantiles of a second variable. The qnorm command instead plots the 
quantiles of one variable against the quantiles of a normal distribution with 
mean and variance of those of the first variable. 


The following example creates a Q-Q plot of residuals against the 
normal distribution and a plot of the kernel density estimate of the residuals 
compared with the normal distribution, where residuals come from OLS 
regression of the log-linear model. 


. * Quantile-quantile plot of residuals against normal and kernel density estimate
. local endash = ustrunescape("\u2013")

. qnorm uhat, msize(small) title("Q`endash'Q plot of residuals versus normal")
>     name(graph1, replace)

. kdensity uhat, normal legend(off) name(graph2, replace)
. graph combine graph1 graph2, iscale(1.2) ysize(2.5) xsize(6.0)


[Figure: two panels, titled "Q-Q plot of residuals versus normal" and "Kernel density estimate"]


Figure 3.3. Q-Q plot of residuals against normal and kernel
density estimate


The first panel of figure 3.3 indicates that transforming to logs has led to OLS 
residuals that are normally distributed, aside from the smallest residuals. As 
already seen, these smallest residuals correspond to unusually small medical 
expenditures. The second panel also suggests that residuals are 
approximately normal but does not so clearly show the departure from 
normality for the smallest residuals. 


3.6.3 Influential observations 


Some observations may have unusual influence in determining parameter 
estimates and resulting model predictions. 


Influential observations can be detected using one of several measures
that are large if the residual is large, the leverage measure is large, or both.
The leverage measure of the $i$th observation, denoted by $h_i$, equals the $i$th
diagonal entry in the so-called hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. If $h_i$ is large,
then $y_i$ has a big influence on its OLS prediction $\widehat{y}_i$ because $\widehat{\mathbf{y}} = \mathbf{H}\mathbf{y}$.
Different measures, including $h_i$, can be obtained by using different options
of predict.


A commonly used measure is $\text{DFITS}_i$, which can be shown to equal the
(scaled) difference between predictions of $y_i$ with and without the $i$th
observation in the OLS regression (so DFITS means "difference in fits"). Large
absolute values of DFITS indicate an influential data point. One can plot DFITS
and investigate further observations with outlying values of DFITS. A rule of
thumb is that observations with $|\text{DFITS}| > 2\sqrt{K/N}$ may be worthy of
further investigation, though for large datasets this rule can suggest that
many observations are influential.
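
For reference, the measure can be written in terms of the leverage and the Studentized residual. The following is a sketch in notation introduced here: $r_i$ is the Studentized residual and $s_{(i)}$ is the root mean squared error from the regression that omits observation $i$. The second line works through the rule-of-thumb threshold for a model with seven regressors plus an intercept ($K = 8$) and $N = 2955$ observations.

```latex
\mathrm{DFITS}_i = r_i \sqrt{\frac{h_i}{1 - h_i}},
\qquad
r_i = \frac{\widehat{u}_i}{s_{(i)}\sqrt{1 - h_i}}
\\[6pt]
2\sqrt{K/N} = 2\sqrt{8/2955} \approx 2 \times 0.0520 \approx 0.104
```

The first expression makes explicit that DFITS is large when the Studentized residual is large, when the leverage is large, or both.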


The dfits option of predict can be used after regress provided that 
regression is with default standard errors because the underlying theory 
presumes homoskedastic errors. We have 


. * Compute dfits that combines outliers and leverage
. qui regress ltotexp suppins phylim actlim totchr age female income

. predict dfits, dfits
(109 missing values generated)

. scalar threshold = 2*sqrt((e(df_m)+1)/e(N))

. display "dfits threshold = " %6.3f threshold
dfits threshold =  0.104

. tabstat dfits, statistics(min p1 p5 p95 p99 max) format(%9.3f)
>     columns(statistics)

    Variable |       Min        p1        p5       p95       p99       Max
       dfits |    -0.421    -0.147    -0.083     0.085     0.127     0.221

. list dfits totexp ltotexp yhat uhat if abs(dfits) > 2*threshold & e(sample),
>     clean

             dfits   totexp    ltotexp       yhat        uhat
    1.   -.2319179        3   1.098612   7.254341   -6.155728
    2.   -.3002994        6   1.791759   7.513358   -5.721598
    3.   -.2765266        9   2.197225   7.631211   -5.433987
   10.   -.2170063       30   3.401197   8.348724   -4.947527
   42.   -.2612321      103   4.634729    7.57982   -2.945091
   44.   -.4212185      110    4.70048   8.993904   -4.293423
  108.   -.2326284      228   5.429346   7.971406    -2.54206
  114.   -.2447627      239   5.476463   7.946239   -2.469776
  137.   -.2177336      283   5.645447   7.929719   -2.284273
  211.    -.211344      415   6.028278   8.028338    -2.00006
 2925.    .2207284    62346   11.04045   8.660131    2.380323


Here over 2% of the sample has |DFITS| greater than the suggested threshold 
of 0.104. But only 11 observations have |DFITS| greater than two times the 
threshold. These correspond to observations with relatively low 
expenditures, or in one case, relatively high expenditures. We conclude that 
no observation has unusual influence. 
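
The same steps can be tried on a small dataset, where individual observations matter more. The following sketch uses the auto dataset shipped with Stata and an arbitrary specification, both of which are assumptions made for illustration; the variable names dfit and lev and the scalar thr are labels introduced here.

```stata
* Sketch on Stata's shipped auto dataset (an assumption; not the book's data)
sysuse auto, clear

* dfits requires default (nonrobust) standard errors
quietly regress price weight length foreign

* DFITS and leverage (diagonal of the hat matrix) for each observation
predict double dfit, dfits
predict double lev, leverage

* Rule-of-thumb threshold 2*sqrt(K/N), where K counts the intercept
scalar thr = 2*sqrt((e(df_m)+1)/e(N))
display "dfits threshold = " %6.3f thr

* Observations exceeding the threshold in absolute value
list make price dfit lev if abs(dfit) > thr & !missing(dfit), clean noobs
```

With 74 observations the threshold is much larger than 0.104, which illustrates why the rule of thumb flags a growing share of observations as N increases.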


3.6.4 Stata’s robust regression 


If one wants to apply a variant of the robust regression estimator without 
first discarding the troublesome observations, then one option is to use the 
rreg command, whose syntax is similar to that of the ordinary regress 
command, 


rreg depvar [indepvars] [if] [in] [, options]


In this case, the outliers, or observations that correspond to large 
residuals, will get reduced weight; when the weight is close to zero, the 
observation is effectively dropped from the sample. 


Briefly, the theory behind the rreg command is as follows. Let $r_i$ denote
the residual $(y_i - \mathbf{x}_i'\boldsymbol{\beta})$ and $r_i/\sigma$ denote the scaled residual obtained by
scaling the residuals by a robust estimate of the standard deviation
(implicitly assuming homoskedasticity). Define the objective function for
robust regression as follows,


$$ Q(\boldsymbol{\beta}) = \sum_{i=1}^N q_i\left(\frac{r_i}{\sigma}\right) = \sum_{i=1}^N q_i\left(\frac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}\right) $$


where $q_i(\cdot)$ is a weighting function. OLS is the special case in which the
weighting function is the same quadratic function for all observations.
Robust regression instead downweights this quadratic function for large
residuals $y_i - \mathbf{x}_i'\boldsymbol{\beta}$. The first-order condition for the optimum is


$$ \sum_{i=1}^N \frac{\partial q_i(r_i/\sigma)}{\partial \boldsymbol{\beta}'}\,\mathbf{x}_i = 0 $$


where $\partial q_i/\partial \boldsymbol{\beta}'$ is an observation-specific weight.


Implementation of the rreg command begins with OLS estimation of the 
regression. Observation-specific weights are calculated based on the 
absolute Studentized residuals, and OLS regression is run again using the 
weights. Estimation is iterative because the first-order conditions are solved 
using revised weights in each round, until convergence is achieved to satisfy 
the tolerance criterion. The choice of weights has an important role in the 
speed of convergence. The specifics of the weights are covered in the Stata 
manual entry for the rreg command. 
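
For orientation, the two weight functions used in the iterations can be sketched as follows. These expressions and tuning constants are stated from our recollection of the Methods and formulas section of [R] rreg and should be verified there; the symbols $e_i$, $s$, $M$, $c_h$, and $c_b$ are notation introduced here.

```latex
u_i = \frac{e_i}{s}, \qquad s = \frac{M}{0.6745}, \qquad
M = \operatorname{med}_i \left| e_i - \operatorname{med}_j (e_j) \right|
\\[6pt]
\text{Huber:}\quad
w_i =
\begin{cases}
  1 & \text{if } |u_i| \le c_h \\
  c_h / |u_i| & \text{otherwise}
\end{cases}
\qquad c_h = 1.345
\\[6pt]
\text{Biweight:}\quad
w_i =
\begin{cases}
  \left[ 1 - (u_i/c_b)^2 \right]^2 & \text{if } |u_i| \le c_b \\
  0 & \text{otherwise}
\end{cases}
\qquad c_b = 4.685 \ \text{(default tuning)}
```

Under the biweight function, observations with sufficiently large scaled residuals receive a weight of exactly 0, which is how an observation can be effectively dropped from the sample.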


3.6.5 Median regression 


Another approach for handling outliers and departures from normality is to 
use a method that, unlike OLS and MLE assuming normality, is not based on 
the quadratic loss function. 


Median regression, or least-absolute-deviations regression, is a special 
case of quantile regression that is based on minimizing the sum of absolute 
deviations rather than the sum of squared deviations. Then $\widehat{\boldsymbol{\beta}}$ minimizes


$$ Q(\boldsymbol{\beta}) = \sum_{i=1}^N \left| y_i - \mathbf{x}_i'\boldsymbol{\beta} \right| $$


This objective function is less sensitive to extreme observations. The median 
estimator, a member of the class of m estimators, is implemented in Stata as 
a special (default) case of the quantile regression qreg command. A more 
extensive coverage of this estimator is given in chapter 15. 


The median regression estimator offers several advantages over OLS. Its 
target parameter is the conditional median of y, which is less sensitive than 
the conditional mean to skewness and excess kurtosis. It is a semiparametric 
estimator and is known to be less sensitive to departures from normality and 
hence to the presence of outliers. It also does not require homoskedastic 
errors. 


3.6.6 Robust and median regression example 


First, we use the rreg command to implement the Huber-style robust 
regression using the expenditure data. 


. * Robust regression as a check on fat tails
. qui use mus203mepsmedexp, replace

. rreg ltotexp suppins phylim actlim totchr age female income, genwt(w)

   Huber iteration 1: maximum difference in weights = .75846426
   Huber iteration 2: maximum difference in weights = .05754505
   Huber iteration 3: maximum difference in weights = .0112007
Biweight iteration 4: maximum difference in weights = .29408276
Biweight iteration 5: maximum difference in weights = .0092795

Robust regression                               Number of obs =  2,955
                                                F(  7,  2947) = 119.22
                                                Prob > F      = 0.0000

     ltotexp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
     suppins |   .2469393   .0451418     5.47   0.000     .1584266     .335452
      phylim |    .330684   .0556342     5.94   0.000     .2215982    .4397699
      actlim |   .3533665   .0606545     5.83   0.000     .2344371    .4722958
      totchr |   .3410272   .0179905    18.96   0.000     .3057521    .3763024
         age |   .0056581   .0035703     1.58   0.113    -.0013425    .0126587
      female |  -.0843941   .0444756    -1.90   0.058    -.1716005    .0028123
      income |   .0026742   .0009955     2.69   0.007     .0007223    .0046261
       _cons |   6.643979   .2702664    24.58   0.000     6.114049    7.173909

. estimates store ROB_DEF


The robust regression method necessarily presumes model errors are 
homoskedastic, so there is no heteroskedastic-robust option.


The option genwt(w) saves in variable w the weights used in the rreg
procedure. Summary statistics for the weights follow: 


. * Weights used for robust regression
. sum w, detail

                  Robust Regression Weight

      Percentiles      Smallest
 1%      .331044              0
 5%     .6070053              0
10%     .7331903              0       Obs                 2,955
25%     .8774092       .0079872       Sum of wgt.         2,955

50%     .9598379                      Mean             .9040597
                        Largest       Std. dev.        .1400978
75%     .9911338              1
90%     .9983704              1       Variance         .0196274
95%     .9995538              1       Skewness        -2.657079
99%     .9999782              1       Kurtosis         11.72245

. drop w


Regarding the weights, 75% are between 0.88 and 1.0, and only 1% have 
weight less than 0.33. This suggests that there was little need for robust 
regression here. 


Next, we obtain median regression estimates, the default for the qreg 
command, with heteroskedastic-robust standard errors reported.


. * Median regression with heteroskedastic-robust standard errors
. qreg ltotexp suppins phylim actlim totchr age female income, nolog vce(robust)

Median regression                                   Number of obs =      2,955
  Raw sum of deviations  1555.48 (about 8.111928)
  Min sum of deviations 1368.373                    Pseudo R2     =     0.1203

             |               Robust
     ltotexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
     suppins |   .3019549   .0534219     5.65   0.000     .1972069    .4067028
      phylim |    .323097   .0657772     4.91   0.000     .1941231    .4520709
      actlim |   .3199463   .0730748     4.38   0.000     .1766635    .4632291
      totchr |   .3244109   .0208302    15.57   0.000     .2835676    .3652541
         age |   .0049243   .0041258     1.19   0.233    -.0031654     .013014
      female |   -.114351   .0533723    -2.14   0.032    -.2190017   -.0097003
      income |   .0029569   .0006201     4.77   0.000      .001741    .0041728
       _cons |   6.697871   .3133837    21.37   0.000     6.083398    7.312344

. estimates store MED_ROB


In this example, we have used the nolog option to suppress the iteration log
that is produced by nonlinear estimation commands such as qreg. More
generally, the command set iterlog off will suppress the iteration log in
all commands that by default produce an iteration log.


Finally, to facilitate comparison, we present a table with results from the 
regress, qreg, and rreg commands. The first three columns report default 
standard errors, while the last two columns report heteroskedastic-robust
standard errors that are available for OLS and median regression but not for 
robust regression. 


. * Compare OLS, robust regression, and median regression 
. qui regress ltotexp suppins phylim actlim totchr age female income 


. estimates store OLS_DEF 

. qui regress ltotexp suppins phylim actlim totchr age female income, vce(robust) 
. estimates store OLS_ROB 

. qui qreg ltotexp suppins phylim actlim totchr age female income, nolog 

. estimates store MED_DEF 

. estimates table OLS_DEF MED_DEF ROB_DEF OLS_ROB MED_ROB, b(%10.4f) se stats(N) 


Variable OLS_DEF MED_DEF ROB_DEF OLS_ROB MED_ROB 
suppins 0.2556 0.3020 0.2469 0.2556 0.3020 
0.0462 0.0541 0.0451 0.0466 0.0534 
phylim 0.3021 0.3231 0.3307 0.3021 0.3231 
0.0570 0.0666 0.0556 0.0577 0.0658 
actlim 0.3560 0.3199 0.3534 0.3560 0.3199 
0.0621 0.0726 0.0607 0.0634 0.0731 
totchr 0.3758 0.3244 0.3410 0.3758 0.3244 
0.0184 0.0215 0.0180 0.0187 0.0208 
age 0.0038 0.0049 0.0057 0.0038 0.0049 
0.0037 0.0043 0.0036 0.0037 0.0041 
female -0.0843 -0.1144 -0.0844 -0.0843 -0.1144 
0.0455 0.0533 0.0445 0.0457 0.0534 
income 0.0025 0.0030 0.0027 0.0025 0.0030 
0.0010 0.0012 0.0010 0.0010 0.0006 
_cons 6.7037 6.6979 6.6440 6.7037 6.6979 
0.2768 0.3237 0.2703 0.2826 0.3134 
N 2955 2955 2955 2955 2955 


Legend: b/se 


In general, the OLS and median regression parameter estimates are not
directly comparable, because OLS models the conditional mean while median
regression models the conditional median. Here the use of the logarithmic
transformation applied to totexp makes the distribution more symmetric, in
which case the median and mean are similar, so the two estimates are
similar. And the difference between OLS and robust regression estimates is
not great here, due again to the logarithmic transformation pulling in
outliers and also because the leverage of the outliers or extreme
observations, or both, is small given a sample that is quite large.


It is interesting to repeat the exercise using the highly skewed variable
totexp in place of ln(totexp) as the dependent variable. We expect that the
differences between the three estimators in this case will be larger. We obtain


. * OLS, robust, and median regression for positive level of total expenditures 
. replace totexp = . if totexp <= 0 
(109 real changes made, 109 to missing) 


. qui rreg totexp suppins phylim actlim totchr age female income, genwt(w) 

. estimates store ROB_LEVEL 

. qui qreg totexp suppins phylim actlim totchr age female income, vce(robust) 

. estimates store MED_LEVEL 

. qui regress totexp suppins phylim actlim totchr age female income, vce(robust) 
. estimates store OLS_LEVEL 

. estimates table OLS_LEVEL MED_LEVEL ROB_LEVEL, b(%10.4f) se stats(N) 


    Variable |  OLS_LEVEL    MED_LEVEL    ROB_LEVEL
     suppins |   724.8632     784.3740     725.8082
             |   427.3045     131.0828     115.1799
      phylim |  2389.0186     901.9890     638.8099
             |   544.3493     245.2875     141.9514
      actlim |  3900.4908    1539.7285     673.1755
             |   705.2244     349.2155     154.7606
      totchr |  1844.3769    1114.7952     823.2710
             |   186.8938      70.5389      45.9029
         age |   -85.3626      -1.8305      17.4398
             |    37.8187      11.3550       9.1098
      female | -1383.2898    -131.1867     -77.9987
             |   432.4759     149.8859     113.4801
      income |     6.4689       7.1615       6.0682
             |     8.5707       3.3079       2.5400
       _cons |  8358.9539     512.7608    -230.2268
             |  2847.8024     848.4860     689.5881
           N |       2955         2955         2955

Legend: b/se


The differences across estimation methods in the estimates are much greater,
especially for the coefficients of phylim, actlim, and female. The linear 
version has heteroskedastic and skewed errors, and the three estimators show 
different sensitivity to these features. Note that it would not be correct to 
attribute these differences to outliers in the regressors. As was seen in the 
previous table, there was very little variation between estimators in the case 
of regression with a log-transformed dependent variable. 
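
A compact version of this comparison can be run on any dataset. The following sketch uses the auto dataset shipped with Stata and an arbitrary specification, both of which are assumptions made for illustration; the stored-estimate names OLS, ROB, and MED and the weight variable w are labels introduced here.

```stata
* Sketch on Stata's shipped auto dataset (an assumption; not the book's data)
sysuse auto, clear

* OLS
quietly regress price weight length foreign
estimates store OLS

* Robust regression, keeping the final weights
quietly rreg price weight length foreign, genwt(w)
estimates store ROB

* Median regression (the qreg default)
quietly qreg price weight length foreign
estimates store MED

* Side-by-side coefficients and standard errors
estimates table OLS ROB MED, b(%10.3f) se stats(N)

* Weights well below 1 identify the observations that rreg downweights
summarize w, detail
```

Because price is a right-skewed variable measured in levels, the three sets of estimates can be expected to differ more than they would after a log transformation.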


3.7 Specification tests 


There are several ways in which a regression may be misspecified. Omitted 
variables and misspecified functional form are two important reasons for a 
regression either not fitting the data well or generating poor predictions in 
the postsample period. This provides motivation for statistical tests of 
functional forms. 


Linearity of the conditional mean function implies
$E(y|\mathbf{x}) = \mathbf{x}'\boldsymbol{\beta} = \sum_{j=1}^K x_j\beta_j$, which is a strong assumption. Specifically,
the MEs $\partial E(y|\mathbf{x})/\partial x_j = \beta_j$ are constant and independent of the level of $\mathbf{x}$.
If we would prefer a more flexible parametric functional form, the
conditional mean function can be modified, or else we can test for the
presence of nonlinearity of $E(y|\mathbf{x})$.


1. For a selected subset of regressors, we can generate nonlinearity by
adding polynomial terms such as $x_j^2$ and $x_j^3$ to the conditional mean
function. Using standard t tests or F tests, we can then check whether
the coefficients of these additional variables are statistically significant.

2. We can generate a set of interaction variables such as $x_{ij}x_{ik}$ and add
them to the conditional mean function. Again, using standard t tests or
F tests, we can check whether these variables generate a significant
nonlinearity in the conditional mean function.

3. Using an initial specification of the regression, we first fit the model.
Then using fitted values $\mathbf{x}_i'\widehat{\boldsymbol{\beta}}$, we generate new variables $(\mathbf{x}_i'\widehat{\boldsymbol{\beta}})^2$,
$(\mathbf{x}_i'\widehat{\boldsymbol{\beta}})^3$, ..., $(\mathbf{x}_i'\widehat{\boldsymbol{\beta}})^p$. These generated regressors are added to the
original regression, which is reestimated, and an F test of the joint
significance of the generated regressors is applied. Rejection of the null
hypothesis that the added variables are jointly insignificant is evidence
of departure from linearity. The test is known as the RESET test
(Cameron and Trivedi 2005, 277-278).


The first two of these three tests are less mechanical than the last because 
they motivate one to consider how the nonlinearity may arise and possible 
economic justification for any specific nonlinearity. For example, we may
have reason to expect that the ME with respect to $x_{ij}$ may depend upon $x_{ik}$.


Additionally, formal model-specification tests exist whereby rejection of 
the null hypothesis means that key assumptions of the model are rejected. 
Such tests have two limitations. First, a test for the failure of a specific 
model assumption may not be robust with respect to the failure of another 
assumption that is not under test. For example, the rejection of the null 
hypothesis of homoskedasticity may be due to a misspecified functional 
form for the conditional mean. An example is given in section 3.7.6. Second, 
with a very large sample, even trivial deviations from the null hypothesis of 
correct specification will cause the test to reject the null hypothesis. For 
example, if a previously omitted regressor has a very small coefficient, say, 
0.000001, then even with an infinitely large sample the estimate will be 
sufficiently precise that we will always reject the null of zero coefficient. 


3.7.1 Test of omitted variables 


In microeconometrics, the most common approach to deciding on the 
adequacy of a model is a Wald-test approach that fits a richer model and 
determines whether the data support the need for a richer model. For 
example, we may add additional regressors to the model and test whether 
they have a zero coefficient. 


The additional regressor may be a variable not already included, a 
transformation of a variable already included such as a quadratic in age, or a 
quadratic with interaction terms in age and education. If groups of regressors 
are included, such as a set of region dummies, test can be used after 
regress to perform a joint test of statistical significance. 
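
The following sketch shows this variable-addition approach on the auto dataset shipped with Stata. The dataset and the choice of added regressors are assumptions made for illustration.

```stata
* Sketch on Stata's shipped auto dataset (an assumption; not the book's data)
sysuse auto, clear

* Richer model: quadratic in weight plus a set of repair-record indicators
regress price weight c.weight#c.weight length i.rep78

* Test of a single added regressor (the squared term)
test c.weight#c.weight

* Joint test that all the rep78 indicators have zero coefficients
testparm i.rep78
```

Factor-variable notation keeps the squared term and the indicators tied to their underlying variables, so test and testparm can refer to them without generating new variables by hand.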


In some branches of biostatistics, it is common to include only regressors 
with p < 0.05. In microeconometrics, it is common instead to additionally 
include regressors that are statistically insignificant at, say, level 0.05 if 
economic theory or conventional practice includes the variable as a control. 
This reduces the likelihood of inconsistent parameter estimation due to 
omitted-variables bias at the expense of reduced precision in estimation. 


3.7.2 Test of the functional form of the conditional mean 


The linear regression model specifies that the conditional mean of the
dependent variable (whether measured in levels or in logs) equals $\mathbf{x}_i'\boldsymbol{\beta}$. A
standard test that this is the correct specification is the RESET variable 
augmentation test. 


The estat ovtest postestimation command provides a RESET test that
regresses $y$ on $\mathbf{x}$ and $\widehat{y}^2$, $\widehat{y}^3$, and $\widehat{y}^4$ and jointly tests that the coefficients of
$\widehat{y}^2$, $\widehat{y}^3$, and $\widehat{y}^4$ are 0. We have


. * Variable augmentation test of conditional mean using estat ovtest 
. qui regress ltotexp suppins phylim actlim totchr age female income, vce(robust) 


. estat ovtest 


Ramsey RESET test for omitted variables 
Omitted: Powers of fitted values of ltotexp 


H0: Model has no omitted variables 

F(3, 2944) =   9.04 
  Prob > F = 0.0000 


The model is strongly rejected because p = 0.000. 


An alternative, simpler test is provided by the linktest command. This 
regresses $y$ on $\hat{y}$ and $\hat{y}^2$, where now the original model regressors $\mathbf{x}$ are 
omitted, and it tests whether the coefficient of $\hat{y}^2$ is 0. We have 


. * Link test of functional form of conditional mean 
. qui regress ltotexp suppins phylim actlim totchr age female income, vce(robust) 


. linktest 

      Source |       SS           df       MS      Number of obs   =     2,955
             |                                     F(2, 2952)      =    454.81
       Model |  1301.41696         2  650.708481   Prob > F        =    0.0000
    Residual |  4223.47242     2,952  1.43071559   R-squared       =    0.2356
             |                                     Adj R-squared   =    0.2350
       Total |  5524.88938     2,954  1.87030785   Root MSE        =    1.1961

     ltotexp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
        _hat |   4.429216   .6779517     6.53   0.000      3.09991    5.758522
      _hatsq |  -.2084091   .0411515    -5.06   0.000    -.2890976   -.1277206
       _cons |  -14.01127   2.779936    -5.04   0.000    -19.46208    -8.56046


Again, the null hypothesis that the conditional mean is correctly specified is 
rejected. A likely reason is that so few regressors were included in the model 
for pedagogical reasons. 


The two preceding commands had different formats. The first test used 
the estat ovtest command, where estat produces various statistics 
following estimation, and the particular statistics available vary with the 
previous estimation command. The second test used linktest, which is 
available for a wider range of models. 


3.7.3 Test of levels versus logs 


A common specification-testing approach is to fit a richer model that tests 
the current model as a special case and performs a Wald test of the parameter 
restrictions that lead to the simpler model. The preceding omitted-variable 
test is an example. 


Here we consider a test specific to the current example. We want to 
decide whether a regression model for medical expenditures is better in logs 
than in levels. There is no obvious way to compare the two models because 
they have different dependent variables. However, the Box-Cox transform 
leads to a richer model that includes the linear and log-linear models as 
special cases. Specifically, we fit the model with the transformed dependent 
variable 


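$$ g(y_i, \theta) = \frac{y_i^{\theta} - 1}{\theta} = \mathbf{x}_i'\boldsymbol{\beta} + u_i $$


where $\theta$ and $\boldsymbol{\beta}$ are estimated under the assumption that $u_i \sim N(0, \sigma^2)$. 
Three leading cases are 1) $g(y, \theta) = y - 1$ if $\theta = 1$; 2) $g(y, \theta) = \ln y$ if 
$\theta = 0$; and 3) $g(y, \theta) = 1 - 1/y$ if $\theta = -1$. The log-linear model is 
supported if $\theta$ is close to 0, and the linear model is supported if $\theta = 1$. 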
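
The three leading cases can be checked directly from the definition of the transformation. The steps below are a short worked check that uses only the definition $g(y,\theta) = (y^{\theta} - 1)/\theta$, with L'Hopital's rule applied for the limiting case $\theta \to 0$:

```latex
\begin{align*}
\theta = 1:&\quad g(y, 1) = \frac{y^{1} - 1}{1} = y - 1, \\
\theta \to 0:&\quad \lim_{\theta \to 0} \frac{y^{\theta} - 1}{\theta}
   = \lim_{\theta \to 0} \frac{y^{\theta} \ln y}{1} = \ln y, \\
\theta = -1:&\quad g(y, -1) = \frac{y^{-1} - 1}{-1} = 1 - \frac{1}{y}.
\end{align*}
```

The case $\theta = 1$ differs from the untransformed variable only by the constant 1, which is absorbed into the intercept.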


The Box-Cox transformation introduces a nonlinearity and an additional 
unknown parameter $\theta$ into the model. This moves the modeling exercise into 
the domain of nonlinear models. The model is straightforward to fit, 
however, because Stata provides the boxcox command to fit the model. We 
obtain 


. * Boxcox model with lhs variable transformed 
. boxcox totexp suppins phylim actlim totchr age female income if totexp>0, nolog 
Fitting comparison model 

Fitting full model 

                                                  Number of obs   =      2,955
                                                  LR chi2(7)      =     773.02
Log likelihood = -28518.267                       Prob > chi2     =      0.000

      totexp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
      /theta |   .0758956   .0096386     7.87   0.000     .0570042    .0947869

Estimates of scale-variant parameters 

             | Coefficient
Notrans      |
     suppins |   .4459618
      phylim |    .577317
      actlim |   .6905939
      totchr |   .6754338
         age |   .0051321
      female |  -.1767976
      income |   .0044039
       _cons |   8.930566
      /sigma |   2.189679

   Test          Restricted      LR statistic
    H0:        log likelihood        chi2        Prob > chi2
theta = -1       -37454.643        17872.75         0.000
theta =  0       -28550.353           64.17         0.000
theta =  1       -31762.809         6489.08         0.000


The null hypothesis of $\theta = 0$ is strongly rejected, so the log-linear model is 
rejected. However, the Box-Cox model with general $\theta$ is difficult to interpret 
and use, and the estimate of $\hat{\theta} = 0.0759$ gives much greater support for a 
log-linear model ($\theta = 0$) than the linear model ($\theta = 1$). Thus, we prefer to 
use the log-linear model. 


3.7.4 Heteroskedasticity test 


One consequence of heteroskedasticity is that default OLS standard errors are 
incorrect. This can be readily corrected and guarded against by routinely 
using heteroskedasticity-robust standard errors. 


Nonetheless, there may be interest in formally testing whether 
heteroskedasticity is present. For example, the retransformation methods for 
the log-linear model used in section 4.2.3 assume homoskedastic errors. In 
section 6.3, we present diagnostic plots for heteroskedasticity. Here we 
instead present a formal test. 


A quite general model of heteroskedasticity is 


$$ \mathrm{Var}(y|\mathbf{x}) = h(\alpha_1 + \mathbf{z}'\boldsymbol{\alpha}_2) $$


where $h(\cdot)$ is a positive monotonic function such as $\exp(\cdot)$ and the variables 
in $\mathbf{z}$ are functions of the variables in $\mathbf{x}$. Tests for heteroskedasticity are tests 
of 


$$ H_0: \boldsymbol{\alpha}_2 = \mathbf{0} $$


and can be shown to be independent of the choice of function $h(\cdot)$. We reject 
$H_0$ at the $\alpha$ level if the test statistic exceeds the $\alpha$ critical value of a 
chi-squared distribution with degrees of freedom equal to the number of 
components of $\mathbf{z}$. The test is performed by using the estat hettest 
postestimation command. The simplest version is the Breusch-Pagan 
Lagrange multiplier test, which is equal to $N$ times the uncentered explained 
sum of squares from the regression of the squared residuals on an intercept 
and $\mathbf{z}$. We use the iid option to obtain a different version of the test that 
relaxes the default assumption that the errors are normally distributed. 


Several choices of the components of z are possible. By far, the best 
choice is to use variables that are a priori likely determinants of 
heteroskedasticity. For example, in regressing the level of earnings on 
several regressors, including years of schooling, one will likely find that 
those with many years of schooling have the greatest variability in earnings. 


Such candidates rarely exist. Instead, standard choices are to use the OLS 
fitted value $\hat{y}$, the default for estat hettest, or to use all the regressors so 
$\mathbf{z} = \mathbf{x}$. White's test for heteroskedasticity is equivalent to letting $\mathbf{z}$ equal 
unique terms in the products and cross products of the terms in $\mathbf{x}$. 


We consider tests of heteroskedasticity with both $\mathbf{z} = \hat{y}$ and $\mathbf{z} = \mathbf{x}$. Then 
we have 


. * Heteroskedasticity tests using estat hettest and option iid 
. qui regress ltotexp suppins phylim actlim totchr age female income 

. estat hettest, iid 

Breusch-Pagan/Cook-Weisberg test for heteroskedasticity 
Assumption: i.i.d. error terms 
Variable: Fitted values of ltotexp 

H0: Constant variance 

    chi2(1) =  32.87 
Prob > chi2 = 0.0000 

. estat hettest suppins phylim actlim totchr age female income, iid 

Breusch-Pagan/Cook-Weisberg test for heteroskedasticity 
Assumption: i.i.d. error terms 
Variables: suppins phylim actlim totchr age female income 

H0: Constant variance 

    chi2(7) =  93.13 
Prob > chi2 = 0.0000 


Both versions of the test, with $\mathbf{z} = \hat{y}$ and with $\mathbf{z} = \mathbf{x}$, have p = 0.0000 and 
strongly reject homoskedasticity. 


3.7.5 Omnibus test 


An alternative to separate tests of misspecification is an omnibus test, which 
is a joint test of misspecification in several directions. A leading example is 
the information matrix (IM) test (see section 11.9), which is a test for correct 
specification of a fully parametric model based on whether the IM equality 
holds. For linear regression with normal homoskedastic errors, the IM test 
can be shown to be a joint test of heteroskedasticity, skewness, and 
nonnormal kurtosis compared with the null hypothesis of homoskedasticity, 
symmetry, and kurtosis coefficient of 3; see Hall (1987). 


The estat imtest postestimation command computes the joint IM test 
and also splits it into its three components. We obtain 


. * Information matrix test 
. qui regress ltotexp suppins phylim actlim totchr age female income 


. estat imtest 


Cameron & Trivedi's decomposition of IM-test 

              Source |       chi2     df         p
  Heteroskedasticity |     139.90     31    0.0000
            Skewness |      35.11      7    0.0000
            Kurtosis |      11.96      1    0.0005
               Total |     186.97     39    0.0000


The overall joint IM test rejects the model assumption that $y \sim N(\mathbf{x}'\boldsymbol{\beta}, \sigma^2 \mathbf{I})$ 
because p = 0.0000 in the Total row. The decomposition indicates that all 
three assumptions of homoskedasticity, symmetry, and normal kurtosis are 
rejected. Note, however, that the decomposition assumes correct 
specification of the conditional mean. If instead the mean is misspecified, 
then that could be the cause of rejection of the model by the IM test. 


3.7.6 Tests have power in more than one direction 


Tests can have power in more than one direction, so that if a test targeted to a 
particular type of model misspecification rejects a model, it is not 
necessarily the case that this particular type of model misspecification is the 
underlying problem. For example, a test of heteroskedasticity may reject 
homoskedasticity, even though the underlying cause of rejection is that the 
conditional mean is misspecified rather than that errors are heteroskedastic. 


To illustrate this example, we use the following simulation exercise. The 
DGP is one with homoskedastic normal errors, 


$$ y_i = \exp(1 + 0.25 \times x_i + 4 \times x_i^2) + u_i $$


We instead fit a model with a misspecified conditional mean function: 


$$ y = \beta_0 + \beta_1 x + \beta_2 x^2 + v $$


We consider a simulation with a sample size of 50. We generate the 
regressors and the dependent variable by using commands detailed in 
section 5.2. We obtain 


. * Simulation to show tests have power in more than one direction 
. clear all 

. set obs 50 
Number of observations (_N) was 0, now 50. 

. set seed 10101 
. generate x = runiform()                    // x ~ uniform(0,1) 
. generate u = rnormal()                     // u ~ N(0,1) 
. generate y = exp(1 + 0.25*x + 4*x^2) + u 
. generate xsq = x^2 

. regress y x xsq 


Source SS df MS Number of obs = 50 
F(2, 47) = 182.17 

Model 19660.0057 2 9830.00286 Prob > F = 0.0000 
Residual 2536.17254 47 53.9611179 R-squared = 0.8857 
Adj R-squared = 0.8809 

Total 22196.1783 49 452.98323 Root MSE = 7.3458 

y | Coefficient Std. err. t P>|t| [95% conf. interval] 

x -101.1834 14.73595 -6.87 0.000 -130.8283 -71.53845 

xsq 186.1681 16.56718 11.24 0.000 152.8392 219.4969 
_cons 11.48277 2.590577 4.43 0.000 6.271202 16.69434 


The misspecified model seems to fit the data very well with highly 
statistically significant regressors and an $R^2$ of 0.89. 


Now consider a test for heteroskedasticity: 


. * Test for heteroskedasticity 
. estat hettest 

Breusch-Pagan/Cook-Weisberg test for heteroskedasticity 
Assumption: Normal error terms 
Variable: Fitted values of y 

H0: Constant variance 

    chi2(1) =  62.41 
Prob > chi2 = 0.0000 


This test strongly suggests that the errors are heteroskedastic because 
p = 0.0000, even though the DGP had homoskedastic errors. 


The problem is that the regression function itself was misspecified. A 
RESET test yields 


. * Test for misspecified conditional mean 
. estat ovtest 


Ramsey RESET test for omitted variables 
Omitted: Powers of fitted values of y 


H0: Model has no omitted variables 

F(3, 44) = 824.55 
Prob > F = 0.0000 


This strongly rejects correct specification of the conditional mean because 
p = 0.0000. 


Going the other way, could misspecification of other features of the 
model lead to rejection of the conditional mean, even though the conditional 
mean itself was correctly specified? This is an econometrically subtle 
question. The answer, in general, is yes. However, for the linear regression 
model, this is not the case essentially because consistency of the OLS 
estimator requires only that the conditional mean be correctly specified. 


3.8 Sampling weights 


The analysis to date has presumed simple random sampling, where sample 
observations have been drawn from the population with equal probability. In 
practice, however, many microeconometric studies use data from surveys 
that are not representative of the population. Instead, groups of key interest 
to policymakers that would have too few observations in a purely random 
sample are oversampled, with other groups then undersampled. Examples 
are individuals from racial minorities or those with low income or living in 
sparsely populated states. 


As explained below, weights should be used for estimation of population 
means and for postregression prediction and computation of MEs. However, 
in most cases, the regression itself can be fit without weights, as is the norm 
in microeconometrics. If weighted analysis is desired, it can be done using 
standard commands with a weighting option, which is the approach of this 
section and the standard approach in microeconometrics. Alternatively, one 
can use survey commands as detailed in section 6.9. 


3.8.1 Weights 


Sampling weights are provided by most survey datasets. These are called 
probability weights or pweights in Stata, though some others call them 
inverse-probability weights because they are inversely proportional to the 
probability of inclusion in the sample. A pweight of 1,400 in a survey of the 
U.S. population, for example, means that the observation is representative of 
1,400 U.S. residents and that the probability of this observation being 
included in the sample is 1/1400. 


Most estimation commands allow probability-weighted estimators that 
are obtained by adding [pweight=weight], where weight is the name of the 
weighting variable. 


To illustrate the use of sampling weights, we create an artificial 
weighting variable (sampling weights are available for the Medical 
Expenditure Panel Survey data but were not included in the data extract used 
in this chapter). We manufacture weights that increase the weight given to 
those with more chronic problems. In practice, such weights might arise if 
the original sampling framework oversampled people with few chronic 
problems and undersampled people with many chronic problems. In this 
section, we analyze levels of expenditures, including expenditures of zero. 
Specifically, 


. * Create artificial sampling weights 
. qui use mus203mepsmedexp, clear 


. generate swght = totchr^2 + 0.5 
. summarize swght 

    Variable |        Obs        Mean    Std. dev.       Min        Max
       swght |      3,064    5.285574    6.029423         .5       49.5


What matters in subsequent analysis is the relative values of the sampling 
weights rather than the absolute values. The sampling weight variable swght 
takes on values from 0.5 to 49.5, so weighted analysis will give some 
observations as much as 49.5/0.5 = 99 times the weight given to others. 
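
One way to see that only the relative values of the weights matter is to rescale the weights and compare results. The sketch below uses simulated data; the variables y, w, and w100 are hypothetical and are not part of the chapter dataset. Under these assumptions, the weighted mean and its estimated variance should be unchanged by the rescaling, so the two relative differences displayed at the end should be zero up to rounding error.

```stata
* Hypothetical simulated data to check that rescaling pweights is innocuous
clear
set obs 200
set seed 10101
generate y = rnormal(10, 2)
generate w = 0.5 + 5*runiform()     // hypothetical sampling weights
generate w100 = 100*w               // same relative weights, different scale

mean y [pweight=w]
matrix b1 = e(b)
matrix V1 = e(V)

mean y [pweight=w100]
matrix b2 = e(b)
matrix V2 = e(V)

* Relative differences between the two sets of results
display mreldif(b1, b2)
display mreldif(V1, V2)
```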


Stata offers three other types of weights that for most analyses can be 
ignored. Analytical weights, called aweights, are used for the quite different 
purpose of compensating for different observations having different 
variances that are known up to scale; see section 6.3.4. For duplicated 
observations, fweights provide the number of duplicated observations. So- 
called importance weights, or iweights, are sometimes used in more 
advanced programming. 


3.8.2 Weighted mean 


If an estimate of a population mean is desired, then we should clearly 
weight. In this example, by oversampling those with few chronic problems, 
we will have oversampled people who on average have low medical 
expenditures, so that the unweighted sample mean will understate population 
mean medical expenditures. 


Let $w_i$ be the population weight for individual $i$. Then, by defining 
$W = \sum_{i=1}^{N} w_i$ to be the sum of the weights, we see the weighted mean $\bar{y}_W$ 
is 


$$ \bar{y}_W = \frac{1}{W} \sum_{i=1}^{N} w_i y_i $$


with variance estimator (assuming independent observations) 
$\hat{V}(\bar{y}_W) = \{1/W(W - 1)\} \sum_{i=1}^{N} w_i (y_i - \bar{y}_W)^2$. These formulas reduce to 
those for the unweighted mean if equal weights are used. 
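
As a short worked check of the equal-weights case, set $w_i = 1$ for all $i$, so that $W = N$. The weighted mean is $(1/W)\sum_i w_i y_i$ and the variance estimator is $\{1/W(W-1)\}\sum_i w_i (y_i - \bar{y}_W)^2$, and substitution gives the familiar unweighted formulas:

```latex
\begin{align*}
\bar{y}_W &= \frac{1}{N} \sum_{i=1}^{N} y_i = \bar{y}, \\
\widehat{V}(\bar{y}_W) &= \frac{1}{N(N-1)} \sum_{i=1}^{N} (y_i - \bar{y})^2
   = \frac{s^2}{N}, \qquad
   s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2 .
\end{align*}
```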


The weighted mean downweights oversampled observations because 
they will have a value of pweights (and hence $w_i$) that is smaller than that 
for most observations. We have 


. * Calculate the weighted mean 
. mean totexp [pweight=swght] 


Mean estimation Number of obs = 3,064 


Mean Std. err. [95% conf. interval] 


totexp 10670.83 428.5148 9830.62 11511.03 


The weighted mean of $10,671 is much larger than the unweighted mean of 
$7,031 (see section 3.2.4) because the unweighted mean does not adjust for 
the oversampling of individuals with few chronic problems. 


3.8.3 Weighted regression 


The weighted least-squares estimator for the regression of $y_i$ on $\mathbf{x}_i$ with the 
weights $w_i$ is given by 


$$ \hat{\boldsymbol{\beta}}_W = \left( \sum_{i=1}^{N} w_i \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \sum_{i=1}^{N} w_i \mathbf{x}_i y_i $$


The OLS estimator is the special case of equal weights with $w_i = w_j$ for all $i$ 
and $j$. The regress command when used with weights reports by default an 
estimator of the VCE that is a weighted version of the heteroskedasticity- 
robust version in (3.3), which assumes independent observations. So if errors 
are heteroskedastic, the vce(robust) option is redundant, while if 
observations are clustered, then the option vce(cluster clustvar) should 
be used. 
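
The default use of a robust VCE with probability weights can be illustrated with a small simulated example. The data and the variable names x, w, and y below are hypothetical. Our understanding is that regress with pweights yields the same point estimates and standard errors as regress with aweights and vce(robust); the sketch places the two sets of results side by side, together with the aweights results using the default VCE, so that this can be checked rather than taken as given.

```stata
* Hypothetical simulated data with heteroskedastic errors
clear
set obs 300
set seed 10101
generate x = rnormal()
generate w = 0.5 + 5*runiform()              // hypothetical sampling weights
generate y = 1 + 2*x + (1 + abs(x))*rnormal()

* Probability weights: robust standard errors are reported by default
regress y x [pweight=w]
estimates store pw

* Analytic weights with the robust VCE requested explicitly
regress y x [aweight=w], vce(robust)
estimates store aw_robust

* Analytic weights with the default (nonrobust) VCE
regress y x [aweight=w]
estimates store aw_default

* Coefficients agree in all three columns; compare the standard errors
estimates table pw aw_robust aw_default, se
```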


For our data example, we obtain 


. * Perform weighted regression 
. regress totexp suppins phylim actlim totchr age female income [pweight=swght] 
(sum of wgt is 16,195) 


Linear regression                               Number of obs     =      3,064
                                                F(7, 3056)        =      14.08
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0977
                                                Root MSE          =      13824

             |               Robust
      totexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
     suppins |   278.1578   825.6959     0.34   0.736    -1340.818    1897.133
      phylim |    2484.52   933.7116     2.66   0.008     653.7541    4315.286
      actlim |   4271.154   1024.686     4.17   0.000     2262.011    6280.296
      totchr |   1819.929   349.2234     5.21   0.000     1135.193    2504.666
         age |   -59.3125   68.01237    -0.87   0.383    -192.6671    74.04212
      female |  -2654.432   911.6422    -2.91   0.004    -4441.926   -866.9381
      income |   5.042348    16.6509     0.30   0.762    -27.60575    37.69045
       _cons |   7336.758   5263.377     1.39   0.163    -2983.359    17656.87


The estimated coefficients of all statistically significant variables aside from 
female are within 10% of those from unweighted regression (not given for 
brevity). Big differences between weighted and unweighted regression 
would indicate that $E(u_i|\mathbf{x}_i) \neq 0$ because of model misspecification. Note 
that heteroskedasticity-robust standard errors are reported by default. 


Although the weighted estimator is easily obtained, for legitimate 
reasons many microeconometric analyses do not use weighted regression 
even where sampling weights are available. We provide a brief explanation 
of this conceptually difficult issue. For a more complete discussion, see 
Cameron and Trivedi (2005, 818-821) and Solon, Haider, and 
Wooldridge (2015). 


Weighted regression should be used if a census parameter estimate is 
desired. For example, suppose we want to obtain an estimate for the 
U.S. population of the average change in earnings associated with one more 
year of schooling. Then, if disadvantaged minorities are oversampled, we 
most likely will understate the earnings increase because disadvantaged 
minorities are likely to have earnings that are lower than average for their 
given level of schooling. A second example is when aggregate state-level 
data are used in a natural experiment setting, where the goal is to measure 
the effect of an exogenous policy change that affects some states and not 
other states. Intuitively, the impact on more populous states should be given 
more weight. Note that these estimates are being given a correlative rather 
than a causal interpretation. 


Weighted regression is not needed if we make the stronger assumptions 
that the DGP is the specified model $y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$ and sufficient controls are 
assumed to be added so that the error satisfies $E(u_i|\mathbf{x}_i) = 0$. This approach, 
called a model approach, is the approach usually taken in microeconometric studies 
that emphasize a causal interpretation of regression. Under the assumption 
that $E(u_i|\mathbf{x}_i) = 0$, the weighted least-squares estimator will be consistent 
for $\boldsymbol{\beta}$ for any choice of weights including equal weights, and if $u_i$ is 
homoskedastic, the most efficient estimator is the OLS estimator, which uses 
equal weights. For the assumption that $E(u_i|\mathbf{x}_i) = 0$ to be reasonable, the 
determinants of the sampling frame should be included in the controls $\mathbf{x}$ and 
should not be directly determined by the dependent variable $y$. 


These points carry over directly to nonlinear regression models. In most 
cases, microeconometric analyses take on a model approach. In that case, 
unweighted estimation is appropriate, with any weighting based on 
efficiency grounds. 


Leading cases where weights are used are the following. First, if a 
census-parameter approach is being taken, then it is necessary to weight. 
Second, for panel data where some individuals leave the panel over time, 
introducing possible selection bias, then we may weight by the inverse of the 
probability of leaving the panel; see section 19.11. Third, if individuals self- 
select into a treatment program, then one way to estimate the effect of 
treatment on an outcome, while controlling for this self-selection, is to 
weight by the inverse of the predicted probability of self-selection into the 
program; see section 24.6.2. 


3.8.4 Weighted prediction and MEs 


After regression, unweighted prediction will provide an estimate of the 
sample-average value of the dependent variable. We may instead want to 
estimate the population-mean value of the dependent variable. Then 
sampling weights should be used in forming an average prediction. 


This point is particularly easy to see for OLS regression. Because 
$(1/N) \sum_i (y_i - \hat{y}_i) = 0$, given that in-sample residuals sum to zero if an 
intercept is included, the average prediction $(1/N) \sum_i \hat{y}_i$ equals the sample 
mean $\bar{y}$. But given an unrepresentative sample, the unweighted sample mean 
$\bar{y}$ may be a poor estimate of the population mean. Instead, we should use the 
weighted average prediction $(1/N) \sum_i w_i \hat{y}_i$ (with the weights scaled to sum 
to $N$; equivalently, $(1/W) \sum_i w_i \hat{y}_i$ in the notation of section 3.8.2), even if 
$\hat{y}_i$ is obtained by using unweighted regression. 
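
The following sketch illustrates these points with simulated data. The variables x, y, w, and yhat are hypothetical, and the weights are constructed as a function of the regressor x so that the regression controls for the determinant of the (artificial) unrepresentative sampling. After unweighted OLS with an intercept, the average prediction should coincide with the sample mean of y, whereas the weighted average prediction will generally differ.

```stata
* Hypothetical simulated data; weights depend on the included regressor x
clear
set obs 1000
set seed 10101
generate x = rnormal()
generate y = 1 + 2*x + rnormal()
generate w = exp(x)                  // hypothetical sampling weights

* Unweighted OLS and in-sample prediction
regress y x
predict yhat

* Average prediction equals the sample mean of y (residuals sum to zero)
summarize y, meanonly
display "Sample mean of y        = " r(mean)
summarize yhat, meanonly
display "Unweighted mean of yhat = " r(mean)

* Weighted average prediction, computed as (1/W) times the sum of w*yhat
mean yhat [pweight=w]
```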


For this to be useful, however, the prediction should be based on a model 
that includes as regressors variables that control for the unrepresentative 
sampling. 


For our example, we obtain the weighted prediction by typing 


. * Weighted prediction 
. qui predict yhatwols 


. mean yhatwols [pweight=swght], noheader 


Mean Std. err. [95% conf. interval] 
yhatwols 10670.83 138.0828 10400.08 10941.57 
. mean yhatwols, noheader // Unweighted prediction 
Mean Std. err. [95% conf. interval] 
yhatwols 7135.206 78.57376 6981.144 7289.269 


The population mean for medical expenditures is predicted to be $10,671 
using weighted prediction, whereas the unweighted prediction gives a much 
lower value of $7,135. 


Weights similarly should be used in computing average MEs. For the 
linear model, the standard ME $\partial E(y_i|\mathbf{x}_i)/\partial x_{ij}$ equals $\beta_j$ for all observations, 
so weighting will make no difference in computing the ME. Weighting will 
make a difference for averages of other MEs, such as elasticities, and for MEs 
in nonlinear models. 


3.9 OLS using Mata 


Stata offers two different ways to perform computations using matrices: 
Stata matrix commands and Mata functions (which are discussed, 
respectively, in appendixes A and B). 


Mata is much richer because it is a matrix programming language. We 
illustrate the use of Mata by using the same OLS regression as that in 
section 3.5.2. 


The program is written for the dependent variable provided in the local 
macro y and the regressors in the local macro xlist. We begin by reading in 
the data and defining the local macros. 


. * OLS with White robust standard errors using Mata 
. qui use mus203mepsmedexp, clear 


. keep if totexp > 0 // Analysis for positive medical expenditures only 
(109 observations deleted) 


. generate cons = 1 
. local y ltotexp 


. local xlist suppins phylim actlim totchr age female income cons 


We then move into Mata. The st_view() Mata function is used to 
transfer the Stata data variables to Mata matrices y and X, with tokens("") 
added to convert `xlist' to a comma-separated list with each entry in 
double quotes, necessary for st_view(). 


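The key part of the program forms $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ and 
$\hat{V}(\hat{\boldsymbol{\beta}}) = \{N/(N - K)\} (\mathbf{X}'\mathbf{X})^{-1} (\sum_i \hat{u}_i^2 \mathbf{x}_i \mathbf{x}_i') (\mathbf{X}'\mathbf{X})^{-1}$. The cross-product 
function cross(X,X) is used to form $\mathbf{X}'\mathbf{X}$ because this handles missing 
values and is more efficient than the more obvious X'*X. The matrix inverse 
is formed by using cholinv() because this is the fastest method in the 
special case that the matrix is symmetric positive definite. We calculate the 
$K \times K$ matrix $\sum_i \hat{u}_i^2 \mathbf{x}_i \mathbf{x}_i'$ as $\sum_i (\hat{u}_i \mathbf{x}_i')'(\hat{u}_i \mathbf{x}_i') = \mathbf{A}'\mathbf{A}$, where the $N \times K$ 
matrix $\mathbf{A}$ has an $i$th row equal to $\hat{u}_i \mathbf{x}_i'$. Now $\hat{u}_i \mathbf{x}_i'$ equals the $i$th row of the 
$N \times 1$ residual vector $\hat{\mathbf{u}}$ times the $i$th row of the $N \times K$ regressor matrix $\mathbf{X}$, 
so $\mathbf{A}$ can be computed by element-by-element multiplication of $\hat{\mathbf{u}}$ by $\mathbf{X}$, or 
(e:*X), where e is $\hat{\mathbf{u}}$. Alternatively, $\sum_i \hat{u}_i^2 \mathbf{x}_i \mathbf{x}_i' = \mathbf{X}'\mathbf{D}\mathbf{X}$, where $\mathbf{D}$ is an 
$N \times N$ diagonal matrix with entries $\hat{u}_i^2$, but the matrix $\mathbf{D}$ becomes 
exceptionally large, unnecessarily so, for a large $N$. 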
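
The equivalence of the two ways of forming the middle matrix of the robust VCE can be checked on a small artificial example. In the Mata sketch below, the matrices X and e are simulated and hypothetical, standing in for the regressor matrix and residual vector. The three-argument form cross(X, w, Z), which we understand to compute X'diag(w)Z without forming the N x N diagonal matrix, is included as a further alternative. The two relative differences reported should be zero up to rounding error.

```stata
mata:
rseed(10101)
N = 6
X = (rnormal(N, 2, 0, 1), J(N, 1, 1))   // hypothetical N x 3 regressor matrix
e = rnormal(N, 1, 0, 1)                 // hypothetical N x 1 residual vector

A  = e:*X                               // ith row of A is e[i] times X[i,.]
M1 = cross(A, A)                        // A'A
M2 = X'*diag(e:^2)*X                    // X'DX with D formed explicitly (N x N)
M3 = cross(X, e:^2, X)                  // weighted cross product, no N x N matrix

mreldif(M1, M2)                         // relative difference of A'A and X'DX
mreldif(M1, M3)                         // relative difference using weighted cross()
end
```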


The Mata program concludes by using st_matrix() to pass the 
estimated $\hat{\boldsymbol{\beta}}$ and $\hat{V}(\hat{\boldsymbol{\beta}})$ back to Stata. 


. mata 
------------------------------------------------- mata (type end to exit) -----
: // Create y vector and X matrix from Stata dataset 
: st_view(y=., ., "`y'")                   // y is nx1 
: st_view(X=., ., tokens("`xlist'"))       // X is nxk 
: XXinv = cholinv(cross(X,X))              // XXinv is inverse of X'X 
: b = XXinv*cross(X,y)                     // b = [(X'X)^-1]*X'y 
: e = y - X*b 
: n = rows(X) 
: k = cols(X) 
: s2 = (e'e)/(n-k) 
: vdef = s2*XXinv                          // Default VCE not used here 
: vwhite = XXinv*((e:*X)'(e:*X)*n/(n-k))*XXinv   // Robust VCE 
: st_matrix("b",b')                        // Pass results from Mata to Stata 
: st_matrix("V",vwhite)                    // Pass results from Mata to Stata 
: end 


Once back in Stata, we use ereturn to display the results in a format 
similar to that for official commands, first assigning name stripes to the 
columns and rows of b and V. 


. * Use Stata ereturn display to present nicely formatted results 
. matrix colnames b = `xlist' 
. matrix colnames V = `xlist' 
. matrix rownames V = `xlist' 

. ereturn post b V 

. ereturn display 

             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
     suppins |   .2556428   .0465982     5.49   0.000     .1643119    .3469736
      phylim |   .3020598    .057705     5.23   0.000       .18896    .4151595
      actlim |   .3560054   .0634066     5.61   0.000     .2317308      .48028
      totchr |   .3758201   .0187185    20.08   0.000     .3391326    .4125077
         age |   .0038016   .0037028     1.03   0.305    -.0034558     .011059
      female |  -.0843275    .045654    -1.85   0.065    -.1738076    .0051526
      income |   .0025498   .0010468     2.44   0.015     .0004981    .0046015
        cons |   6.703737   .2825751    23.72   0.000       6.1499    7.257575


The results are exactly the same as those given in section 3.5.2, when we 
used regress with the vce(robust) option. 


3.10 Additional resources 


The key Stata references are [U] Stata User's Guide and [R] regress, 
[R] regress postestimation, [R] estimates, [R] predict, and [R] test. The 
material in this chapter appears in many econometrics texts, such as 
Greene (2018) and Hansen (2022a); see also Hansen (2022b) for univariate 
statistics. 


Stata version 15 introduced a suite of commands, including putdocx 
and dyndoc, for exporting estimation results, summary statistics, and graphs 
to produce formatted reports in Word, Excel, PDF, and HTML. Furthermore, 
this can be done in a dynamic manner, so that any change in Stata output is 
immediately incorporated in the document. For details, see [RPT] Stata 
Reporting Reference Manual. For formatted output, Stata 17 introduced the 
collect command and a more flexible version of the table command. 


3.11 Exercises 


1. Fit the model in section 3.5 using only the first 100 observations. 
Compute standard errors in three ways: default, heteroskedasticity-robust, and 
cluster-robust where clustering is on the number of chronic problems. 
Use estimates to produce a table with three sets of coefficients and 
standard errors, and comment on any appreciable differences in the 
standard errors. Construct a similar table for three alternative sets of 
heteroskedasticity-robust standard errors, obtained by using the 
vce(robust), vce(hc2), and vce(hc3) options, and comment on any 
differences between the different estimates of the standard errors. 

2. Fit the model in section 3.5 with robust standard errors reported. Test 
at 5% the joint significance of the demographic variables age, female, 
and income. Test the hypothesis that being male (rather than female) 
has the same impact on medical expenditures as aging 10 years. Fit the 
model under the constraint that $\beta_{\text{phylim}} = \beta_{\text{actlim}}$ by first typing 
constraint 1 phylim = actlim and then by using cnsreg with the 
constraints(1) option. 

3. Fit the model in section 3.6, and implement the RESET test manually by 
regressing $y$ on $\mathbf{x}$ and $\hat{y}^2$, $\hat{y}^3$, and $\hat{y}^4$ and jointly testing that the 
coefficients of $\hat{y}^2$, $\hat{y}^3$, and $\hat{y}^4$ are 0. To get the same results as estat 
ovtest, do you need to use default or robust estimates of the VCE in 
this regression? Comment. Similarly, implement linktest by 
regressing $y$ on $\hat{y}$ and $\hat{y}^2$ and testing that the coefficient of $\hat{y}^2$ is 0. To 
get the same results as linktest, do you need to use default or robust 
estimates of the VCE in this regression? Comment. 

4. Fit the model in section 3.6, and perform the standard Lagrange 
multiplier test for heteroskedasticity by using estat hettest with 
$\mathbf{z} = \mathbf{x}$. Then, implement the test manually as 0.5 times the explained 
sum of squares from the regression of $y_i$ on an intercept and $\mathbf{z}_i$, where 
$y_i = \{\hat{u}_i^2 / (1/N) \sum_j \hat{u}_j^2\} - 1$ and $\hat{u}_i$ is the residual from the original 
OLS regression. Next, use estat hettest with the iid option, and 
show that this test is obtained as $N \times R^2$, where $R^2$ is obtained from 
the regression of $\hat{u}_i^2$ on an intercept and $\mathbf{z}_i$. 

5. Using the DGP of section 3.7.6, generate a sample of size 100. Regress 
y on x and an intercept. Apply the RESET omitted variable test. Does 
the test indicate functional form misspecification? Next, use the estat 
imtest postestimation command to generate IM omnibus diagnostic 
test statistics. If you had applied only this test, what changes to your 
model specification would you make? Would you make additional 
tests before changing the model specification? 

6. One type of data transformation that is sometimes used in linear 
regression, though not much in econometrics, is variable 
standardization. To standardize a variable you subtract the sample 
mean of that variable from each observation and divide the result by 
the sample standard deviation of that variable. The resulting variable 
with mean zero and standard deviation one is independent of the 
measurement scale of original data. A standardized regression is a 
regression in which the dependent and regressor variables are all 
standardized before running the regression. The standardized 
regression coefficients are interpreted as measuring the effect of one 
standard deviation change in the regressor and hence are in a sense 
comparable across variables. Note that when a regressor is discrete, 
interpreting its impact in units of standard deviation is not natural. 
Reestimate the regression equation of section 3.5.2 after standardizing 
all variables. Are the insignificant coefficients the same as those in the 
previous regression? Which variable has the largest coefficient? Does 
that mean it is the most important variable in the regression? 


Chapter 4 
Linear regression extensions 


4.1 Introduction 


In this chapter, we consider several extensions of linear regression that are 
widely used in applied work. These extensions are consequences of the 
reality that an econometric investigation rarely concludes with the 
straightforward reporting of a linear regression. More often, the regression 
is a tool used for a deeper analysis of the data, for testing economic 
hypotheses, for making predictions (in-sample or out-of-sample), and so 
forth. These topics are considered in this chapter. 


The plausibility of an empirical application depends partially on the 
robustness of the fitted model. Hence, various tests considered in 
sections 3.6 and 3.7 are a natural preliminary to the applications considered 
in this chapter because they are designed to reveal weaknesses in the model 
specification. Ideally, one would fix the detected deficiencies of a model 
before proceeding with the applications considered in this chapter. 


All the applications considered are based on the conditional mean 
function, rather than on the higher conditional moments. This raises the 
question whether the misspecification of other features of the model, 
including higher moments, may lead to inconsistent estimation of the 
conditional mean, even if the model for the conditional mean is correctly 
specified. This is an econometrically subtle question. The answer, in 
general, is yes. However, for the linear regression model, this is not the case 
essentially because consistency of the OLS estimator requires only that the 
conditional mean function be correctly specified. 


Four main applications of linear regression are considered. These are 
prediction (in-sample and out-of-sample), estimation of various marginal 
effects (MEs), regression decomposition, and estimation of treatment effects 
based on difference in differences (DID). 


4.2 In-sample prediction 


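For the linear regression model, the estimator of the conditional mean of $y$
given $\mathbf{x} = \mathbf{x}_p$, $E(y|\mathbf{x}_p) = \mathbf{x}_p'\boldsymbol{\beta}$, is the conditional predictor $\widehat{y} = \mathbf{x}_p'\widehat{\boldsymbol{\beta}}$. We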
begin with prediction from a linear model for medical expenditures, because 
this is more straightforward, before turning to the log-linear model. 


We present prediction for each observation in the estimation sample in 
this section and out-of-sample prediction in the subsequent section. 


Further details on prediction are given in sections 13.5—13.7, where 
many methods are presented. Weighted average prediction to obtain a 
population-average prediction is presented in section 3.8. 


4.2.1 The predict command 


Predictions are obtained from the predict command after regress. The 
syntax for the predict command is 


predict [type] newvar [if] [in] [, statistic]

The user always provides a name for the created variable, newvar.


The default is to predict for all observations in the sample, aside from 
any observations with missing regressor values. The qualifier if e(sample)
ensures that predictions are made only for those observations used to obtain
the estimates. For example, if the estimation sample was restricted to
women, then the qualifier if e(sample) ensures prediction is only for
women. Similarly, if data are available for all regressors but are missing for 
some observations of the dependent variable, then this qualifier ensures 
prediction is only for observations with nonmissing values of the dependent 
variable. 
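
The effect of the qualifier can be seen in a minimal sketch. It uses Stata's built-in auto dataset rather than the medical expenditure data, and the variable names yhat_all and yhat_est are chosen here purely for illustration.

```stata
* Sketch only: built-in auto data stand in for the book's data
sysuse auto, clear
regress price mpg weight if foreign == 1   // fit on foreign cars only
predict yhat_all                           // fitted for every car with nonmissing regressors
predict yhat_est if e(sample)              // fitted only for cars used in estimation
summarize yhat_all yhat_est                // compare the Obs column of the two variables
```

The second variable should be missing for every observation excluded by the if condition at estimation time, which is what the Obs column of summarize reveals.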


The available statistics vary with the estimation command preceding the 
predict command. The predict command can provide 16 different 
statistics following regress. 


The default statistic xb is the prediction $\widehat{y}_i = \mathbf{x}_i'\widehat{\boldsymbol{\beta}}$. Statistic stdp gives
the standard error of $\widehat{y}_i$ as a prediction of $E(y_i|\mathbf{x}_i)$, and stdf gives the
standard error of $\widehat{y}_i$ as a forecast of $y_i$.


The statistic residuals (or equivalently, score) yields $y_i - \widehat{y}_i$, and stdr
gives the standard error of the residual, while rstandard yields standardized
residuals and rstudent yields studentized residuals. The statistics cooksd,
leverage, dfbeta(), covratio, dfits, and welsch are measures of
observation leverage and influence.


The statistic pr(a,b) yields the probability that $y_i$ lies in the range $(a,b)$,
while e(a,b) yields the expected value of $y_i$ if $y_i$ is restricted to lie in the
range $(a,b)$. The statistic ystar(a,b) yields the expected value of $y_i^*$, where
$y_i^* = a$ if $y_i < a$, $y_i^* = y_i$ if $a < y_i < b$, and $y_i^* = b$ if $y_i > b$.
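
Several of these statistics can be requested in turn after a single regression. The following is a sketch under stated assumptions: it uses the built-in auto dataset, the default VCE (stdf, stdr, and rstandard are not available after vce(robust)), and arbitrary illustrative bounds of 4000 and 8000; all new variable names are hypothetical.

```stata
* Sketch only: built-in auto data, default (nonrobust) VCE
sysuse auto, clear
regress price mpg weight
predict double yhat                        // default statistic xb
predict double se_mean, stdp               // s.e. of yhat as estimate of E(y|x)
predict double se_fcast, stdf              // s.e. of yhat as forecast of y
predict double uhat, residuals             // y - yhat
predict double se_resid, stdr              // s.e. of the residual
predict double ustd, rstandard             // standardized residuals
predict double lev, leverage               // diagonal of the hat matrix
predict double p_mid, pr(4000,8000)        // Pr(4000 < y < 8000)
predict double e_mid, e(4000,8000)         // E(y | 4000 < y < 8000)
predict double ys_mid, ystar(4000,8000)    // E(y*), y* censored at 4000 and 8000
summarize yhat se_mean se_fcast uhat se_resid ustd lev p_mid e_mid ys_mid
```

The last three statistics rely on normally distributed errors, so they are best read as model-based quantities rather than sample frequencies.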


4.2.2 In-sample prediction 


We use the same data as that used in chapter 3 and begin with the linear 
regression model in levels rather than logs. Because we will compare results 
from the linear model with those from the model in logs, we restrict analysis 
to only those observations with positive medical expenditures. 


OLS regression yields 


. * OLS in levels for positive medical expenditures
. qui use mus203mepsmedexp
. keep if totexp > 0
(109 observations deleted)
. regress totexp suppins phylim actlim totchr age female income, vce(robust)

Linear regression                               Number of obs     =      2,955
                                                F(7, 2947)        =      40.58
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1163
                                                Root MSE          =      11285

             |               Robust
      totexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     suppins |   724.8632   427.3045     1.70   0.090    -112.9824    1562.709
      phylim |   2389.019   544.3493     4.39   0.000     1321.675    3456.362
      actlim |   3900.491   705.2244     5.53   0.000     2517.708    5283.273
      totchr |   1844.377   186.8938     9.87   0.000     1477.921    2210.832
         age |  -85.36264   37.81868    -2.26   0.024    -159.5163   -11.20892
      female |   -1383.29   432.4759    -3.20   0.001    -2231.275   -535.3044
      income |    6.46894   8.570658     0.75   0.450    -10.33614    23.27402
       _cons |   8358.954   2847.802     2.94   0.003      2775.07    13942.84


We then predict the level of medical expenditures. Because the
dependent variable is always available, the qualifier if e(sample) is
redundant. We obtain


. * In-sample prediction following OLS in levels
. predict yhatlevels
(option xb assumed; fitted values)

. summarize totexp yhatlevels


Variable | Obs Mean Std. dev. Min Max 
totexp 2,955 7290.235 11990.84 3 125610 
yhatlevels 2,955 7290.235 4089.624 -236.3781 22559 


The summary statistics show that on average the predicted value yhatlevels 
equals the dependent variable. This suggests that the predictor does a good 
job. But this is misleading because this is always the case for in-sample 
prediction after OLS regression in a model with an intercept, because then 
residuals sum to zero, implying $\sum_i \widehat{y}_i = \sum_i y_i$. The standard deviation of
yhatlevels is $4,090, so there is some variation in the predicted values, 
though the variation is much less than that in the actual values. 
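
The equality of the two means can be sketched in one line. Writing $\widehat{u}_i = y_i - \widehat{y}_i$ for the OLS residual, the first-order condition for the intercept gives

$$\sum_{i=1}^N \widehat{u}_i = 0 \quad\Longrightarrow\quad \frac{1}{N}\sum_{i=1}^N \widehat{y}_i = \frac{1}{N}\sum_{i=1}^N y_i$$

so the match between 7290.235 and 7290.235 in the table is mechanical and says nothing about how well the model fits.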


A more discriminating comparison is, for example, comparison of the 
median predicted and median actual values. We have 


. * Compare median prediction and sample median actual value after OLS 
. tabstat totexp yhatlevels, stat(count p50) col(stat) 


Variable N p50 
totexp 2955 3334 
yhatlevels 2955 6464.692 


There is considerable difference between the two. This is most likely due to 
the right skewness of the original data; the predictions are also right skewed 
but do not fully capture the right skewness. 


4.2.3 Prediction in logs: The retransformation problem 


Transforming the dependent variable by taking the natural logarithm
complicates prediction. We can directly predict $E(\ln y|\mathbf{x})$, but we are instead
interested in $E(y|\mathbf{x})$ because we want to predict the level of medical
expenditures rather than the natural logarithm. The obvious procedure of
predicting $\ln y$ and taking the exponential is wrong because
$\exp\{E(\ln y)\} \neq E(y)$, just as, for example, $\sqrt{E(y^2)} \neq E(y)$.


The log-linear model $\ln y = \mathbf{x}'\boldsymbol{\beta} + u$ implies that $y = \exp(\mathbf{x}'\boldsymbol{\beta})\exp(u)$.
It follows that


$$E(y_i|\mathbf{x}_i) = \exp(\mathbf{x}_i'\boldsymbol{\beta})\, E\{\exp(u_i)\}$$


The simplest prediction is $\exp(\mathbf{x}_i'\widehat{\boldsymbol{\beta}})$, but this leads to retransformation bias
because it ignores the multiple $E\{\exp(u_i)\}$. If it is assumed that
$u_i \sim N(0, \sigma^2)$, then it can be shown that $E\{\exp(u_i)\} = \exp(0.5\sigma^2)$, which
can be estimated by $\exp(0.5\widehat{\sigma}^2)$, where $\widehat{\sigma}^2$ is an unbiased estimator of the
log-linear regression model error variance. A weaker assumption is to assume that $u_i$
is independent and identically distributed, in which case we can consistently
estimate $E\{\exp(u_i)\}$ by the sample average $N^{-1}\sum_{i=1}^N \exp(\widehat{u}_i)$; see
Duan (1983).


Applying these methods to the medical expenditure data yields 


. * In-sample prediction in levels from a logarithmic model
. qui regress ltotexp suppins phylim actlim totchr age female income, vce(robust)
. qui predict lyhat
. generate yhatwrong = exp(lyhat)
. generate yhatnormal = exp(lyhat)*exp(0.5*e(rmse)^2)
. qui predict uhat, residual
. generate expuhat = exp(uhat)
. qui summarize expuhat
. generate yhatduan = r(mean)*exp(lyhat)

. summarize totexp yhatwrong yhatnormal yhatduan yhatlevels


Variable Obs Mean Std. dev. Min Max 
totexp 2,955 7290.235 11990.84 3 125610 
yhatwrong 2,955 4004.453 3303.555 959.5991 37726.22 
yhatnormal 2,955 8249.927 6805.945 1976.955 77723.13 
yhatduan 2,955 8005.522 6604.318 1918.387 75420.57 
yhatlevels 2,955 7290.235 4089.624 -236.3781 22559 


Ignoring the retransformation bias leads to a very poor prediction because 
yhatwrong has a mean of $4,004 compared with the sample mean of $7,290. 
The two alternative methods yield much closer average values of $8,250 and 
$8,006. Furthermore, the predictions from log regression, compared with 
those in levels, have the desirable feature of always being positive and have 
greater variability. The standard deviation of yhatnormal, for example, is 
$6,806 compared with $4,090 from the levels model. 
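
A short worked step may help connect the formulas to the table. The normal-errors adjustment is the moment generating function of $u \sim N(0, \sigma^2)$ evaluated at $t = 1$ (the argument $t$ is notation introduced here):

$$E\{\exp(tu)\} = \exp(0.5\,t^2\sigma^2) \quad\Longrightarrow\quad E\{\exp(u)\} = \exp(0.5\,\sigma^2)$$

Because both corrections multiply $\exp(\mathbf{x}_i'\widehat{\boldsymbol{\beta}})$ by a constant, the implied adjustment factors can be read off the reported means:

$$\frac{8249.927}{4004.453} \approx 2.06 \ \text{(normal)}, \qquad \frac{8005.522}{4004.453} \approx 2.00 \ \text{(Duan)}$$

Ignoring retransformation therefore understates predicted expenditure by roughly a factor of two in these data.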


4.2.4 Prediction exercise with a binary regressor 


There are several different ways that predictions can be used to simulate the 
effects of a policy experiment. We consider the effect of a binary treatment, 
whether a person has supplementary insurance, on medical expenditure. This 
is an example of estimating an average treatment effect, the subject of 
chapters 24 and 25. 


We base our predictions on estimates that assume supplementary
insurance is exogenous. More realistically, supplementary insurance may be
endogenous where, as we discuss in section 7.3.1, a variable is endogenous
if it is related to the error term. In that case, one should use other methods
such as those presented in chapters 7 and 25. The simpler analysis here
assumes that supplementary insurance is not related to the error term.


Compare means of the dependent variable 


An obvious approach is to compare the difference in sample means
$(\bar{y}_1 - \bar{y}_0)$, where the subscript 1 denotes those with supplementary insurance
and the subscript 0 denotes those without supplementary insurance.


. * Effect of suppins: (1) compare actual means with suppins==1 and suppins==0
. bysort suppins: summarize totexp 


-> suppins = No 
Variable Obs Mean Std. dev. Min Max 


totexp 1,207 6824.303 11425.94 9 104823 


-> suppins = Yes 
Variable Obs Mean Std. dev. Min Max 


totexp 1,748 7611.963 12358.83 3 125610 


The average difference is $788 (from 7612 - 6824). This measure has the
major limitation that it does not control for individual characteristics. 


Compare means of predictions of the dependent variable 


A measure that does control for individual characteristics is obtained by
regression on suppins and additional control variables using all observations
and by computing the difference in mean predictions $(\overline{\widehat{y}}_1 - \overline{\widehat{y}}_0)$, where $\overline{\widehat{y}}_1$
and $\overline{\widehat{y}}_0$ denote, respectively, the average prediction for those individuals with
and without supplementary health insurance.


We implement this measure for the complete sample based on 
predictions obtained from the earlier OLS regressions in both levels and logs. 
We obtain 


. * Effect of suppins: (2) compare predicted means by suppins value 
. bysort suppins: summarize yhatlevels yhatduan 


-> suppins = No 


Variable Obs Mean Std. dev. Min Max 
yhatlevels 1,207 6824.303 4077.064 -236.3781 20131.43 
yhatduan 1,207 6745.959 5365.255 1918.387 54981.73 


-> suppins = Yes 


Variable Obs Mean Std. dev. Min Max 
yhatlevels 1,748 7611.963 4068.397 502.9237 22559 
yhatduan 1,748 8875.255 7212.993 2518.538 75420.57 


Using predictions from OLS in levels, we see the average difference is again 
$788 (from 7612 - 6824). This surprising equality of the difference in mean
fitted values and the difference in sample means is a consequence of OLS 
regression in levels with an indicator variable and prediction using the 
estimation sample. 
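
The reason for the equality can be sketched from the OLS first-order conditions. Let $d_i$ denote the suppins indicator and $\widehat{u}_i = y_i - \widehat{y}_i$ the residual (both symbols are introduced here for illustration). The conditions for the intercept and for $d_i$ are

$$\sum_{i=1}^N \widehat{u}_i = 0, \qquad \sum_{i=1}^N d_i\,\widehat{u}_i = 0$$

The second condition says the residuals sum to zero among those with $d_i = 1$, and subtracting it from the first gives the same result for $d_i = 0$:

$$\sum_{i:\,d_i = 1} \widehat{u}_i = 0, \qquad \sum_{i:\,d_i = 0} \widehat{u}_i = 0$$

Hence within each group the mean fitted value equals the mean of totexp, and the two differences in means must coincide.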


Using instead OLS in the log-linear model, with subsequent levels 
predictions that correct for retransformation bias using Duan’s method, gives 
an average difference of $2,129 (from 8875 - 6746), which differs from
$788. 


Compare predictions at different values of the key regressor variable 


A third measure is obtained by also regressing on suppins and additional
control variables using all observations. Then 1) predict y with suppins set
to 1 for all observations, and average these predictions; 2) predict y with
suppins set to 0 for all observations, and average these predictions; and
3) compute the difference in these averages. This method is presented in
more detail in section 24.4.


To implement this method, we need to make separate predictions for 
each individual with suppins set to 1 and with suppins set to 0. This 
requires changing the variable suppins, so we use preserve and restore 
(see section 2.5.2) to eventually return suppins to its original sample values. 


. * Effect of suppins: (3) predict with all set to suppins==0 versus all set to 1
. qui regress totexp suppins phylim actlim totchr age female income, vce(robust)
. preserve
. qui replace suppins = 1
. qui predict yhat1
. qui replace suppins = 0
. qui predict yhat0
. generate treateffect = yhat1 - yhat0
. summarize yhat1 yhat0 treateffect

    Variable |        Obs        Mean    Std. dev.       Min        Max
       yhat1 |      2,955    7586.313    4071.367   488.4851      22559
       yhat0 |      2,955     6861.45    4071.367  -236.3781   21834.14
 treateffect |      2,955    724.8632    .0002081   724.8623   724.8643

. restore


The estimate of $725 is close to the preceding linear model estimate of $788. 


More precisely, the estimate is 724.8632, which equals the coefficient of 
suppins in the original OLS regression; see section 4.2.2. This equality is due 
to fitting a linear model with the regressor of interest changing its value by 
one unit. 


This example includes control variables in the regression and then 
computes the average marginal effect (AME) of suppins using the finite- 
difference method; see section 3.5.9. It is simpler to use the margins 
command, which has the additional benefit of giving a standard error of the 
average marginal treatment effect. 


. * Effect of suppins model: (3b) same as previous but easier and gives se 
. qui regress totexp i.suppins phylim actlim totchr age female income, vce(robust) 


. margins, dydx(suppins) nofvlabel 


Average marginal effects Number of obs = 2,955 
Model VCE: Robust 


Expression: Linear prediction, predict() 
dy/dx wrt: 1.suppins 


                          Delta-method
             |      dy/dx   std. err.      t    P>|t|     [95% conf. interval]

   1.suppins |   724.8632   427.3045     1.70   0.090    -112.9824    1562.709


Note: dy/dx for factor levels is the discrete change from the base level. 


The margins command is presented at considerable length in sections 4.4
and 4.5.
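
The two averages behind the finite-difference calculation can also be obtained from margins directly. The following is a sketch, assuming the dataset mus203mepsmedexp used above is in the working directory; it is one possible way to reproduce the preserve and restore steps, not necessarily the approach the later sections take.

```stata
* Sketch only: predictive margins with suppins set to 0 and then to 1 for everyone
use mus203mepsmedexp, clear
keep if totexp > 0
regress totexp i.suppins phylim actlim totchr age female income, vce(robust)
margins suppins        // average prediction at suppins = 0 and at suppins = 1
margins r.suppins      // contrast of the two margins relative to the base level
```

Because suppins enters as a factor variable, margins knows to set it to each level in turn while leaving the other regressors at their sample values.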


Compare predictions from different models for each value of the key regressor variable 


A refinement of the preceding method is to use two separate OLS regressions, 
one for each value of suppins, obtain predictions for all observations from 
each of these models, and compare the average predictions. We have 


. * Effect of suppins model: (4) different fitted models for suppins =0 and =1
. qui regress totexp phylim actlim totchr age female income if suppins==0
. qui predict yhatmodel0
. qui regress totexp phylim actlim totchr age female income if suppins==1
. qui predict yhatmodel1
. generate treateffectra = yhatmodel1 - yhatmodel0
. summarize yhatmodel1 yhatmodel0 treateffectra

    Variable |        Obs        Mean    Std. dev.       Min        Max
  yhatmodel1 |      2,955     7605.52    4087.529   579.9417    22436.2
  yhatmodel0 |      2,955    6935.549    4082.734  -428.2472   22108.39
treateffec~a |      2,955     669.971    645.3849   -5050.99   2361.274


The average treatment effect is now $670. 


This method, called regression adjustment, is presented in section 24.6.1. 
The teffects ra command gives the same estimate of $670 and additionally 
provides a standard error of the estimate. 
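
A minimal sketch of the corresponding teffects ra call follows, assuming the same dataset and sample restriction as above. The outcome model is listed in the first set of parentheses and the treatment variable in the second; the default estimand is the average treatment effect.

```stata
* Sketch only: regression adjustment with separate outcome models by suppins
use mus203mepsmedexp, clear
keep if totexp > 0
teffects ra (totexp phylim actlim totchr age female income) (suppins)
```

Unlike the manual two-regression calculation, the reported standard error accounts for the estimation of both outcome models.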


4.2.5 Forecast actual value versus prediction of conditional mean 


Predictions given a regressor value $\mathbf{x}_p$ can be used for two very distinct
purposes.


First, we may wish to predict the conditional mean $E(y|\mathbf{x}_p) = \mathbf{x}_p'\boldsymbol{\beta}$. For
example, on average, what level of health expenditure do we expect for
individuals with characteristics $\mathbf{x}_p$? Then we use the fitted value $\widehat{y}_p = \mathbf{x}_p'\widehat{\boldsymbol{\beta}}$,
which is a noisy estimate because $\widehat{\boldsymbol{\beta}}$ is a noisy estimate of $\boldsymbol{\beta}$.


Second, we may wish to instead predict the actual value of $y$ when
$\mathbf{x} = \mathbf{x}_p$. This type of prediction is called a forecast. For example, what exact
level of health expenditure do we expect for a single individual with
characteristics $\mathbf{x}_p$? This is more challenging because then we wish to predict
$y_p = \mathbf{x}_p'\boldsymbol{\beta} + u_p$. Because $u_p$ is pure noise, our best estimate is its mean of
zero, so we forecast $y_p$ using $\mathbf{x}_p'\widehat{\boldsymbol{\beta}} + 0$. We therefore again use $\widehat{y}_p = \mathbf{x}_p'\widehat{\boldsymbol{\beta}}$, but there
is now much more noise because of the additional need to predict $u_p$. In fact,
the variance of the forecast will be at least the variance of the error $u_p$,
regardless of how precisely $\boldsymbol{\beta}$ is estimated.


The standard error of the prediction used as an estimate of the
conditional mean can be obtained using the stdp option of the predict
command. Obtaining the standard error of the prediction used as a forecast is
more problematic because it requires an estimate of $\mathrm{Var}(u_p)$. The stdf
option of the predict command provides the standard error of a forecast
only if default standard errors are used, because then given
homoskedasticity, the standard deviation of $u_p$ can be estimated using the
standard error of the regression.


We therefore reestimate without the vce(robust) option and use
predict to obtain


. * Compute standard errors of prediction and forecast with default VCE 
. qui regress totexp suppins phylim actlim totchr age female income 


. display "Estimated standard deviation of the error: s = " e(rmse) 
Estimated standard deviation of the error: s = 11285.257 


. predict yhatstdp, stdp 
. predict yhatstdf, stdf 
. summarize yhatlevels yhatstdp yhatstdf


Variable Obs Mean Std. dev. Min Max 
yhatlevels 2,955 7290.235 4089.624 -236.3781 22559 
yhatstdp 2,955 572.7 129.6575 393.5964 2813.983 
yhatstdf 2,955 11300.52 10.50946 11292.12 11630.8 


The first quantity, yhatstdp, views $\mathbf{x}_i'\widehat{\boldsymbol{\beta}}$ as an estimate of the conditional
mean $\mathbf{x}_i'\boldsymbol{\beta}$ and is quite precisely estimated; the prediction has average
standard deviation of $573, which is small compared with the average
prediction of $7,290.


The second quantity, yhatstdf, views $\mathbf{x}_i'\widehat{\boldsymbol{\beta}}$ as an estimate of the actual
value $y_i$ and is very imprecisely estimated with average standard deviation
equal to $11,301. This is because the error $u_i$ here has large estimated
standard deviation of $s = 11285$ in the fitted OLS model. The standard
deviation of the forecast is necessarily at least 11285. For further details and
formulas, see section 4.3.
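
The link between the two standard errors can be checked numerically: under the default VCE the squared forecast standard error equals the squared prediction standard error plus $s^2$. The sketch below uses the built-in auto dataset and hypothetical variable names, so it illustrates the identity rather than reproducing the numbers above.

```stata
* Sketch only: verify stdf^2 = stdp^2 + s^2 after regress with default VCE
sysuse auto, clear
regress price mpg weight
predict double se_mean, stdp
predict double se_fcast, stdf
generate double se_check = sqrt(se_mean^2 + e(rmse)^2)
assert reldif(se_fcast, se_check) < 1e-6   // stops with an error if the identity fails
summarize se_mean se_fcast se_check
```

This is why the forecast standard error can never fall below the root mean squared error of the regression, however large the sample.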


More generally, microeconometric models can predict poorly for a given 
individual, as evidenced by the typically low values of $R^2$ obtained from
regression on cross-sectional data. These same models may nonetheless 
predict the conditional mean well, and it is this latter quantity that is needed 
for policy analysis that focuses on average behavior. 


4.3 Out-of-sample prediction 


We now consider out-of-sample prediction. While we focus on the precision 
of predictions, note that out-of-sample predictions become more 
questionable as regressor values become further away from those used to fit 
the model. Accuracy of out-of-sample predictions is often a basis of a 
demanding test of model specification but often may not provide a clear 
indication of the deficient feature of a potentially flawed model. 


4.3.1 Out-of-sample predictions 


Out-of-sample predictions are useful for generating predictions per se but 
also for testing the fitted model against a subsample that was not used to fit 
the model. Some tests of model misspecification are based on out-of-sample 
prediction errors. The predict command can generate both in-sample and 
out-of-sample predictions. 


Let $\mathbf{X}_p$ denote the $q \times K$ matrix for a set of $q$ observations that were not
used in estimation and held back to generate predictions outside the
estimation sample. We suppose the dependent variable is generated by the
same model as that fitted, so the $q$ out-of-sample $y$'s are generated by
$\mathbf{y}_p = \mathbf{X}_p\boldsymbol{\beta} + \mathbf{u}_p$. Hence, we predict the conditional mean and forecast the
actual value using $\widehat{\mathbf{y}}_p = \mathbf{X}_p\widehat{\boldsymbol{\beta}}$. Then the vector of $q$ prediction errors is
$\widehat{\mathbf{u}}_p = \widehat{\mathbf{y}}_p - \mathbf{y}_p$. These prediction errors can be used to evaluate the fitted
regression in a number of ways.


For the prediction exercise to be meaningful, the matrix $\mathbf{X}_p$ should be
“similar to” the one used in estimation; that is, its range of variation in the
prediction sample should be similar to that of the estimation sample.
Otherwise, the estimates from the estimation sample may be inappropriate for the
prediction sample, and the divergence between regressors in the estimation
and prediction samples will contribute significantly to the prediction error.
Prediction-based tests may be used to test for sample homogeneity.


Predictions of the conditional mean have variance matrix


$$\mathrm{Var}\{\widehat{E}(\mathbf{y}_p|\mathbf{X}_p)\} = \mathrm{Var}(\mathbf{X}_p\widehat{\boldsymbol{\beta}}) = \mathbf{X}_p \mathrm{Var}(\widehat{\boldsymbol{\beta}}) \mathbf{X}_p'$$


This formula is used for calculating hypothesis tests and confidence intervals
for the point-predicted conditional means. It can be used with default
standard errors or with robust standard errors. Note that as the sample size
gets larger, $\mathrm{Var}(\widehat{\boldsymbol{\beta}})$ gets smaller, so the precision of the prediction improves.


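Forecasts of the actual value additionally need to predict the error term
because $\mathbf{y}_p = \mathbf{X}_p\boldsymbol{\beta} + \mathbf{u}_p$. Our best estimate of $\mathbf{u}_p$ is $\mathbf{0}$. Because the error is
independent of the regressors, it follows that forecasts of the actual values
have variance matrix


$$\mathrm{Var}(\widehat{\mathbf{y}}_p|\mathbf{X}_p) = \mathrm{Var}(\mathbf{X}_p\widehat{\boldsymbol{\beta}} + \mathbf{u}_p) = \mathbf{X}_p \mathrm{Var}(\widehat{\boldsymbol{\beta}}) \mathbf{X}_p' + \mathrm{Var}(\mathbf{u}_p)$$


This is more problematic to estimate because it depends on $\mathrm{Var}(\mathbf{u}_p)$.
Furthermore, note that even in large samples, there can be considerable
forecast error because $\mathrm{Var}(\widehat{\mathbf{y}}_p) > \mathrm{Var}(\mathbf{u}_p)$ always.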


If the errors are independent and identically distributed (i.i.d.), then
$\mathrm{Var}(\mathbf{u}_p) = \sigma^2\mathbf{I}$ and $\mathrm{Var}(\widehat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$, so the variance of the forecast
simplifies to $\sigma^2\{\mathbf{I} + \mathbf{X}_p(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_p'\}$. Then, in the case of a single
out-of-sample observation, a large-sample 95% confidence interval for a forecast of
the actual value of the dependent variable when $\mathbf{x} = \mathbf{x}_p$ is


$$\widehat{y}_p \pm 1.96\sqrt{s^2\{1 + \mathbf{x}_p'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_p\}}$$


where $s^2$ is the sample estimate of $\sigma^2$. The corresponding large-sample 95%
confidence interval for the prediction of the conditional mean when $\mathbf{x} = \mathbf{x}_p$
is


$$\widehat{y}_p \pm 1.96\sqrt{s^2\,\mathbf{x}_p'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_p}$$


The point prediction and the estimated variance $s^2$ may both be sensitive to
the presence of outliers generated by a mechanism different from the one
that prevailed in the sample period.
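
Both intervals can be built from the stdp and stdf statistics of predict, which supply the two square-root terms. The sketch below is illustrative only: it uses the built-in auto dataset, treats each observation in turn as the evaluation point, and all new variable and scalar names are hypothetical.

```stata
* Sketch only: large-sample 95% intervals for the conditional mean and the forecast
sysuse auto, clear
regress price mpg weight                  // default VCE, so stdf is available
predict double yhat, xb
predict double se_mean, stdp              // sqrt of s^2 * x'(X'X)^-1 x
predict double se_fcast, stdf             // sqrt of s^2 * {1 + x'(X'X)^-1 x}
scalar zcrit = invnormal(0.975)           // approximately 1.96
generate double mean_lo  = yhat - zcrit*se_mean
generate double mean_hi  = yhat + zcrit*se_mean
generate double fcast_lo = yhat - zcrit*se_fcast
generate double fcast_hi = yhat + zcrit*se_fcast
list price yhat mean_lo mean_hi fcast_lo fcast_hi in 1/5, clean
```

The same commands apply to held-back observations, because predict computes these statistics for any observation with nonmissing regressors.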


The preceding considerations motivate the use of diagnostic procedures 
for testing homoskedasticity and normality of residuals mentioned in a 
previous subsection. The normal distribution is symmetric, and hence its 
third moment is 0. The standard normal distribution has fourth moment 
equal to 3. Hence, departures from normality are indicated by a skewness 
coefficient significantly different from 0 and a nonnormal kurtosis 
coefficient significantly greater than 3; for an example, see section 3.7.5. 


4.3.2 Out-of-sample prediction example 


The example given here is based on generated data with 100 observations. 
The model is fit to the subsample of the first 90 observations, and the 
remaining 10 are used for prediction. We show how out-of-sample prediction 
is implemented. 


In the first part, the 100 observations are generated using the regression 
y = 1 + x + u. The correctly specified regression is fit to the first 90 
observations, and 10 out-of-sample predictions are generated. For 
comparison, the regression is also fit to the full sample, and the 
corresponding 10 in-sample predictions are also generated. The prediction 
errors as well as fitted values are summarized. In this example, the data- 
generating process (DGP) satisfies the underlying assumptions, so the 
prediction errors reflect the estimation error and the intrinsic randomness 
due to the error term. 


. * Out-of-sample OLS predictions using correctly specified model
. clear all
. qui set obs 100
. set seed 10101
. generate u = 3*rnormal()
. generate x = 10 + 4*rnormal()
. generate y = 1 + x + u
. qui regress y x if _n < 91
. qui predict yhatos if _n > 90 // Predict 10 out-of-sample observations
. qui regress y x
. qui predict yhatis if _n > 90 // Predict 10 in-sample observations
. qui generate ferroros = y - yhatos

. summarize y yhatis yhatos ferroros if _n > 90


Variable Obs Mean Std. dev. Min Max 
y 10 10.96446 2.697358 6.751842 16.415 

yhatis 10 12.57903 3.894589 3.96609 19.10511 
yhatos 10 12.85367 4.134129 3.710977 19.78113 
ferroros 10 -1.889211 3.188283 -6.584855 4.737788 


The 10 out-of-sample predictions have mean and standard deviation of 
(12.854, 4.134) compared with the corresponding 10 in-sample predictions 
using all 100 observations of (12.579, 3.895). The sample mean of the 
prediction error (ferroros) is expected to approach zero as the sample size 
gets larger, in the current example with correctly specified model. 


We continue with data generated by the DGP y = 1 + x + z + u but 
generate predictions from estimation of a misspecified model that omits 
variable z, that is, assuming y = 1 + x + u. By construction, the omitted 
variable is uncorrelated with x, so the OLS regression coefficients should be 
unbiased, but the predictions will be biased, and the prediction errors will no 
longer have average close to zero in large samples. 


. * Out-of-sample prediction under misspecification (omitted regressor)
. generate z = 10 + 4*rnormal()
. generate ynew = 1 + x + z + u
. qui regress ynew x if _n < 91 // Regression with variable z omitted
. qui predict ynewhatos if _n > 90 // Predict 10 out-of-sample observations
. qui regress ynew x
. qui predict ynewhatis if _n > 90 // Predict 10 in-sample observations
. qui generate ferrornewos = ynew - ynewhatos if _n > 90
. list ferrornewos in 91/100, clean

       ferrorn~s
 91.     4.19217
 92.    1.800138
 93.   -3.714031
 94.    4.681175
 95.   -5.369816
 96.   -2.724915
 97.   -7.702372
 98.   -13.30714
 99.   -8.065577
100.   -9.890109

. summarize ynew ynewhatis ynewhatos ferrornewos if _n > 90

    Variable |        Obs        Mean    Std. dev.       Min        Max
        ynew |         10    18.85934    5.791399   10.47607   28.13668
   ynewhatis |         10    22.35754    3.922385   13.68313   28.93019
   ynewhatos |         10    22.36938    4.232142   13.50993   29.96109
 ferrornewos |         10   -4.010048    6.066775  -13.30714   4.681175


The estimated mean prediction error, computed as actual minus predicted, is
-4.010, indicating that the predicted value is on average an overestimate of
the realized value. The range of the prediction error is (-13.307, 4.681),
which is also wider than in the correct specification case. Of course, this
reflects the impact of z, and it would be smaller or larger depending on its
importance relative to x.


Prediction errors are an obvious metric for measuring the performance of
a regression model. Moreover, because prediction errors are expected to be
i.i.d. under correct specification of the DGP, they can be used for checking
model specification. For example, given i.i.d. model errors, the chi-squared
distributed statistic $\widehat{\mathbf{u}}_p'[\sigma^2\{\mathbf{I} + \mathbf{X}_p(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_p'\}]^{-1}\widehat{\mathbf{u}}_p$ provides a basis for a
formal test, although graphical displays are potentially also helpful in
indicating weaknesses in the model specification.


The use of out-of-sample prediction for model selection, notably k-fold 
cross-validation, is detailed in section 28.2. 


4.4 Predictive margins 


Following linear regression, interest may lie in evaluating the conditional 
mean at different values of the regressors. For example, following regression 
that includes many control variables, we may wish to predict medical 
expenditure separately for men and for women, or we may wish to predict at 
several different ages. These are examples of predictive means (PMs), which 
will vary with the value assigned to the regressors. 


More generally, quantities other than the conditional mean may be 
studied, such as conditional probabilities following logistic regression. Lane 
and Nelder (1982) introduced the term predictive margins to cover 
postestimation prediction of a quantity of interest at various values of the 
regressors. 


4.4.1 Predictive margins 


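Following linear regression, a PM can be evaluated at a single point $\mathbf{x} = \mathbf{x}^*$,
in which case $\mathrm{PM} = \mathbf{x}^{*\prime}\widehat{\boldsymbol{\beta}}$, or at different regressor values for different
individuals and averaged. When evaluated at values $\mathbf{x}_i = \mathbf{x}_i^*$, $i = 1, \dots, N$,
the PM is


$$\mathrm{PM} = \frac{1}{N}\sum_{i=1}^N \mathbf{x}_i^{*\prime}\widehat{\boldsymbol{\beta}}$$


In general, predictive margins of $g(\mathbf{x}, \boldsymbol{\beta})$ are computed as
$(1/N)\sum_{i=1}^N g(\mathbf{x}_i^*, \widehat{\boldsymbol{\beta}})$.


There are many ways to specify the evaluation points $\mathbf{x}_i^*$. Leading cases
are 1) compute PM at sample values $\mathbf{x}_i$, $i = 1, \dots, N$, and average;
2) compute at the sample mean $\bar{\mathbf{x}}$, in which case $\mathrm{PM} = \bar{\mathbf{x}}'\widehat{\boldsymbol{\beta}}$; or 3) compute at
a specified value $\mathbf{x}^*$, in which case $\mathrm{PM} = \mathbf{x}^{*\prime}\widehat{\boldsymbol{\beta}}$.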
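
The three leading cases map onto three forms of the margins command. The following sketch uses the built-in auto dataset, and the evaluation point mpg = 20, weight = 3000 is an arbitrary value chosen for illustration.

```stata
* Sketch only: three ways to choose the evaluation points for a predictive margin
sysuse auto, clear
regress price mpg weight
margins                            // (1) average of predictions at sample values
margins, atmeans                   // (2) prediction at the sample means of regressors
margins, at(mpg=20 weight=3000)    // (3) prediction at a specified point
```

For a model that is linear in the regressors, cases (1) and (2) give the same point estimate, because the average of $\mathbf{x}_i'\widehat{\boldsymbol{\beta}}$ equals $\bar{\mathbf{x}}'\widehat{\boldsymbol{\beta}}$; the distinction matters for nonlinear models.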


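In practice, for models with multiple regressors, combinations of these
methods are often used. Consider predictive margins for various specified
values of $\mathbf{z}$, often a scalar variable but possibly more than one variable,
where $\mathbf{z}$ is a subcomponent of the regressors in the fitted model.


Let $\mathbf{z}_k^*$ denote the $k$th specified value of $\mathbf{z}$, in which case


$$(\widehat{y}_i | \mathbf{z} = \mathbf{z}_k^*, \mathbf{w}_i = \mathbf{w}_i^*) = \mathbf{z}_k^{*\prime}\widehat{\boldsymbol{\alpha}} + \mathbf{w}_i^{*\prime}\widehat{\boldsymbol{\gamma}}$$


Then $\mathbf{x}_i^{*\prime} = [\mathbf{z}_k^{*\prime}\ \mathbf{w}_i^{*\prime}]$, and there are many ways to specify the evaluation
points $\mathbf{x}_i^*$ because there are many ways to specify values for the remaining
variables $\mathbf{w}_i^*$.


First, we may set the remaining regressors at their sample values, so
$\mathbf{w}_i^* = \mathbf{w}_i$. Second, we may set the remaining regressors at a common
specified value $\mathbf{w}^*$, so $\mathbf{w}_i^* = \mathbf{w}^*$. Third, we may set the remaining regressors
at their sample mean values, so $\mathbf{w}_i^* = \bar{\mathbf{w}}$. Finally, any combination of these
is possible. For example, letting $\mathbf{w}_i'\boldsymbol{\gamma} = \mathbf{w}_{1i}'\boldsymbol{\gamma}_1 + \mathbf{w}_{2i}'\boldsymbol{\gamma}_2 + \mathbf{w}_{3i}'\boldsymbol{\gamma}_3$, we might
evaluate at $\mathbf{w}_i^{*\prime} = [\mathbf{w}_{1i}'\ \mathbf{w}_2^{*\prime}\ \bar{\mathbf{w}}_3']$. Then some of the remaining regressors are
set to sample values, some to specified values, and some to sample means.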
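
A mixed specification of this kind can be written in a single margins call. In the sketch below, which uses the built-in auto dataset, foreign plays the role of $\mathbf{z}$, length is left at its sample values, mpg is fixed at an arbitrary illustrative value of 20, and weight is set to its sample mean.

```stata
* Sketch only: margins at each level of foreign, mixing the three treatments of w
sysuse auto, clear
regress price i.foreign mpg weight length
margins foreign, at((mean) weight mpg=20)   // length stays at sample values
```

Regressors not named in at() are left at their observed values and averaged over, so the output reports one predictive margin per level of foreign.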


In principle, predictive margins can be computed using one or more Stata
commands, usually generate and predict commands. Standard errors
of these predictive margins can then be obtained using an appropriate bootstrap.


Much simpler is to directly use the powerful margins command. This 
command computes predictive margins and their associated standard errors, 
t statistics, and confidence intervals. It additionally can be used to compute 
MEs, and the marginsplot command presents results graphically. The 
discussion is brief compared with the more than 100 pages of documentation 
given in [R] margins. 


4.4.2 The margins command 


The margins command after essentially any estimation command, including 
regress, has syntax 


margins [marginlist] [if] [in] [weight] [, response_options options]


Here marginlist is a list of factor-indicator variables and any associated
interactions defined using factor-variable operators. If marginlist is left
blank, then the overall margin is reported.


The default response_option for the margins command is to use the 
quantity that is the default for the postestimation command predict. In 
general, the response_option can be specified to be one of the other 
quantities calculated by the postestimation command predict, or a function 
thereof. In general, however, not all the predict options are available for 
margins, and the only available predict option for margins following 
regress is the default xb. Other response_options compute MEs (option
dydx()) and elasticities; these are presented in the subsequent section.


Key options are those that define the regressor values at which the 
margins are estimated. The at() option estimates margins at specified values 
of regressors. The atmeans option evaluates margins at the means of the 
regressors. For regressors not appearing in the at() option, when atmeans is 
not specified, the margins are estimated at sample values, and the predictive 
margin is computed by averaging over the estimation sample. Thus the 
sample average prediction is obtained if atmeans is not specified and some 
regressors are not included in the at() option, whereas the prediction at the 
sample mean is obtained if only the atmeans option is used. 
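

A minimal sketch of the three evaluation choices, assuming the auto dataset shipped with Stata as a stand-in for the book's data (the matrix and scalar names are hypothetical). A quadratic in weight is included so that the prediction at the means differs from the averaged prediction.

```stata
* Sketch only: auto.dta (shipped with Stata) stands in for the book's data
sysuse auto, clear
quietly regress price c.weight##c.weight i.foreign

* 1) No at() or atmeans: average of the N fitted values
quietly margins
matrix B_avg = r(b)

* 2) atmeans: one prediction with every regressor set to its sample mean
quietly margins, atmeans
matrix B_mean = r(b)

* 3) at(): weight fixed at 3,000; foreign kept at sample values, then averaged
quietly margins, at(weight=3000)
matrix B_at = r(b)

quietly summarize price, meanonly
scalar ybar = r(mean)
display "mean of price:         " ybar
display "average prediction:    " B_avg[1,1]
display "prediction at means:   " B_mean[1,1]
display "margin at weight=3000: " B_at[1,1]
```

Because this is OLS with an intercept, the first two displayed numbers coincide, while the prediction at the means does not.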


Other options include over(varlist) to estimate margins at unique values 
of varlist and subpop() to estimate margins for subpopulations. 


The output from the margins command includes standard errors and 
associated statistics for inference. The default standard errors are obtained 
by applying the delta method to the coefficient estimate standard errors 
computed from the preceding estimation command. 


The default standard errors of margins estimates treat the regressors as 
fixed. This is fine if the value of every regressor is specified using the at() 
or atmeans option or if sample values are used and inference is on the 
in-sample predictive margin. But if a population predictive margin is desired 
and sample values of regressors are used in computing the predictive margin, 
then the vce(unconditional) option additionally controls for variation in 
the regressors due to sampling; see section 13.7.9. 


If population-average margins are desired and population weights are 
provided, then one needs to compute weighted margins. If weight was used 
in the preceding estimation command, then weighted margins are 
automatically given. If the preceding estimation command did not use 
weights, as is commonly the case in microeconometrics, then one needs to 
subsequently add weight to the margins command if weighted margins are 
desired. 


The noesample option allows predictive margins to be computed for 
samples other than the estimation sample. 


The margins, contrast command performs contrasts of margins. The 
margins, pwcompare command performs pairwise comparison of margins. 
The marginsplot command graphs the result of the preceding margins 
command. The margins, dydx() command computes MEs. 


4.4.3 Predictive margins examples 


The following examples consider only OLS estimation, with factor variables 
introduced to allow analysis in the presence of a quadratic in age and an 
interaction of gender and income. Then the linear model includes 
explanatory variables that enter nonlinearly. The examples consider only 
predicted conditional means and changes in predicted conditional means as 
regressors change. But these examples extend immediately to other 
estimators, such as logit, and to predictions of other quantities, such as 
predicted probabilities. 


Consider some simple examples of using the margins command following 
the OLS regression command regress y x i.z. Recall that the default for 
the predict command following the regress command yields the OLS 
prediction \(\hat{y}\). 


The margins command gives the sample average prediction 
\(\bar{\hat{y}} = (1/N)\sum_{i=1}^{N}\hat{y}_i\). 


The margins z command computes the sample average prediction at 
each distinct value taken by the categorical variable z. In the simplest case 
where the factor variable i.z is a binary regressor taking values 0 and 1, the 
fitted model is \(\hat{y} = \hat{\beta}_1 + \hat{\beta}_2 x + \hat{\beta}_3 z\). For z = 0, the predictive margin is 
\((1/N)\sum_{i=1}^{N}(\hat{\beta}_1 + \hat{\beta}_2 x_i)\), and for z = 1, the predictive margin is 
\((1/N)\sum_{i=1}^{N}(\hat{\beta}_1 + \hat{\beta}_2 x_i + \hat{\beta}_3)\). 


For the regressor x, the margins x command will fail because variable x 
was not declared to be a categorical factor variable. The same problem arises if 
we had used command regress y c.x i.z. Instead, we need to use the 
at() option to compute the predictive margin at various specified values of 
x. For example, the command margins, at(x=(30(10)50)) computes 
\((1/N)\sum_{i=1}^{N}(\hat{\beta}_1 + \hat{\beta}_2 x^* + \hat{\beta}_3 z_i)\) at, respectively, \(x^* = 30\), 40, and 50. 
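

The failure and the at() workaround can be sketched as follows, assuming the auto dataset shipped with Stata as a stand-in (weight plays the role of x and foreign the role of z; rc_fail and PMw are hypothetical names).

```stata
* Sketch only: auto.dta (shipped with Stata) stands in for the book's data
sysuse auto, clear
quietly regress price c.weight i.foreign

* weight was entered as continuous, so it is not a valid marginlist entry
capture noisily margins weight
scalar rc_fail = _rc
display "return code from margins weight: " rc_fail

* Instead fix weight at 2,000, 3,000, and 4,000 with at(), averaging over foreign
margins, at(weight=(2000(1000)4000))
matrix PMw = r(b)
```

The capture prefix is used only so that the sketch keeps running after the expected error.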


4.4.4 Predictive margins for a categorical factor variable 


We consider regression of totexp on various regressors, including a 
quadratic in age and interaction of female and income. The model is the 
same as that in section 4.2, except that now the dependent variable totexp is 
analyzed using a sample that includes the 109 observations with totexp = 0. 
For completeness, all regressors are defined using factor-variable operators, 
though for the following analysis, this is unnecessary for many of the 
regressors. 


OLS regression yields the following estimated coefficients: 


. * Factor variables used to define model with interactions and quadratic 
. qui use mus203mepsmedexp, clear 

. regress totexp i.suppins i.phylim i.actlim c.totchr c.age##c.age 
>     i.female##c.income, vce(robust) noheader nofvlabel 

                 |               Robust
          totexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-----------------+----------------------------------------------------------------
       1.suppins |    854.581   413.6253     2.07   0.039     43.56894    1665.593
        1.phylim |   2432.789   536.4219     4.54   0.000     1381.004    3484.573
        1.actlim |   3779.804   690.7463     5.47   0.000     2425.429    5134.178
          totchr |   1921.016   180.5999    10.64   0.000     1566.906    2275.125
             age |   1342.275    763.421     1.76   0.079    -154.5957    2839.146
     c.age#c.age |  -9.452011   5.041687    -1.87   0.061    -19.33745     .433432
        1.female |  -1573.508   573.3188    -2.74   0.006    -2697.638   -449.3781
          income |   2.090179   11.27985     0.19   0.853    -20.02669    24.20705
                 |
 female#c.income |
               1 |   15.42391   15.81709     0.98   0.330    -15.58931    46.43713
                 |
           _cons |   -45432.3   28774.04    -1.58   0.114    -101850.7    10986.15


The stand-alone margins command gives the average prediction of 
totexp in the estimation sample. 


. * Predictive margin using sample average 
. margins 

Predictive margins                                       Number of obs = 3,064
Model VCE: Robust 

Expression: Linear prediction, predict() 

             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |   7030.889   200.6571    35.04   0.000     6637.453    7424.326


For OLS regression with an intercept, the residuals sum to 0, so \(\bar{\hat{y}} = \bar{y}\), and the 
average predicted medical expenditure of $7,031 equals the average of 
totexp in the estimation sample. More generally, for most estimation 
methods, \(\bar{\hat{y}} \neq \bar{y}\). 


If we wish to use this as an estimate for the population, we should use 
the option vce(unconditional) to account for variability in the sample 
regressors. We have 


. * Predictive margin using sample average with population-average inference 
. margins, vce(unconditional) 

Predictive margins                                       Number of obs = 3,064

Expression: Linear prediction, predict() 

             |            Unconditional
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |   7030.889   214.4439    32.79   0.000      6610.42    7451.358


The standard error has increased from 200.7 to 214.4, a 7% increase. 
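

The comparison can be sketched on other data as follows. The sketch assumes the auto dataset shipped with Stata as a stand-in, and the matrix names are hypothetical. As in the example above, the model is fit with vce(robust) before vce(unconditional) is requested.

```stata
* Sketch only: auto.dta (shipped with Stata) stands in for the book's data
sysuse auto, clear
quietly regress price c.weight i.foreign, vce(robust)

* Default: regressors treated as fixed (delta-method standard error)
quietly margins
matrix bc = r(b)
matrix Vc = r(V)

* vce(unconditional): also accounts for sampling variation in the regressors
quietly margins, vce(unconditional)
matrix bu = r(b)
matrix Vu = r(V)

display "margin:            " bc[1,1] "  " bu[1,1]
display "conditional se:    " sqrt(Vc[1,1])
display "unconditional se:  " sqrt(Vu[1,1])

* Percentage increase in the standard error reported in the text
display 100*(214.4439/200.6571 - 1)
```

Only the standard error changes between the two calls; the point estimate of the margin is the same.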


We next compute the estimation sample average prediction of totexp for 
the different values taken by the categorical variable female. 


. * Predictive margins by categorical regressor gender using sample averages 
. margins female, nofvlabel 

Predictive margins                                       Number of obs = 3,064
Model VCE: Robust 

Expression: Linear prediction, predict() 

             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |
          0  |   7764.118   337.7036    22.99   0.000     7101.968    8426.267
          1  |   6537.258    245.379    26.64   0.000     6056.133    7018.383


The average predicted medical expenditure for this sample of older people is 
$7,764 for men and $6,537 for women, so it is higher for men. This 
computation allows for variable female entering directly and through the 
interaction term with income. Because female enters interactively, the 
predicted gender difference of 7764 - 6537 = 1227 differs from the raw data gender 
difference of 7452 - 6725 = 727. 


The two 95% confidence intervals are nonoverlapping, which suggests, 
though does not guarantee, that the gender difference is statistically 
significant at level 0.05. A formal test is possible using the margins, 
pwcompare or margins, contrast(ci) command. This test, given in 
section 4.4.8, finds a statistically significant difference at level 0.05. 


In the next example, the predictive margins by gender are computed at 
sample mean values of the remaining regressors, using the atmeans option. 


. * Predictive margins by categorical regressor gender using sample means 
. margins female, atmeans nofvlabel 

Adjusted predictions                                     Number of obs = 3,064
Model VCE: Robust 

Expression: Linear prediction, predict() 
At: 0.suppins = .4187337 (mean) 
    1.suppins = .5812663 (mean) 
    0.phylim  = .5744125 (mean) 
    1.phylim  = .4255875 (mean) 
    0.actlim  = .7163838 (mean) 
    1.actlim  = .2836162 (mean) 
    totchr    = 1.754243 (mean) 
    age       = 74.17167 (mean) 
    0.female  = .4203655 (mean) 
    1.female  = .5796345 (mean) 
    income    = 22.47472 (mean) 

             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |
          0  |    8147.88   391.5846    20.81   0.000     7380.084    8915.676
          1  |    6921.02   313.0471    22.11   0.000     6307.216    7534.824


Now the predicted medical expenditure is $8,148 for men and $6,921 for 
women, compared with, respectively, $7,764 and $6,537 in the preceding 
example, which used sample values of variables other than female and then 
averaged over the estimation sample. This difference between prediction at 
the average and an averaged prediction usually arises; a notable exception is 
OLS in a purely linear model that has no nonlinearities such as quadratic and 
interaction terms. 


Predictive margins by gender can be computed at specified values of 
some of or all the remaining regressors, using the at() option. Here we 
obtain predictive margins by gender with values specified for all the 
remaining regressors. 


. * Predictive margins by categorical regressor gender at specified values 
. margins female, at(suppins=1 phylim=1 actlim=1 totchr=2 age=70 income=50) 
>     nofvlabel 

Adjusted predictions                                     Number of obs = 3,064
Model VCE: Robust 

Expression: Linear prediction, predict() 
At: suppins = 1 
    phylim  = 1 
    actlim  = 1 
    totchr  = 2 
    age     = 70 
    income  = 50 

             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |
          0  |   13225.83   793.4232    16.67   0.000     11670.13    14781.53
          1  |   12423.52   772.1657    16.09   0.000      10909.5    13937.53


In this example, evaluation was for a person with functional limitations, 
activity limitations, and two chronic conditions. As expected, this leads to 
above-average predicted medical expenditures. Again, expenditures are 
higher for men. 


4.4.5 Predictive margins for continuous variables 


We now consider a continuous variable, age, where by “continuous” we 
mean a variable that is not specified to be a categorical factor variable in the 
original OLS regression. 


Then the command margins age fails. Instead, we need to use the at() 
option to evaluate the predictive margins for various values of variable age. 


The following example obtains predictive margins at ages 65, 75, and 85, 
with the remaining variables evaluated at their sample values. 


. * Margins at 3 different ages (averaged over other regressors' sample values) 
. margins, at(age=(65(10)85)) 

Predictive margins                                       Number of obs = 3,064
Model VCE: Robust 

Expression: Linear prediction, predict() 
1._at: age = 65 
2._at: age = 75 
3._at: age = 85 

             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   7168.639   562.4323    12.75   0.000     6065.855    8271.423
          2  |   7358.576   282.7122    26.03   0.000      6804.25    7912.901
          3  |    5658.11   457.4907    12.37   0.000     4761.089    6555.131


For each specified value of age, the predictive margins are computed by 
evaluating all other variables at their values for each observation and then 
averaging. This yields the apparently surprising result that medical 
expenditures eventually decrease with age. This result arises because the 
fitted quadratic relationship \(1342.3 \times \text{age} - 9.452 \times \text{age}^2\) has a turning 
point at \(\text{age} = 1342.3/(2 \times 9.452) = 71.01\), so predicted totexp declines with age 
for age > 71. The raw data show expenditures decline in age beginning in 
the early 80s. 
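

The turning-point arithmetic can be checked directly, and a delta-method standard error for a turning point can be obtained with nlcom. The first line below uses the coefficients reported in the regression output above; the nlcom part is a sketch that assumes the auto dataset shipped with Stata as a stand-in, with hypothetical names turning_point and TP.

```stata
* Turning point implied by the reported coefficients on age and age squared
display 1342.275/(2*9.452011)

* Sketch only: same calculation with a standard error, on stand-in data
sysuse auto, clear
quietly regress price c.weight##c.weight
nlcom (turning_point: -_b[weight]/(2*_b[c.weight#c.weight]))
matrix TP = r(b)
```

Whether the turning point is a maximum or a minimum depends on the sign of the coefficient on the squared term; in the medical expenditure example it is negative, so the point is a maximum.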


4.4.6 Plots of predictive margins 


The marginsplot command graphs the result of the immediately preceding 
margins command. The command options provide different ways of 
presenting the graphs, including plots, subgraphs, and graphs over groups of 
variables used in the preceding margins command. The default is to include 
95% confidence intervals for the predictive mean in the graphs. The noci 
option omits these. 


We first plot the overall predictive margin as age varies in 5-year 
increments from 65 to 90 years. 


. * Marginsplot for predictive margins by age 
. qui margins, at(age=(65(5)90)) 


. marginsplot 


Variables that uniquely identify margins: age 


The plot is given in the first panel of figure 4.1. As expected, the 
confidence intervals are widest at the sample extreme ages of 65 and 90. 


Figure 4.1. Predictive margins 1) by age and 2) by age and gender 


The over(female) option of the margins command leads to separate 
computation by gender of the predictive margins. We add this option to the 
preceding example with age varying in 5-year increments from 65 to 90 
years. 


. * Marginsplot for predictive margins by age for each gender 
. margins, at(age=(65(5)90)) over(female) nofvlabel 

Predictive margins                                       Number of obs = 3,064
Model VCE: Robust 

Expression: Linear prediction, predict() 
Over:       female 
1._at: 0.female 
           age = 65 
       1.female 
           age = 65 
2._at: 0.female 
           age = 70 
       1.female 
           age = 70 
3._at: 0.female 
           age = 75 
       1.female 
           age = 75 
4._at: 0.female 
           age = 80 
       1.female 
           age = 80 
5._at: 0.female 
           age = 85 
       1.female 
           age = 85 
6._at: 0.female 
           age = 90 
       1.female 
           age = 90 

             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
  _at#female |
        1 0  |   7529.769   600.8899      6351.579    8707.958
        1 1  |   6906.739   596.8414      5736.487     8076.99
        2 0  |   7861.037   376.1116       7123.58    8598.495
        2 1  |   7238.007   330.6136       6589.76    7886.255
        3 0  |   7719.705   388.9309      6957.113    8482.298
        3 1  |   7096.675   317.0534      6475.016    7718.335
        4 0  |   7105.773   390.0599      6340.966     7870.58
        4 1  |   6482.743   302.5363      5889.548    7075.938
        5 0  |    6019.24   535.7552      4968.763    7069.717
        5 1  |    5396.21   474.6123      4465.618    6326.802
        6 0  |   4460.106   1039.299       2422.31    6497.903
        6 1  |   3837.076   1012.879      1851.084    5823.069


The plot is given in the second panel of figure 4.1. The 95% confidence 
intervals are overlapping, suggesting no statistically significant difference by 
gender at each age. 


The confidence intervals for the different genders can be more clearly 
seen by using the by(female) option of the marginsplot command to 
produce separate graphs by gender. 


. * Marginsplot for predictive margins by age, separate graphs by gender 
. qui margins, at(age=(65(5)90)) over(female) 


. marginsplot, by(female) 


Variables that uniquely identify margins: age female 


Figure 4.2 presents side-by-side graphs by gender that were combined in 
the second panel of figure 4.1. 

Figure 4.2. Predictive margins 1) by age for men and 2) by age for women 


4.4.7 Pairwise comparisons of predictive margins 


The margins command can produce several predictive margins. Interest may 
lie in calculating differences across these predictive margins and testing 
whether the differences are statistically significant. 


The margins, pwcompare command performs pairwise comparisons 
across the levels of factor variables or variables defined by the at() option 
or over() option from the most recent margins command. If the margins 
command yields K predictive margins, then there are \(K!/\{2!(K-2)!\}\) 
unique pairwise comparisons. The default output includes the pairwise 
differences in predictive margins and associated standard errors and 
confidence intervals. 
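

The count of comparisons can be checked with Stata's comb() function, and the stored matrix of pairwise differences has one column per comparison. The sketch below assumes the auto dataset shipped with Stata as a stand-in; the matrix name D is hypothetical.

```stata
* Number of unique pairs among K margins: K!/{2!(K-2)!} = comb(K, 2)
display comb(3, 2)
display comb(6, 2)

* Sketch only: auto.dta (shipped with Stata) stands in for the book's data
sysuse auto, clear
quietly regress price c.weight i.foreign

* Three margins (one per weight), hence comb(3,2) pairwise differences
margins, at(weight=(2000 3000 4000)) pwcompare
matrix D = r(b_vs)
display "pairwise differences reported: " colsof(D)
```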


As an example, consider comparison of the PM evaluated at variable 
income equaling 10, 20, and 30 (recall that income is measured in thousands 
of dollars per year). 


We begin by obtaining the PMs at the various values of income. 


. * Predictive margins for three income values 
. margins, at(income=(10(10)30)) 

Predictive margins                                       Number of obs = 3,064
Model VCE: Robust 

Expression: Linear prediction, predict() 
1._at: income = 10 
2._at: income = 20 
3._at: income = 30 

             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   6915.386   227.0065    30.46   0.000     6470.285    7360.487
          2  |    7025.69   202.6934    34.66   0.000     6628.261     7423.12
          3  |   7135.994   211.7422    33.70   0.000     6720.823    7551.166


There are then three unique pairwise comparisons: income 20 versus 10, 
income 30 versus 10, and income 30 versus 20. For each comparison, the 
pwcompare option default yields the difference in PMs and its standard error 
and associated 95% confidence interval. The pwcompare(effects) option 
additionally reports a t-test statistic and its associated p-value. 


. * Pairwise comparison of predictive margins for three income values 
. margins, at(income=(10(10)30)) pwcompare(effects) 

Pairwise comparisons of predictive margins               Number of obs = 3,064
Model VCE: Robust 

Expression: Linear prediction, predict() 
1._at: income = 10 
2._at: income = 20 
3._at: income = 30 

             |            Delta-method    Unadjusted           Unadjusted
             |   Contrast   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         _at |
     2 vs 1  |   110.3041   84.25407     1.31   0.191    -54.89631    275.5045
     3 vs 1  |   220.6082   168.5081     1.31   0.191    -109.7926     551.009
     3 vs 2  |   110.3041   84.25407     1.31   0.191    -54.89631    275.5045


From the preceding command, the predictive margins were 6915.39 for 
income 10, 7025.69 for income 20, and 7135.99 for income 30. Thus, the 
first contrast equals 7025.69 - 6915.39 = 110.30, and so on. The output 
gives the three contrasts and additionally provides standard errors, a t-test 
statistic and its p-value, and a 95% confidence interval. For the first contrast, 
t = 110.30/84.25 = 1.31. At level 0.05, there is no statistically significant 
difference by income for any of the pairwise comparisons because all three 
test statistics have p > 0.05. The test statistics take the same value at each 
level of income because of the linearity of this example; more generally, 
they will differ. 


For the (0, 1) binary indicator variable, the commands margins female, 
pwcompare and margins, at(female=(0 1)) pwcompare yield identical 
results. Both of these commands compute the margins with everyone treated 
as male and the margins with everyone treated as female and then compare 
the two margins. The command margins, over(female) pwcompare 
instead computes the margins for observations that are male in the dataset 
and the margins for observations that are female in the dataset and then 
compares the two margins. 
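

The distinction can be sketched as follows, assuming the auto dataset shipped with Stata as a stand-in (foreign plays the role of female; the matrix and scalar names are hypothetical). With OLS and the indicator included in the model, residuals sum to zero within each group, so the over() margins reproduce the raw group means.

```stata
* Sketch only: auto.dta (shipped with Stata) stands in for the book's data
sysuse auto, clear
quietly regress price c.weight i.foreign

* Every car treated as domestic, then every car treated as foreign
quietly margins foreign
matrix M_all = r(b)

* Domestic cars only, then foreign cars only, each at their own sample values
quietly margins, over(foreign)
matrix M_over = r(b)

quietly summarize price if foreign == 0, meanonly
scalar ybar0 = r(mean)
display "margins foreign, level 0:        " M_all[1,1]
display "margins, over(foreign), level 0: " M_over[1,1]
display "raw mean of price, foreign == 0: " ybar0
```

The first number answers a counterfactual question about the whole sample, whereas the second describes the subsample as observed.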


A more complicated example compares income-gender combinations. 
Using three income categories and the two gender categories, we have 


. * Pairwise comparison of predictive margins for six income-gender combinations 
. margins female, at(income=(10(10)30)) pwcompare(effects) nofvlabel 

Pairwise comparisons of predictive margins               Number of obs = 3,064
Model VCE: Robust 

Expression: Linear prediction, predict() 
1._at: income = 10 
2._at: income = 20 
3._at: income = 30 

                 |            Delta-method    Unadjusted           Unadjusted
                 |   Contrast   std. err.      t    P>|t|     [95% conf. interval]
-----------------+----------------------------------------------------------------
      _at#female |
  (1 1) vs (1 0) |  -1419.269   477.7069    -2.97   0.003    -2355.928   -482.6092
  (2 0) vs (1 0) |   20.90179   112.7985     0.19   0.853    -200.2669    242.0705
  (2 1) vs (1 0) |  -1244.128   457.1656    -2.72   0.007    -2140.511   -347.7445
  (3 0) vs (1 0) |   41.80358   225.5971     0.19   0.853    -400.5338     484.141
  (3 1) vs (1 0) |  -1068.987   465.6923    -2.30   0.022    -1982.089   -155.8848
  (2 0) vs (1 1) |   1440.171   445.1906     3.23   0.001     567.2671    2313.074
  (2 1) vs (1 1) |   175.1409   116.3454     1.51   0.132    -52.98229    403.2641
  (3 0) vs (1 1) |   1461.072   440.0368     3.32   0.001      598.274    2323.871
  (3 1) vs (1 1) |   350.2818   232.6908     1.51   0.132    -105.9646    806.5283
  (2 1) vs (2 0) |   -1265.03   421.6034    -3.00   0.003    -2091.685   -438.3746
  (3 0) vs (2 0) |   20.90179   112.7985     0.19   0.853    -200.2669    242.0705
  (3 1) vs (2 0) |  -1089.889    429.391    -2.54   0.011    -1931.813   -247.9641
  (3 0) vs (2 1) |   1285.931    414.663     3.10   0.002     472.8846    2098.978
  (3 1) vs (2 1) |   175.1409   116.3454     1.51   0.132    -52.98229    403.2641
  (3 1) vs (3 0) |   -1110.79   421.1068    -2.64   0.008    -1936.472    -285.109


There are six income-gender combinations and 6!/(2!4!) = 15 unique pairwise 
comparisons. The first comparison is that of (10, female) to (10, male) with 
95% confidence interval [-2356, -483]. For those with an annual income of 
$10,000, there is a statistically significant difference between female and 
male medical expenditures at level 0.05 because p = 0.003 < 0.05. 


4.4.8 Contrasts of predictive margins 


As already noted, the margins command can produce several predictive 
margins. Interest may lie in calculating differences across these margins and 
testing whether they are statistically significant. 


The margins, contrast command performs comparisons of predictive 
margins across the levels of categorical factor variables. These comparisons 
are done separately at each value of variables specified in the at() option or 
over() option, should these options be used. It can lead to fewer contrasts 
than if the margins, pwcompare command is used, as illustrated below. 


We first contrast predictive margins by gender. The default for margins, 
contrast is to report only F-test statistics of any contrasts. The contrast(ci) 
option additionally lists the difference in predictive margins. We have 


. * Contrasts of predictive margins for gender 
. margins female, contrast(ci) nofvlabel 

Contrasts of predictive margins                          Number of obs = 3,064
Model VCE: Robust 

Expression: Linear prediction, predict() 

             |         df           F        P>F
-------------+----------------------------------
      female |          1        8.70     0.0032
             |
 Denominator |       3054

             |            Delta-method
             |   Contrast   std. err.     [95% conf. interval]
-------------+------------------------------------------------
      female |
(1 vs base)  |   -1226.86   415.9167     -2042.365   -411.3547


From output given earlier, the predictive margins (with other variables set to 
their sample values) were 7764.12 for female = 0 and 6537.26 for 
female = 1, leading to a contrast of 6537.26 - 7764.12 = -1226.86. The 
contrast(ci) option reports this contrast and its standard error of 415.92. 
This leads to a t-test statistic of -1226.86/415.92 = -2.950 and hence the 
F statistic of \((-2.950)^2 = 8.70\), also given in the output. 
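

The link between the t statistic and the F statistic can be verified with a few lines of arithmetic on the reported contrast and standard error. The scalar name tstat is hypothetical; 3054 is the denominator degrees of freedom shown in the output.

```stata
* t statistic from the reported contrast and its standard error
scalar tstat = -1226.86/415.9167
display "t   = " %6.3f tstat
display "t^2 = " %6.2f tstat^2

* The F(1, 3054) test and the two-sided t(3054) test give the same p-value
display "p from F(1, 3054): " %6.4f Ftail(1, 3054, tstat^2)
display "p from t(3054):    " %6.4f 2*ttail(3054, abs(tstat))
```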


Note that in this simple example, there is only one possible pairwise 
comparison, between male and female. Consequently, the margins female, 
pwcompare(ci) command will list the same results. 


Now consider contrasts of predictive margins by gender, at each of the 
three levels of income (10, 20, and 30), using contrast(nowald) to suppress 
the table of Wald tests. We have 


. * Contrasts of predictive margins for gender at each of three income levels 
. margins female, at(income=(10(10)30)) contrast(nowald) nofvlabel 

Contrasts of predictive margins                          Number of obs = 3,064
Model VCE: Robust 

Expression: Linear prediction, predict() 
1._at: income = 10 
2._at: income = 20 
3._at: income = 30 

               |            Delta-method
               |   Contrast   std. err.     [95% conf. interval]
---------------+------------------------------------------------
    female@_at |
(1 vs base) 1  |  -1419.269   477.7069     -2355.928   -482.6092
(1 vs base) 2  |   -1265.03   421.6034     -2091.685   -438.3746
(1 vs base) 3  |   -1110.79   421.1068     -1936.472    -285.109


The first contrast is of female = 1 against female = 0 at income = 10. 
The difference in predictive margins is -1419.269 and is statistically 
significant at level 0.05 because t = -1419.269/477.7069 = -2.97. This 
same result was obtained earlier using the pwcompare(effects) option. 
There the difference (1 1) versus (1 0) equaled -1419.269 with standard 
error 477.71 and p = 0.003. However, in the current contrast example, 
only 3 of the 15 pairwise contrasts of the pwcompare example appear, 
namely, (1 1) versus (1 0), (2 1) versus (2 0), and (3 1) versus (3 0). 
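

The difference in the number of comparisons can be sketched by counting the columns of the stored results. The sketch assumes the auto dataset shipped with Stata as a stand-in (foreign in place of female, weight in place of income); the matrix names C and P are hypothetical.

```stata
* Sketch only: auto.dta (shipped with Stata) stands in for the book's data
sysuse auto, clear
quietly regress price i.foreign##c.weight

* contrast: foreign versus domestic, separately at each of the 3 weights
margins foreign, at(weight=(2000 3000 4000)) contrast(nowald)
matrix C = r(b)

* pwcompare: every pair among the 2 x 3 = 6 margins
quietly margins foreign, at(weight=(2000 3000 4000)) pwcompare
matrix P = r(b_vs)

display "contrasts: " colsof(C) "   pairwise comparisons: " colsof(P)
```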


The contrast and margins, contrast commands allow many different 
comparisons at various levels of nesting; see the Stata documentation. For 
simpler examples in which each explanatory variable changes in isolation, it 
can be easier to use the margins, dydx() command, which we present next. 


4.5 Marginal effects 


An ME, or partial effect, most often measures the effect on the conditional 
mean of y of a change in one of the explanatory variables, say, \(x_j\), when all 
other variables are unchanged. A brief introduction was given in 
section 3.5.9; further detail is provided here. 


In a model with both parameters and regressors entering linearly, the ME 
equals the relevant slope coefficient, greatly simplifying analysis. An 
example is the model \(E(y|x) = \beta_1 + \beta_2 x\), for which the ME is simply \(\beta_2\). 


For models with nonlinearity in either parameters or explanatory 
variables, the ME may no longer equal the slope coefficient. Examples include 
the quadratic model \(E(y|x) = \beta_1 + \beta_2 x + \beta_3 x^2\), the interactive model 
\(E(y|x, z) = \beta_1 + \beta_2 x + \beta_3 x \times z\), and the exponential model 
\(E(y|x) = \exp(\beta_1 + \beta_2 x)\). 


There are then two complications. First, the ME depends on the particular 
values of the regressors at which it is computed. Second, calculus and finite- 
difference methods lead to different values of the ME. There are many ways to 
compute MEs, and the researcher should explicitly state the method used. 


4.5.1 Calculus and finite-difference methods 


Using calculus methods, we see the ME of the jth explanatory variable is 


\[ \text{ME}_j = \frac{\partial E(y|\mathbf{x} = \mathbf{x}^*)}{\partial x_j} \]


For the linear model with \(E(y|x) = \beta_1 + \beta_2 x\), we obtain simply \(\text{ME}_x = \beta_2\). 
For the quadratic model with \(E(y|x) = \beta_1 + \beta_2 x + \beta_3 x^2\), however, we 
obtain \(\text{ME}_x = \beta_2 + 2\beta_3 x\). This ME is not simply the relevant parameter \(\beta_2\) (or 
\(\beta_3\)), and it varies with the point of evaluation \(x^*\). 


Calculus methods can always be applied but are not always the most 
appropriate method. In particular, for a (0, 1) indicator variable, we are 
interested in the change in the conditional mean when the indicator variable 
changes from 0 to 1. Or for the continuous variable age, we may be interested 
in the impact of age increasing from 40 to 60. 


Let \(\mathbf{x}' = (z \;\; \mathbf{w}')\), where interest lies in the ME of the scalar variable z and 
\(\mathbf{w}\) denotes all regressors other than z. Then the finite difference or discrete 
difference when z changes from \(z_1\) to \(z_2\) is 


\[ \Delta E(y) = E(y|\mathbf{w} = \mathbf{w}^*, z = z_2) - E(y|\mathbf{w} = \mathbf{w}^*, z = z_1) \]


For a (0, 1) indicator variable, this is simply 
\(E(y|\mathbf{w}^*, z = 1) - E(y|\mathbf{w}^*, z = 0)\). More generally, for a set of indicators 
for K categories, the comparison made is between each category and the base 
category. For continuous regressors, a common discrete change to consider is 
an increase of one standard deviation from the sample mean value of the 
regressor of interest. 


For the linear regression model with explanatory variables entering 
linearly, calculus and finite-difference methods give exactly the same result. 
For example, if \(E(y|x) = \beta_1 + \beta_2 x\), then the finite-difference method with a 
one-unit change from \(x^*\) to \(x^* + 1\) yields 
\(\text{ME} = \beta_2\{(x^* + 1) - x^*\} = \beta_2 = \partial E(y|x)/\partial x\). If explanatory variables enter 
nonlinearly, then this is no longer the case. For example, if 
\(E(y|x) = \beta_1 + \beta_2 x + \beta_3 x^2\), then the finite-difference method with a one-unit 
change from \(x^*\) to \(x^* + 1\) yields 
\(\text{ME} = \{\beta_2(x^* + 1) + \beta_3(x^* + 1)^2\} - \{\beta_2 x^* + \beta_3 x^{*2}\} = \beta_2 + 2\beta_3 x^* + \beta_3\), 
whereas calculus methods evaluated at \(x^*\) yield \(\text{ME} = \beta_2 + 2\beta_3 x^*\). 
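

The gap between the two methods in the quadratic case can be checked numerically. The sketch below assumes the auto dataset shipped with Stata as a stand-in; the scalar names and the evaluation point of 3,000 are arbitrary illustrative choices.

```stata
* Sketch only: auto.dta (shipped with Stata) stands in for the book's data
sysuse auto, clear
quietly regress price c.weight##c.weight
scalar b2 = _b[weight]
scalar b3 = _b[c.weight#c.weight]
scalar xs = 3000                     // evaluation point x* (arbitrary)

* Calculus ME at x*, and finite difference for a move from x* to x* + 1
scalar me_calc = b2 + 2*b3*xs
scalar me_fd   = (b2*(xs+1) + b3*(xs+1)^2) - (b2*xs + b3*xs^2)
display "calculus: " me_calc "   finite difference: " me_fd
display "gap: " me_fd - me_calc "   coefficient on squared term: " b3

* margins, dydx() uses the calculus method for a continuous regressor
quietly margins, dydx(weight) at(weight=3000)
matrix ME = r(b)
display "margins, dydx(weight) at x*: " ME[1,1]
```

The gap equals the coefficient on the squared term, as in the algebra above.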


For intrinsically nonlinear regression models, such as logit and Poisson 
models, calculus and finite-difference methods always give different results; 
see section 13.7. If interest lies in a unit change in the regressor, then in 
nonlinear models calculus methods provide only an approximation. 


4.5.2 Average marginal effect, ME at mean, and ME at a representative value 


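Using calculus methods, we see the estimated ME of a change in variable \(x_j\), 
evaluated at \(\mathbf{x}_i = \mathbf{x}_i^*\), \(i = 1, \ldots, N\), is computed as 


\[ \widehat{\text{ME}}_j = \frac{1}{N}\sum_{i=1}^{N} \left. \frac{\partial \hat{E}(y_i|\mathbf{x}_i)}{\partial x_{ij}} \right|_{\mathbf{x}_i = \mathbf{x}_i^*} \]


where \(\hat{E}(y_i|\mathbf{x}_i = \mathbf{x}_i^*)\) is the fitted value when \(\mathbf{x}_i = \mathbf{x}_i^*\). More generally, MEs 
for the quantity \(g(\mathbf{x}, \boldsymbol{\beta})\) are computed by replacing \(E(y_i|\mathbf{x}_i)\) with \(g(\mathbf{x}_i, \boldsymbol{\beta})\). 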


Three common choices of the evaluation point \(\mathbf{x}_i^*\) are 1) at sample values; 2) at 
the sample mean of the regressors; and 3) at representative values of the 
regressors. In the last two cases, there is no need to average because \(\mathbf{x}_i^*\) takes 
only one value. We use the following acronyms, where the first two follow 
Bartus (2005). 


AME   Average marginal effect                     Average of MEs at \(\mathbf{x}_i^* = \mathbf{x}_i\) 
MEM   Marginal effect at mean                     ME at \(\mathbf{x}_i^* = \bar{\mathbf{x}}\) 
MER   Marginal effect at a representative value   ME at \(\mathbf{x}_i^* = \mathbf{x}^*\) 


In most examples in this book, we compute the in-sample AME because 
this provides simple interpretation of results obtained from fitting models 
with nonlinear effects. In a nonlinear model, the average response, measured 
by the AME, differs from the response of the average individual, measured by 
the MEM. In specific applications, the MEM or the MER may be of more 
immediate interest than the AME. 


The three quantities can be computed using the margins postestimation 
command with the dydx() option. Combinations of these methods are also 
possible. For example, several regressors may be set to a representative 
value, while the remaining regressors are evaluated at sample values, and 
then the average is taken. The Stata documentation refers to such hybrids as 
conditional MEs, rather than AMEs. 


In the special case that all variables are set to a single value 
(representative or mean) for all individuals, the evaluation can be done once. 
There is no need to compute for all individuals and average. 


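Using the finite-difference method, we see the estimated ME of a change 
in scalar variable z from \(z_1\) to \(z_2\) evaluated at \(\mathbf{w}_i = \mathbf{w}_i^*\), \(i = 1, \ldots, N\), is 
computed as 


\[ \widehat{\text{ME}}_z = \frac{1}{N}\sum_{i=1}^{N} \left\{ \hat{E}(y_i|\mathbf{w}_i^*, z = z_2) - \hat{E}(y_i|\mathbf{w}_i^*, z = z_1) \right\} \]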


4.5.3 The margins, dydx() command 


MEs can be obtained with the margins, dydx(varlist) command, where 
varlist is a list of variables. The margins, dydx(*) command computes MEs 
for all explanatory variables. As with the basic margins command (see 
section 4.4.2), evaluation can be at specified values of some of or all the 
explanatory variables (the at() option) or at mean values of all explanatory 
variables (the atmeans option). The default is to estimate at sample values of 
variables. The margins, dydx() command supersedes Stata's mfx command 
and the community-contributed margeff command. 


Thus, if interest lies in the average response across the sample, one 
computes the AME using the simple command margins, dydx(*). 


And if instead the interest lies in the response of the average individual in 
the sample, one computes the MEM using the command margins, dydx(*) 
atmeans. 
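

The three variants can be placed side by side in a model where they differ. The sketch below is an illustration only: it assumes the auto dataset shipped with Stata as a stand-in, fits a logit model rather than the OLS model of this chapter, and uses arbitrary representative values (3,000 and 20); the matrix names are hypothetical.

```stata
* Sketch only: auto.dta (shipped with Stata) stands in for the book's data
sysuse auto, clear
quietly logit foreign c.weight c.mpg

* AME: ME computed for each car at its own regressor values, then averaged
quietly margins, dydx(*)
matrix AME = r(b)

* MEM: ME computed once, at the sample means of weight and mpg
quietly margins, dydx(*) atmeans
matrix MEM = r(b)

* MER: ME computed once, at chosen representative values
quietly margins, dydx(*) at(weight=3000 mpg=20)
matrix MER = r(b)

* Stack the three rows for comparison
matrix ALL = AME \ MEM \ MER
matrix rownames ALL = AME MEM MER
matrix list ALL
```

The rows differ because the logit conditional mean is nonlinear in the regressors, so the average response is not the response of the average individual.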


The default is to compute MEs using calculus methods for all variables, 
other than variables identified as factor-indicator variables, for which the 
finite-difference method is used. The option continuous instead also uses 
calculus methods for factor-indicator variables. 


The default standard errors for estimated MEs treat the regressors as fixed. 
If a population ME is desired and sample values of regressors are used in 
computing the ME, then the vce(unconditional) option additionally controls 
for variation in the regressors due to sampling; see section 13.7.9. 


If population AMEs are desired and population weights are provided, then 
one needs to compute weighted margins. If weight was used in the preceding 
estimation command, then weighted MEs are automatically given. If the 
preceding estimation command did not use weights, as is commonly the case 
in microeconometrics, but weighted MEs are desired, then one needs to 
subsequently add weight to the margins, dydx() command. 


4.5.4 Average marginal effects 


If explanatory variables enter nonlinearly, such as quadratically or 
interactively, then the original estimation command needs to enter those 
variables using factor-variable operators. For completeness, we use factor- 
variable operators to define all variables in the original OLS regression. 


The following example computes AMEs for all regressors in the same 
regression model as that used in the previous section. 


. * AMEs computed with factor variables used to define model 
. qui regress totexp i.suppins i.phylim i.actlim c.totchr c.age##c.age 
>     i.female##c.income, vce(robust) noheader 

. margins, dydx(*) nofvlabel 

Average marginal effects                                 Number of obs = 3,064
Model VCE: Robust 

Expression: Linear prediction, predict() 
dy/dx wrt:  1.suppins 1.phylim 1.actlim totchr age 1.female income 

             |            Delta-method
             |      dy/dx   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   1.suppins |    854.581   413.6253     2.07   0.039     43.56894    1665.593
    1.phylim |   2432.789   536.4219     4.54   0.000     1381.004    3484.573
    1.actlim |   3779.804   690.7463     5.47   0.000     2425.429    5134.178
      totchr |   1921.016   180.5999    10.64   0.000     1566.906    2275.125
         age |  -59.86769   39.07924    -1.53   0.126     -136.492    16.75658
    1.female |   -1226.86   415.9167    -2.95   0.003    -2042.365   -411.3547
      income |   11.03041   8.425407     1.31   0.191    -5.489631    27.55045

Note: dy/dx for factor levels is the discrete change from the base level. 


For the factor-indicator variables suppins, phylim, actlim, and female the 
AME is computed using a one-unit change from 0 to 1. For the continuous 
variables totchr, age, and income, calculus methods are used. The AME for 
totchr, which is continuous and does not appear interactively, is computed 
using calculus methods regardless of whether it is entered as totchr or as 
c.totchr. 


Because the variables suppins, phylim, actlim, and totchr enter 
noninteractively, the AMEs for these variables equal the OLS slope coefficients 
given in the preceding section. For the remaining variables age, female, and 
income that appear interactively, the computation of AMEs requires additional 
analysis. 


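The following example manually computes the same AMEs for variables 
age, female, and income. We use 

\[
\begin{aligned}
\mathrm{ME}_{age} &= \hat{\beta}_{age} + 2\hat{\beta}_{age\text{-}squared} \times age \\
\mathrm{ME}_{female} &= \hat{\beta}_{female} + \hat{\beta}_{female \times income} \times income \\
\mathrm{ME}_{income} &= \hat{\beta}_{income} + \hat{\beta}_{female \times income} \times female
\end{aligned}
\]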


. * AMEs computed manually 
. generate meage = _b[age] + 2*_b[c.age#c.age]*age 
. generate mefemale = _b[1.female] + _b[1.female#c.income]*income 
. generate meincome = _b[income] + _b[1.female#c.income]*female 

. sum meage mefemale meincome 


Variable Obs Mean Std. dev. Min Max 

meage 3,064 -59.86769 120.4742 -359.0868 113.5138 
mefemale 3,064 -1226.86 347.5765 -1588.932 3245.848 
meincome 3,064 11.03041 7.614758 2.090179 17.51409 


The complex names for stored coefficients such as _b[1.female#c.income] 
were obtained by first using the coeflegend option of the regress command 
or by giving the command matrix list e(b) following the regress 
command. 


The resulting AMEs are the same as those obtained more simply using the 
margins, dydx(*) command. Besides simplicity, the margins, dydx(*) 
command has the advantage of reporting valid standard errors for the AMEs. 


For factor-indicator variables, the margins, dydx(*) command uses the 
finite-difference method. Finite differences can equivalently be calculated 
using the margins, contrast command. In fact, from the previous section, 
the command margins female, contrast(ci) calculated the same difference 
of -1226.86 with the same standard error of 415.9167. The margins, 
contrast command allows comparisons of PMs when combinations of 
variables change, whereas margins, dydx(*) allows only isolated changes in 
each variable. 


What if the margins, dydx(*) command were used without first using 
factor variables to define the model? The following example does this, where 
first we need to generate variables for the quadratic regressor age squared 
and the interactive regressor female × income. We have 


. * Wrong way to compute (average) MEs if any nonlinearity present 
. generate agesq = age^2 


. generate fembyinc = female*income 


. qui regress totexp suppins phylim actlim totchr age agesq 
> female income fembyinc, vce(robust) 


. margins, dydx(*) 


Average marginal effects Number of obs = 3,064 
Model VCE: Robust 


Expression: Linear prediction, predict() 
dy/dx wrt: suppins phylim actlim totchr age agesq female income fembyinc 


Delta-method 


dy/dx std. err. t P>|t| [95% conf. interval] 

suppins 854.581 413.6253 2.07 0.039 43.56894 1665.593 
phylim 2432.789 536.4219 4.54 0.000 1381.004 3484.573 
actlim 3779.804 690.7463 5.47 0.000 2425.429 5134.178 
totchr 1921.016 180.5999 10.64 0.000 1566.906 2275.125 
age 1342.275 763.421 1.76 0.079 -154.5957 2839.146 
agesq -9.452011 5.041687 -1.87 0.061 -19.33745 .433432 
female -1573.508 573.3188 -2.74 0.006 -2697.638 -449.3781 
income 2.090179 11.27985 0.19 0.853 -20.02669 24.20705 
fembyinc 15.42391 15.81709 0.98 0.330 -15.58931 46.43713 


The underlying OLS estimates are suppressed but are the same as before. But 
now the AMEs for variables age, female, and income are markedly different, 
and there are additional AMEs for the regressors agesq and fembyinc. This is 
the wrong approach. For example, the AME for age is computed assuming 
agesq does not change, even though age has changed. 


At the same time, the AMEs are unchanged for the first four variables that 
did not appear nonlinearly via a quadratic or interactive term. This 
equivalence for the categorical regressors suppins, phylim, and actlim is a 
special property of a linear model because then the finite-difference and 
calculus methods give the same ME. Different MEs would arise for a nonlinear 
model such as the logit model. 
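
To see why the two methods can differ outside the linear model, consider a sketch for a logit model with a single indicator regressor $d$ and other regressors $\mathbf{w}$. The notation $\Lambda$, $\beta_d$, and $\boldsymbol{\gamma}$ is introduced here for illustration only and is not taken from the surrounding text.

```latex
% Logit: \Pr(y = 1 \mid d, \mathbf{w}) = \Lambda(\beta_d d + \mathbf{w}'\boldsymbol{\gamma}),
% where \Lambda(z) = e^z / (1 + e^z).
\begin{aligned}
\text{Calculus ME:} \quad
  & \frac{\partial \Lambda(\beta_d d + \mathbf{w}'\boldsymbol{\gamma})}{\partial d}
    = \Lambda(\cdot)\{1 - \Lambda(\cdot)\}\,\beta_d \\
\text{Finite-difference ME:} \quad
  & \Lambda(\beta_d + \mathbf{w}'\boldsymbol{\gamma}) - \Lambda(\mathbf{w}'\boldsymbol{\gamma}) \\
\text{Linear model, both methods:} \quad
  & (\beta_d + \mathbf{w}'\boldsymbol{\gamma}) - \mathbf{w}'\boldsymbol{\gamma} = \beta_d
\end{aligned}
```

In the logit case, the first expression depends on the point of evaluation through $\Lambda(\cdot)\{1 - \Lambda(\cdot)\}$, while the second is a difference of two probabilities, so the two generally do not coincide.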


4.5.5 Marginal effect at mean 


The MEM is conceptually quite different from the AME because the ME for the 
average individual in general does not equal the average of the MEs for each 
individual. 


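The two are equal in a purely linear model. They are also equal in this 
particular example because of the particular form of nonlinearity specified. 
Consider a model with quadratic terms and an interaction, with 

\[
\hat{y} = \hat{\beta}_1 + \hat{\beta}_2 x + \hat{\beta}_3 x^2 + \hat{\beta}_4 z + \hat{\beta}_5 z^2 + \hat{\beta}_6 xz .
\]

Then $\mathrm{ME}_x = \hat{\beta}_2 + 2\hat{\beta}_3 x + \hat{\beta}_6 z$. So 

\[
\mathrm{AME}_x = \frac{1}{N}\sum_{i=1}^{N} \left( \hat{\beta}_2 + 2\hat{\beta}_3 x_i + \hat{\beta}_6 z_i \right)
= \hat{\beta}_2 + 2\hat{\beta}_3 \bar{x} + \hat{\beta}_6 \bar{z} .
\]

But this last expression equals $\mathrm{MEM}_x$. Similarly, 
$\mathrm{AME}_z = \mathrm{MEM}_z$. Note that there would be a 
difference if we used finite-difference rather than calculus methods. 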


Thus, in this example, margins, dydx(*) atmeans gives exactly the same 
estimates and standard errors as already obtained using command margins, 
dydx(*). 
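
The equality can also be checked by hand. The sketch below, in do-file style, refits the factor-variable model (the most recent regress above used the generated variables agesq and fembyinc) and evaluates the ME formulas for age and income at sample means.

```stata
* Sketch: ME evaluated at sample means, for comparison with the AMEs
quietly regress totexp i.suppins i.phylim i.actlim c.totchr c.age##c.age ///
    i.female##c.income, vce(robust)

* ME for age at the sample mean of age
quietly summarize age if e(sample)
display "ME of age at mean age:       " _b[age] + 2*_b[c.age#c.age]*r(mean)

* ME for income at the sample mean of female
quietly summarize female if e(sample)
display "ME of income at mean female: " _b[income] + _b[1.female#c.income]*r(mean)
```

Because each ME is linear in the variable at which it is evaluated, these two values should reproduce the AMEs for age and income reported in section 4.5.4, up to rounding.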


4.5.6 Marginal effect at a representative value 


The following example computes the MER for each variable when all 
variables are set to specified values for a woman aged 70 years with 

supplemental insurance, physical and activity limitation, two chronic 
conditions, and annual income of $20,000. 


. * MER computed with factor variables 
. qui regress totexp i.suppins i.phylim i.actlim c.totchr c.age##c.age 
> i.female##c.income, vce(robust) noheader 


. margins, dydx(*) at(suppins=1 phylim=1 actlim=1 totchr=2 age=70 female=1 
> income=20) nofvlabel 


Conditional marginal effects Number of obs = 3,064 
Model VCE: Robust 


Expression: Linear prediction, predict() 
dy/dx wrt: 1.suppins 1.phylim 1.actlim totchr age 1.female income 
At: suppins = 1 
phylim = 1 
actlim = 1 
totchr = 2 
age = 70 
female = 1 
income = 20 


Delta-method 
dy/dx std. err. t P>|t| [95% conf. interval] 
1.suppins 854.581 413.6253 2.07 0.039 43.56894 1665.593 
1.phylim 2432.789 536.4219 4.54 0.000 1381.004 3484.573 
1.actlim 3779.804 690.7463 5.47 0.000 2425.429 5134.178 
totchr 1921.016 180.5999 10.64 0.000 1566.906 2275.125 
age 18.99368 67.30506 0.28 0.778 -112.9741 150.9615 
1.female -1265.03 421.6034 -3.00 0.003 -2091.685 -438.3746 
income 17.51409 11.63454 1.51 0.132 -5.298229 40.32641 


Note: dy/dx for factor levels is the discrete change from the base level. 


The MERs for the variables age, female, and income that entered interactively 
differ from the AMEs presented earlier. The MERs equal the AMEs for the first 
four variables that did not appear nonlinearly via a quadratic or interactive 
term. 


More generally, for other estimation commands such as logit, the MERs 
would differ from the AMEs for all variables in the regression. 


4.5.7 Elasticities 


The preceding MEs estimate the change in the conditional mean as 
explanatory variables change. These MEs use changes in levels. In some 
cases, it may be more informative to instead estimate elasticities that use 
proportionate or percentage changes. For example, we may be interested in 
the percentage change in medical expenditures in response to a 1% change in 
income. 


Elasticities and semielasticities can be computed using the eyex(), 
dyex(), and eydx() options of the margins command. Again, these can be 
computed at specified values of the regressors or at sample values for each 
observation and then averaged. As explained below, it is better to evaluate 
elasticities and semielasticities at specified values of the regressors. 


The elasticity of $y$ with respect to $x$ is $\partial y/\partial x \times (x/y)$. 
Because the elasticity can be rewritten as $(\partial y/y)/(\partial x/x)$, it is 
interpreted as the proportionate change in $y$ divided by the proportionate 
change in $x$ or, equivalently, as the percentage change in $y$ in response to 
a 1% change in $x$. To compute the elasticity, we use the eyex() option of the 
margins command. 


Let $\mathbf{x}_i^*$ denote the value at which the $i$th observation is 
evaluated. For example, if all variables are evaluated at sample means, then 
$\mathbf{x}_i^* = \bar{\mathbf{x}}$, while if all variables are evaluated at 
their sample values, then $\mathbf{x}_i^* = \mathbf{x}_i$. Then, using 
calculus methods, the eyex() option of the margins command computes the 
elasticity with respect to variable $x_j$ as 

\[
\frac{1}{N}\sum_{i=1}^{N} \left. \frac{\partial \hat{y}_i}{\partial x_{ij}} \right|_{\mathbf{x}_i^*} \times \frac{x_{ij}^*}{\hat{y}_i^*}
\]

where $\hat{y}_i^* = \hat{E}(y_i \mid \mathbf{x}_i = \mathbf{x}_i^*)$. 


The following example computes the sample average elasticity of totexp 
with respect to variable income: 


. * Sample average elasticity with respect to income 
. margins, eyex(income) 


Average marginal effects Number of obs = 3,064 
Model VCE: Robust 


Expression: Linear prediction, predict() 
ey/ex wrt: income 


Delta-method 
ey/ex std. err. t P>|t| [95% conf. interval] 


income -.0474743 13.56076 -0.00 0.997 -26.63662 26.54167 


On average, a 1% change in income is associated with a 0.047% reduction in 
medical expenditures. This estimate is very noisy with a very large standard 
error and is highly statistically insignificant. 


The following code manually computes the same average elasticity. 


. * Manual computation of sample average elasticity with respect to income 
. predict yhat 
(option xb assumed; fitted values) 


. generate eincome = (_b[income]+_b[1.female#c.income]*female) * income / yhat 
. sum eincome 


Variable | Obs Mean Std. dev. Min Max 


eincome | 3,064 -.0474743 3.839305 -199.924 3.528567 


Again, the average estimated elasticity is -0.04747. But note that it varies 
greatly across observations, ranging from an implausibly large negative 
-199.924 to an implausibly large positive 3.529. This is most likely due to 
division by $\hat{y}_i$ when $\hat{y}_i$ takes values close to zero. 
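
One way to investigate this is to look directly at the fitted values. The following sketch assumes that yhat and eincome were created as in the preceding listing; the cutoff of 10 used to flag extreme elasticities is an arbitrary choice for illustration.

```stata
* Sketch: inspect the fitted values behind the extreme elasticities
* (assumes yhat and eincome exist from the manual computation above)
summarize yhat, detail

* number of observations with a nonpositive fitted value
count if yhat <= 0

* list observations with very large elasticities in absolute value
list totexp income yhat eincome if abs(eincome) > 10, noobs
```

If the listed observations have fitted values near zero, that supports the explanation given above.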


It can therefore be better to compute the elasticity at specified values of 
regressors, either representative values or the mean value of regressors. For 
example, 


. * Elasticity with respect to income evaluated at representative values 
. margins, eyex(income) at(suppins=1 phylim=1 actlim=1 totchr=2 age=70 
> female=1 income=20) 


Conditional marginal effects Number of obs = 3,064 
Model VCE: Robust 


Expression: Linear prediction, predict() 
ey/ex wrt: income 


At: suppins = 1 
phylim = 1 
actlim = 1 
totchr = 2 
age = 70 
female = 1 
income = 20 


Delta-method 
ey/ex std. err. t P>|t| [95% conf. interval] 


income .0294402 .0197906 1.49 0.137 -.0093641 .0682444 


When evaluation is at the specified values of all regressors, a 1% increase in 
income is associated with a 0.0294% increase in medical expenditures. This 
estimate is much more precise than the preceding sample average 
elasticity, though it is still statistically insignificant at level 0.05. 


Elasticities are not meaningful for categorical variables, because it is not 
meaningful to consider, for example, the effect of a 1% change in gender. 
Accordingly, the eyex() option is not available for factor-indicator variables. 


4.5.8 Semielasticities 


For indicator variables such as female or actlim, it is more meaningful to 
consider the proportionate change in medical expenditures when there is a 
one-unit change in the indicator variable. Such semielasticities can be 
obtained using the eydx() option. 


For example, doing so at specified values of the variables yields 


. * Semielasticities: Proportionate change in y w.r.t. unit change in x 
. margins, eydx(*) at(suppins=1 phylim=1 actlim=1 totchr=2 age=70 
> female=1 income=20) nofvlabel 


Conditional marginal effects Number of obs = 3,064 
Model VCE: Robust 


Expression: Linear prediction, predict() 
ey/dx wrt: 1.suppins 1.phylim 1.actlim totchr age 1.female income 
At: suppins = 1 
phylim = 1 
actlim = 1 
totchr = 2 
age = 70 
female = 1 
income = 20 


Delta-method 
ey/dx std. err. t P>|t| [95% conf. interval] 
1.suppins .074535 .0360862 2.07 0.039 .0037794 .1452906 
1.phylim .2287452 .0551654 4.15 0.000 .1205802 .3369102 
1.actlim .3822587 .069224 5.52 0.000 .2465284 .5179889 
totchr .1614558 .0186426 8.66 0.000 .1249024 .1980091 
age .0015964 .005679 0.28 0.779 -.0095386 .0127313 
1.female -.101041 .0338938 -2.98 0.003 -.1674981 -.034584 
income .001472 .0009895 1.49 0.137 -.0004682 .0034122 


Note: ey/dx for factor levels is the discrete change from the base level. 


For example, those with an activity limitation have medical expenditures that 
are a proportion 0.38 larger, or 38% larger, than medical expenditures for 
those without an activity limitation. And a one-unit change in variable 
income (a $1,000 increase in annual income) is associated with a 
100 x 0.001472 = 0.1472% increase in medical expenditures. 


Interest may also lie in the impact on the level of the dependent variable 
of a proportionate change in a regressor. Such semielasticities can be 
obtained using the dyex() option, which computes for the $j$th variable the 
quantity $(1/N)\sum_{i=1}^{N} (\partial \hat{y}_i/\partial x_{ij})|_{\mathbf{x}_i^*} \times x_{ij}^*$. 
For indicator variables such as female, 
it makes no sense to consider a proportionate change, and the dyex() option 
is not available for factor-indicator variables. 


Continuing the current example, we have 


. * Semielasticities: Change in y w.r.t. proportionate change in x 
. margins, dyex(totchr age income) at(suppins=1 phylim=1 actlim=1 totchr=2 
> age=70 female=1 income=20) 

Conditional marginal effects Number of obs = 3,064 
Model VCE: Robust 


Expression: Linear prediction, predict() 
dy/ex wrt: totchr age income 
At: suppins = 1 
phylim = 1 
actlim = 1 
totchr = 2 
age = 70 
female = 1 
income = 20 


Delta-method 


dy/ex std. err. t P>|t| [95% conf. interval] 

totchr 3842.031 361.1998 10.64 0.000 3133.812 4550.251 
age 1329.557 4711.354 0.28 0.778 -7908.188 10567.3 
income 350.2818 232.6908 1.51 0.132 -105.9646 806.5283 


For example, a one-unit proportionate change or 100% change in the number 
of chronic conditions is associated with a $3,842 increase in medical 
expenditures. So a 1% change in the number of chronic conditions is 
associated with a $38 increase in medical expenditures. 
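
For the continuous regressors, the four sets of results at the representative values are linked by simple arithmetic. The following worked sketch uses only numbers reported in the output above; the fitted value $\hat{y}^*$ at the representative values is not printed in the output and is inferred here, so it should be read as an approximation.

```latex
\begin{aligned}
\text{dy/ex} &= \frac{\partial \hat{y}}{\partial x_j} \times x_j^* :
  & 17.51409 \times 20 &= 350.2818 \quad (\text{income}) \\
 & & 1921.016 \times 2 &\approx 3842.03 \quad (\text{totchr}) \\
 & & 18.99368 \times 70 &\approx 1329.56 \quad (\text{age}) \\[4pt]
\text{ey/dx} &= \frac{\partial \hat{y}}{\partial x_j} \Big/ \hat{y}^* :
  & 17.51409 / 0.001472 &\approx 11{,}898 = \hat{y}^* \quad (\text{implied}) \\
 & & 1921.016 / 11{,}898 &\approx 0.1615 \quad (\text{totchr}) \\[4pt]
\text{ey/ex} &= \frac{\partial \hat{y}}{\partial x_j} \times \frac{x_j^*}{\hat{y}^*} :
  & 350.2818 / 11{,}898 &\approx 0.0294 \quad (\text{income})
\end{aligned}
```

These relations apply to variables for which calculus methods are used; for factor-indicator variables, the output notes that a discrete change from the base level is reported instead.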


4.6 Regression decomposition analysis 


Often, regression data cover more than one group. Examples of dichotomous 
groupings include wage or earnings data that cover both union and nonunion 
workers and earnings data that cover both male and female workers or 
natives and immigrants. 


Group membership is often heterogeneous in terms of observable 
characteristics such as age, experience, educational attainment, and so forth. 
When there are significant differences in the average market outcomes of 
groups, one is often interested in quantifying the relative contributions of 
observable characteristics and unobservable factors to this difference. For 
example, to what extent can lower average earnings of women compared 
with men be explained by observable characteristics? In the regression 
context, such analysis is called regression decomposition analysis. 


4.6.1 Oaxaca-Blinder decomposition 


A well-established framework for carrying out regression decomposition 
analysis is the so-called Oaxaca-Blinder decomposition of earnings or log 
earnings. This decomposition has been applied to study male-female wage 
differentials and union-nonunion wage differentials and also more 
specifically used in the context of analysis of gender discrimination. See 


well as many modern interpretations and extensions. 


Decomposition analysis can be interpreted as a form of counterfactual 
analysis that compares an actual outcome with an alternative hypothetical 
outcome generated under a different scenario or assumptions, as is standard 
in the potential-outcome model framework presented in section 24.2. 


For specificity, we consider a standard linear regression model. Let $y_i^g$ 
denote the outcome variable, here the natural logarithm of annual earnings, 
for individual $i$ in group $g$. And let $x_{ik}^g$ 
($i = 1,\dots,N_g$; $k = 1,\dots,K$) denote the $k$th regressor for individual 
$i$ in group $g$. For simplicity, assume two groups, here male and female 
workers with $g = F$ or $M$. The sample consists of observations on 
$(y_i^g, x_{i1}^g, \dots, x_{iK}^g)$. The regression equation of interest is 

\[
y_i^g = \beta_0^g + \sum_{k=1}^{K} x_{ik}^g \beta_k^g + u_i^g, \qquad g = M, F
\]
k=1 


which allows all coefficients in the regression to differ between groups. 


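The average group differential is $\Delta = \bar{y}^F - \bar{y}^M$, where 
$\bar{y}^g$ is the group sample average. Under the assumption that 
$E(u_i^F \mid \mathbf{x}_i^F) - E(u_i^M \mid \mathbf{x}_i^M) = 0$, 
which means that any potential selection effect is the same for the two 
groups, the sample estimate of the average male-female differential is 

\[
\hat{\Delta} = \bar{y}^F - \bar{y}^M
= \left( \hat{\beta}_0^F + \sum_{k=1}^{K} \bar{x}_k^F \hat{\beta}_k^F \right)
- \left( \hat{\beta}_0^M + \sum_{k=1}^{K} \bar{x}_k^M \hat{\beta}_k^M \right)
\]

Two alternative Oaxaca-Blinder decompositions corresponding to this 
difference are easily derived: 

\[
\begin{aligned}
\Delta_1 &= (\hat{\beta}_0^F - \hat{\beta}_0^M)
 + \sum_{k=1}^{K} (\bar{x}_k^F - \bar{x}_k^M)\hat{\beta}_k^M
 + \sum_{k=1}^{K} (\hat{\beta}_k^F - \hat{\beta}_k^M)\bar{x}_k^F \\
\Delta_2 &= (\hat{\beta}_0^F - \hat{\beta}_0^M)
 + \sum_{k=1}^{K} (\bar{x}_k^F - \bar{x}_k^M)\hat{\beta}_k^F
 + \sum_{k=1}^{K} (\hat{\beta}_k^F - \hat{\beta}_k^M)\bar{x}_k^M
\end{aligned}
\]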


The second term in each decomposition is interpreted as the component of 
the differential that is accounted for by the mean differences in the 
observable regressors. This measures the “explained” component of the 
difference. The first and third terms are attributed to unobserved differences 
between the male and female groups that lead to differences in their 
regression coefficients. These terms measure the “unexplained” components 
of the decomposition. 
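
As a check on the algebra, the first decomposition can be expanded term by term. The sketch below uses $\hat{\beta}_k^g$ for the groupwise OLS estimates and $\bar{x}_k^g$ for the group means of the regressors, and relies on the OLS property that, with an intercept, the fitted regression passes through the sample means.

```latex
\begin{aligned}
\Delta_1
 &= (\hat{\beta}_0^F - \hat{\beta}_0^M)
  + \sum_{k=1}^{K} \bar{x}_k^F \hat{\beta}_k^M
  - \sum_{k=1}^{K} \bar{x}_k^M \hat{\beta}_k^M
  + \sum_{k=1}^{K} \hat{\beta}_k^F \bar{x}_k^F
  - \sum_{k=1}^{K} \hat{\beta}_k^M \bar{x}_k^F \\
 &= \left( \hat{\beta}_0^F + \sum_{k=1}^{K} \bar{x}_k^F \hat{\beta}_k^F \right)
  - \left( \hat{\beta}_0^M + \sum_{k=1}^{K} \bar{x}_k^M \hat{\beta}_k^M \right)
  = \bar{y}^F - \bar{y}^M
\end{aligned}
```

The terms $\sum_k \bar{x}_k^F \hat{\beta}_k^M$ cancel, so the explained and unexplained components sum exactly to the raw differential. The same cancellation, with $\sum_k \bar{x}_k^M \hat{\beta}_k^F$, applies to the second decomposition.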


Whether one uses $\Delta_1$ or $\Delta_2$ depends upon which group is viewed 
as the reference group and which group is viewed as the treatment group. Yet 
another alternative is to use the regression estimates $\hat{\beta}^P$ 
obtained from the pooled sample of observations in place of either 
$\hat{\beta}^M$ or $\hat{\beta}^F$ in the second term; $\hat{\beta}^P$ can be 
interpreted as a matrix-weighted average of $\hat{\beta}^M$ and $\hat{\beta}^F$. 

Separate groupwise regressions yield the point estimates of the 
regression coefficients. If point estimates and confidence intervals are both 
required for the mean differential, then the variance estimates of sums of 
product terms like $(\bar{x}_k^F - \bar{x}_k^M)\hat{\beta}_k^M$ and 
$(\hat{\beta}_k^F - \hat{\beta}_k^M)\bar{x}_k^F$ also have to be 
computed. This step is usually based on linearized approximations such as 
the delta method. For additional detail, see Fortin, Lemieux, and 
Firpo (2011). Many additional complications arise in practice, such as 
endogenous regressors and regressors that are defined only for one of the 
groups. 


4.6.2 An empirical example 


In this empirical example, we use a subset of data from the Australian 
longitudinal survey, Medicine in Australia: Balancing Employment and Life. 
This survey provides information on the annual earnings of physicians in 
general practice. The dependent variable logyearn is the natural logarithm 
of annual earnings. The regressors are annual hours worked (yhrs); years of 
medical practice experience (expr) and its square (exprsq); fellowship of 
colleges and number of other postgraduate qualifications (fellow and 
pgradoth), which are two measures of professional qualifications; the 
number of full-time or part-time doctors in a practice (pracsize); presence 
in home of a child under five years of age (childu5); and an indicator of 
temporary resident visa (visa). 


Before we carry out a decomposition analysis, it is straightforward to 
check for differences in the two groups using a two-sample t test of 
difference in means. For example, for the dependent variable logyearn, we 
obtain 


. * t tests on difference in mean log-earnings by gender 
. qui use mus204mabeldecomp, clear 


. ttest logyearn, by(female) unequal 


Two-sample t test with unequal variances 


Group Obs Mean Std. err. Std. dev. [95% conf. interval] 
Male 8,970 12.47267 .0074711 .707592 12.45802 12.48731 
Female 5,060 11.86746 .0098909 .7035751 11.84807 11.88685 
Combined 14,030 12.25439 .0064466 .7635899 12.24176 12.26703 
diff .605207 .0123955 .5809096 .6295045 
diff = mean(Male) - mean(Female) t = 48.8249 
H0: diff = 0 Satterthwaite's degrees of freedom = 10542.9 
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000 


The null hypothesis of equality of means is easily rejected. There is a large 
gender difference of 0.61 in the average of logyearn. (Separate analysis of 
the earnings data in levels finds that average male earnings are 80% higher 
than average female earnings.) 


Part of this large earnings difference may be due to differences in 
average characteristics by gender. In particular, male doctors average 
2,377 hours worked per year versus 1,781 hours for female doctors. 


Gender difference in regressors 


We first consider gender differences in the averages of the regressors. Using 
the expanded table command introduced in Stata 17, we obtain 


. * Gender differences in variables for the regression estimation sample 
. global xlist1 yhrs expr exprsq fellow pgradoth pracsize childu5 visa 


. qui regress logyearn $xlist1 


. table () (female) if e(sample), stat(mean logyearn $xlist1) 
> stat(count logyearn) nototals 


                                                  Female 
                                               Male     Female 
Mean 
  log of annual earnings                    12.1852   11.62744 
  Annual hours worked                      2339.708   1661.385 
  Years of medical practice experience     26.18344   19.94557 
  expr squared                             803.7421   481.9785 
  Fellowship of colleges                   .5632425    .608076 
  Number of other postgrad qualifications   .546471   .6207443 
  Number of FT/PT doctors in practice: Bands 3.222572 3.426762 
  Have dep child udr 5y                    .1439553    .175772 
  Temporary resident visa                  .0230608   .0126683 
Number of nonmissing values 
  log of annual earnings                      2,862      2,526 


Because many variables have missing values, regression analysis uses a total 
of 5,388 observations, compared with 14,030 observations available for 
logyearn. On average, male doctors work 678 more hours annually and 
have 6.24 more years of experience, while women have somewhat higher 
professional qualifications. 


Gender difference in regression coefficients 


We run separate OLS regressions for male and female doctors and use the 
estimates table command to enable easy comparison of regression 
coefficients by gender. 


. * Separate earnings regressions for male and female doctors 
. qui regress logyearn $xlist1 if female==0, vce(robust) 

. estimates store MALE 

. qui regress logyearn $xlist1 if female==1, vce(robust) 

. estimates store FEMALE 

. estimates table MALE FEMALE, b(%11.5f) t(%11.2f) stats(N r2 F) 


Variable MALE FEMALE 
yhrs 0.00039 0.00056 
21.18 27.07 
expr 0.02452 0.00112 
6.35 0.26 
exprsq -0.00050 -0.00000 
-6.73 -0.02 
fellow -0.00457 0.10117 
-0.20 4.53 
pgradoth 0.03375 0.00276 
2.79 0.20 
pracsize 0.04418 0.00697 
4.82 0.69 
childu5 0.15863 0.02545 
5.43 0.92 
visa -0.06945 0.14580 
-1.43 2.06 
_cons 10.85005 10.58729 
152.44 144.30 
N 2862 2526 
r2 0.25497 0.38641 
F 83.49789 108.14843 

Legend: b/t 


The two groups differ substantially in their sensitivity to annual hours 
worked, which is the most important and statistically significant driver of the 
differential. The sizes of the estimated coefficients of the other regressors 
and their statistical significance also vary considerably by gender. 


Formal tests of coefficient equality across gender require estimation of 
$\mathrm{Var}(\hat{\beta}_M - \hat{\beta}_F)$. This depends in part on 
$\mathrm{Cov}(\hat{\beta}_M, \hat{\beta}_F)$, which is not 
computed by the separate regressions by gender. So we use the suest 
command (see section 6.8.7) to enable cross-equation inference. 


. * Tests of coefficient equality in separate regressions 
. qui regress logyearn $xlist1 if female==0 


. estimates store MALE 


. qui regress logyearn $xlist1 if female==1 


. estimates store FEMALE 


. suest MALE FEMALE 


Simultaneous results for MALE, FEMALE 


Number of obs = 5,388 


Robust 
Coefficient std. err. z P>|z| [95% conf. interval] 
MALE_mean 
yhrs .0003909 .0000184 21.21 0.000 .0003548 .000427 
expr .024517 .0038577 6.36 0.000 .0169561 .032078 
exprsq -.0004987 .000074 -6.74 0.000 -.0006437 -.0003537 
fellow -.0045728 .0226987 -0.20 0.840 -.0490615 .0399159 
pgradoth .0337479 .0120848 2.79 0.005 .0100621 .0574338 
pracsize .0441787 .0091453 4.83 0.000 .0262543 .0621032 
childu5 .1586252 .0291587 5.44 0.000 .1014751 .2157753 
visa -.0694536 .0483732 -1.44 0.151 -.1642634 .0253562 
_cons 10.85005 .0710696 152.67 0.000 10.71075 10.98934 

MALE_lnvar 
_cons -1.201407 .0429545 -27.97 0.000 -1.285597 -1.117218 

FEMALE_mean 
yhrs .0005571 .0000205 27.12 0.000 .0005168 .0005974 
expr .001117 .0043552 0.26 0.798 -.0074191 .0096532 
exprsq -2.35e-06 .0001024 -0.02 0.982 -.0002031 .0001984 
fellow .1011739 .0223206 4.53 0.000 .0574263 .1449215 
pgradoth .0027625 .0138797 0.20 0.842 -.0244411 .0299661 
pracsize .0069707 .0101001 0.69 0.490 -.0128252 .0267665 
childu5 .0254493 .0277562 0.92 0.359 -.0289519 .0798504 
visa .1457976 .0707606 2.06 0.039 .0071093 .2844859 
_cons 10.58729 .0732471 144.54 0.000 10.44372 10.73085 

FEMALE_lnvar 
_cons -1.391516 .0408461 -34.07 0.000 -1.471573 -1.311459 


. test _b[MALE_mean:yhrs] = _b[FEMALE_mean:yhrs] 


 ( 1)  [MALE_mean]yhrs - [FEMALE_mean]yhrs = 0 


           chi2(  1) =   36.26 
         Prob > chi2 =    0.0000 


The null hypothesis of gender equality of the coefficient of yhrs is strongly 
rejected. Note that suest by default gives heteroskedastic-robust standard 
errors, even though the individual regress commands did not specify the 
vce(robust) option. The standard errors and t statistics differ slightly from 
those obtained using the individual regress commands because of slightly 
different degrees-of-freedom correction. 
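
The same approach extends from a single coefficient to all coefficients at once. As a sketch, and assuming the suest results above are still the active estimates, a joint Wald test of equality across the two equations can be requested as follows.

```stata
* Sketch: joint Wald tests of coefficient equality across gender
* (run directly after the suest MALE FEMALE command above)

* all slope coefficients equal across the two equations
test [MALE_mean=FEMALE_mean]

* the same test with the intercepts included
test [MALE_mean=FEMALE_mean], constant
```

The first test leaves the intercepts unrestricted, so it addresses only differences in slopes; the second also restricts the constants to be equal.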


Given the difference in coefficients by gender, combined with the earlier 
observation that regressor averages vary by gender, we expect that a full 
decomposition will show that both the differences in regressors 
(“endowments”) and differences in coefficients will jointly account for the 
differences in average earnings of the two groups. 


Regression decomposition estimates 


We could carry out the decomposition analysis by running separate 
regressions for male and female doctors and then building up the 
components of the differential. This will require some facility with matrix 
operations in Stata. Instead, we use the community-contributed command 
oaxaca (Jann 2008), which provides the decomposition in a more compact 
form. We obtain 


. * Blinder-Oaxaca decomposition in logarithmic units 
. oaxaca logyearn $xlist1, by(female) vce(robust) 


Blinder-Oaxaca decomposition Number of obs = 5,388 
Model = linear 
Group 1: female = 0 N of obs 1 = 2862 
Group 2: female = 1 N of obs 2 = 2526 
Robust 
logyearn | Coefficient std. err. z P>|z| [95% conf. interval] 
overall 
group_1 12.1852 .0118725 1026.34 0.000 12.16193 12.20847 
group_2 11.62744 .0126594 918.48 0.000 11.60263 11.65225 
difference .5577628 .0173556 32.14 0.000 .5237466 .5917791 
endowments .3786503 .0201843 18.76 0.000 .3390898 .4182107 
coefficients .317233 .0201198 15.77 0.000 .2777989 .3566672 
interaction -.1381205 .0227893 -6.06 0.000 -.1827868 -.0934542 
endowments 
yhrs .3778961 .0176741 21.38 0.000 .3432555 .4125366 
expr .006968 .0272152 0.26 0.798 -.0463728 .0603088 
exprsq -.0007548 .0330052 -0.02 0.982 -.0654437 .0639341 
fellow -.004536 .0016885 -2.69 0.007 -.0078454 -.0012265 
pgradoth -.0002052 .0010343 -0.20 0.843 -.0022324 .0018221 
pracsize -.0014233 .0020779 -0.68 0.493 -.005496 .0026493 
childu5 -.0008097 .0009206 -0.88 0.379 -.0026141 .0009947 
visa .0015152 .0009029 1.68 0.093 -.0002545 .0032849 
coefficients 
yhrs -.2761344 .0459881 -6.00 0.000 -.3662693 -.1859995 
expr .4667255 .1163091 4.01 0.000 .2387638 .6946871 
exprsq -.239237 .0611112 -3.91 0.000 -.3590127 -.1194612 
fellow -.0643021 .0194158 -3.31 0.001 -.1023563 -.0262479 
pgradoth .0192341 .0114525 1.68 0.093 -.0032124 .0416805 
pracsize .1275032 .0467723 2.73 0.006 .0358313 .2191751 
childu5 .0234086 .0071587 3.27 0.001 .0093778 .0374394 
visa -.0027269 .0011885 -2.29 0.022 -.0050562 -.0003975 
_cons .2627621 .1022213 2.57 0.010 .0624119 .4631122 
interaction 
yhrs -.1127423 .0190288 -5.92 0.000 -.150038 -.0754466 
expr .145966 .0369083 3.95 0.000 .0736269 .218305 
exprsq -.159712 .0412702 -3.87 0.000 -.2406001 -.0788239 
fellow .004741 .0020151 2.35 0.019 .0007915 .0086904 
pgradoth -.0023014 .0015225 -1.51 0.131 -.0052854 .0006826 
pracsize -.0075975 .0030317 -2.51 0.012 -.0135395 -.0016556 
childu5 -.0042372 .0018513 -2.29 0.022 -.0078657 -.0006087 
visa -.002237 .0011792 -1.90 0.058 -.0045482 .0000741 


The first segment of the output is most informative. The use of the log 
transformation results in the decomposition in terms of log earnings 
differences between the two groups. The decomposition shows the mean 
predicted log of yearly earnings for males and females, respectively, in the 
first two rows. The third row shows that the average differential is 0.5578. 
Of this differential, 0.3787 is accounted for by the differences in 
“endowments”, which refers to the mean differences in the regressors of the 
two groups, and a further 0.3172 is accounted for by the differences in the 
regression coefficients. The sum exceeds 0.5578; the difference of -0.1381 
is attributed to an interaction effect, the combined effect of the previous two 
effects. The components of the decomposition are quite precisely estimated, 
and both endowments and returns to the endowments matter. 
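
For comparison with the formulas of section 4.6.1, a twofold decomposition can also be built by hand from fitted values rather than matrix operations. The following do-file sketch computes the first decomposition, with women's characteristics evaluated at men's coefficients as the counterfactual. The variable and scalar names nmiss, xbM, xbF, ybarM, ybarF, and ycf are hypothetical, and no standard errors are produced.

```stata
* Sketch: twofold decomposition (female minus male) from fitted values
* Assumes $xlist1 is defined as above; all new names are hypothetical
egen nmiss = rowmiss(logyearn $xlist1)

quietly regress logyearn $xlist1 if female==0 & nmiss==0
predict double xbM if nmiss==0        // fitted values using male coefficients
quietly regress logyearn $xlist1 if female==1 & nmiss==0
predict double xbF if nmiss==0        // fitted values using female coefficients

quietly summarize xbM if female==0
scalar ybarM = r(mean)                // mean of logyearn for men
quietly summarize xbF if female==1
scalar ybarF = r(mean)                // mean of logyearn for women
quietly summarize xbM if female==1
scalar ycf = r(mean)                  // women's characteristics, men's coefficients

display "Differential (F - M): " ybarF - ybarM
display "Explained:            " ycf - ybarM
display "Unexplained:          " ybarF - ycf
```

Note the sign convention: the oaxaca output above reports group 1 (male) minus group 2 (female), so the total differential from this sketch should equal the reported difference with the sign reversed, up to rounding.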


The eform option of the oaxaca command executes a retransformation 
and allows the user to display the differential in natural units (Australian 
dollars). 


The eform option exponentiates all the values in the preceding table. For 
example, exp(0.5578) = 1.7468. This provides a good guide, but note that it 
does not control for the retransformation bias studied in section 4.2.3. From 
output not reproduced here, male earnings are overall 1.75 times female 
earnings (or 75% higher), with a multiple 1.46 attributed to gender 
differences in the levels of individual characteristics, such as hours worked, 
and a multiple 1.37 attributed to different returns to these individual 
characteristics. 
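
As a sketch of how this output could be requested and checked, the following do-file lines rerun the decomposition with the eform option and reproduce the arithmetic from the log-scale estimates reported earlier. After exponentiation the components combine multiplicatively rather than additively.

```stata
* Sketch: exponentiated decomposition and the arithmetic behind it
oaxaca logyearn $xlist1, by(female) vce(robust) eform

* overall ratio of male to female earnings implied by the log differential
display exp(0.5578)

* product of the endowments, coefficients, and interaction components
display exp(0.3787)*exp(0.3172)*exp(-0.1381)
```

The two displayed values should agree because 0.3787 + 0.3172 - 0.1381 = 0.5578.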


4.7 Shapley decomposition of relative regressor importance 


When using linear regression analysis, an investigator may be interested in 
considering the relative importance of the regressors. Given a list of 
regressors, say, $x_1, \dots, x_K$, is there a systematic procedure for 
quantifying their relative contributions to the goodness of fit, measured in 
terms of $R^2$ or residual variance of the regression? 


As a specific example, suppose the variable of interest is a measure of 
income inequality. Then the decomposition aims to establish the percentage 
contribution of regressors to the reduction in unexplained variance of the 
inequality measure. Neither significance tests nor stepwise regression 
provides a solution. 


This decomposition problem is analogous to the “Shapley value 
decomposition” in cooperative game theory. In a case in which multiple 
inputs combine to produce some output, and inputs could be employed in 
any combination of inclusion and exclusion, then how are the contributions 
of individual inputs to be ranked? 


When applied to a regression model with $K$ regressors, this 
decomposition generates a ranking of the regressors in terms of their 
contributions to the reduction in unexplained variance or to the increase in 
$R^2$. This decomposition aims to measure the marginal contribution of factor 
$x_k$, which is the product of the ME and the change in $x_k$, 
$\beta_k \Delta x_k$. The reduction in the unexplained variance, $\sigma^2$, 
may be expressed as $\sigma^2 - \sigma_k^2$, the difference between total 
unexplained variance and unexplained variance after adding $x_k$ to the 
regression. 


The marginal contribution of a variable depends upon the order in which 
the variable of interest is included in the regression. Thus, adding $x_k$ to 
a regression with only $x_1$ included will yield a different estimate of the 
marginal contribution to that obtained from the regression in which $x_k$ is 
added to $x_1$ and $x_2$. This issue is resolved by averaging the marginal 
contribution of a regressor over all $K!$ unique regressions where the 
regressions include those with some of the regressors omitted; see 
Shorrocks (2013). 
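
The averaging over orderings can be written compactly. As a sketch in 
standard cooperative-game notation (the symbols $N$, $S$, and $C_k$ are 
introduced here for illustration only), let $N = \{1, \ldots, K\}$ index the 
regressors and let $R^2(S)$ denote the $R^2$ from regressing the outcome on 
the subset $S$ of regressors, with $R^2(\emptyset) = 0$. The Shapley 
contribution of regressor $k$ is then 

```latex
C_k = \sum_{S \subseteq N \setminus \{k\}}
      \frac{|S|!\,(K - |S| - 1)!}{K!}
      \left[ R^2(S \cup \{k\}) - R^2(S) \right],
\qquad
\sum_{k=1}^{K} C_k = R^2(N).
```

The weight is the fraction of the $K!$ orderings in which exactly the 
regressors in $S$ enter before regressor $k$, which is why the contributions 
sum to the $R^2$ of the full model. 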


Implementation of the Shapley value calculation simply requires 
estimation of $K!$ linear regressions followed by averaging of the estimated 
marginal impacts. This yields an estimate of expected marginal impact when 
the expectation is over all possible combinations of regressors. The result is 
expressed as a percentage increase in the regression $R^2$, or as a percentage 
decrease in residual variance, due to a particular factor. 
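
For intuition, the calculation can be done by hand when there are only two 
regressors. The following is a minimal sketch, not the method used by any 
particular command; it assumes Stata's shipped auto dataset, and the scalar 
names are hypothetical. Each regressor's contribution is the average of its 
$R^2$ gain when entered first and when entered last. 

```stata
* Sketch: Shapley decomposition of R-squared by hand for K = 2 regressors
* (auto.dta and all scalar names are illustrative assumptions)
sysuse auto, clear
quietly regress price mpg weight
scalar r2_full = e(r2)
quietly regress price mpg
scalar r2_mpg = e(r2)
quietly regress price weight
scalar r2_wgt = e(r2)
* Average the marginal gain over the two possible orderings
scalar sh_mpg = 0.5*(r2_mpg - 0) + 0.5*(r2_full - r2_wgt)
scalar sh_wgt = 0.5*(r2_wgt - 0) + 0.5*(r2_full - r2_mpg)
display "mpg: " sh_mpg "   weight: " sh_wgt "   full R2: " r2_full
display "mpg share (%): " 100*sh_mpg/r2_full
```

By construction the two contributions add up to the full-model $R^2$, so the 
shares sum to 100%. 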


A modified version of the Shapley decomposition is the Shapley-Owen 
decomposition, which allows one to group subsets of regressors and measure 
the group contribution to the improvement in unexplained variance. For 
example, if a variable enters as a quadratic, then both the variable and its 
square constitute one group. 


4.7.1 Empirical example 


We continue with the same model as in the previous section and present the 
decomposition for the regression for men. Because the joint contribution of 
the experience variables expr and exprsq is of more interest than their 
separate contributions, we group these two variables. Similarly, the variables 
fellow and pgradoth are grouped. 


We use the community-contributed rego command (Huettner and 
Sunder 2012). 


. * Shapley-Owen decomposition of regressors' contributions to R-squared 
. qui rego logyearn yhrs expr exprsq pracsize fellow pgradoth 
> if female==0, vce(robust) 


. rego logyearn yhrs \ expr exprsq \ pracsize \ fellow pgradoth \ 
> if female==0, vce(robust) 


Gr  Regressor         Coef.         Std.Err.   P>|t|   Std.Coef.   Group %R2 
 1  yhrs           .0003912 ***    .0000184   0.000     0.4412      78.5485 
 2  expr           .0172065 ***    .0036226   0.000     0.2943      16.8196 
    exprsq           -.0004 ***    .0000721   0.000    -0.3742 
 3  pracsize       .0451038 ***    .0090868   0.000     0.0896       2.2844 
 4  fellow         .0063228        .0221348   0.775     0.0049       2.3476 
    pgradoth       .0361601 ***    .0121119   0.003     0.0447 
    Intercept      10.97202        .0643553   0.000 

Observations           2907 
Overall R2          0.25065 
Root MSE           .5500918 
F-stat. Model      108.5585 ***               0.000 
Log Likelihood    -2383.923 


Output is suppressed for the first command, which would otherwise report the 
Shapley decomposition with every regressor treated as distinct. The second 
command groups some of the variables, and the last column of the table shows, 
for example, that the two experience variables jointly account for 16.8% of 
the fit. 


By far the most important variable is annual hours worked (yhrs), which 
accounts for 78.55% of the explained variation. 


4.8 Difference-in-differences estimators 


A popular application of linear regression is the estimation of treatment 
effects using observational data or data from natural experiments. The data 
setup is one in which observations are available on treated and untreated 
groups before and after the application of treatment. A treatment variable 
undergoes a change due to change in policy or a natural experiment—a 
change that can be treated as an exogenous variation in a treatment variable. 


To estimate the effect of the change, one can compare the outcome for 
the treated group with that for the untreated group. The treatment indicator is 
a single binary regressor that equals 1 if treatment occurs and equals 0 if 
treatment does not occur. 


An important complication is that the outcome variable of interest may 
change even in absence of the treatment. That is, for both groups some 
change in the outcome is expected to take place over time even in the 
absence of treatment. Any attempt to estimate the treatment effect must 
control for these factors because otherwise the estimated treatment effects 
will be contaminated by other factors. The difference-in-differences 
estimator is one way to eliminate the effects of these other factors provided 
assumptions given below are satisfied; chapters 24 and 25 present other 
methods for evaluating the effect of a treatment. 


We consider the before/after framework in which data are available on 
the treated and the comparison (control) groups both before (subscript $b$) 
and after (subscript $a$) the experiment. Let variable $D$ take value 1 if 
treated and take value 0 otherwise. Then for the $i$th treated case, the 
change in the outcome over time is measured by 
$(y_{ia} - y_{ib} \mid D_{ia} = 1)$, and the change in the outcome for the 
untreated group is measured by $(y_{ia} - y_{ib} \mid D_{ia} = 0)$. Then the 
difference-in-differences measure 
$(y_{ia} - y_{ib} \mid D_{ia} = 1) - (y_{ia} - y_{ib} \mid D_{ia} = 0)$ forms 
the basis of an estimate of the treatment effect. 


The validity of this method relies on certain assumptions. We suppose 
that earnings $y_{it}$ for each individual are determined by the sum of an 
individual-specific effect $\phi_i$, a time-specific term $\delta_t$, and an 
amount $\alpha$ for those who receive treatment, where $D_{it} = 1$ if 
treated. So 


$$y_{it} = \phi_i + \delta_t + \alpha D_{it} + \varepsilon_{it}, \quad t = a, b$$ 


where $\delta_t$ is a common “trend” effect assumed to apply to both groups. 


Using the “before” and “after” formulation, and suppressing the model 
errors, we see $y_{ia} - y_{ib} = \alpha + (\delta_a - \delta_b)$ if 
$D_{ia} = 1$ and $y_{ia} - y_{ib} = (\delta_a - \delta_b)$ if $D_{ia} = 0$. 
It follows that the treatment effect is given by 


$$E(y_{ia} - y_{ib} \mid D_{ia} = 1) - E(y_{ia} - y_{ib} \mid D_{ia} = 0) = \alpha$$ 


Note that the trend over time, $\delta_a - \delta_b$, is assumed to be the 
same for the treated and untreated groups, a crucial assumption called the 
parallel trends assumption. 


Replacing the expectations operator by the sample average, we see $\alpha$ is 
simply the difference between average change in the outcome of the treated 
group and of the untreated group, that is, 
$\widehat{\alpha} = \Delta\bar{y}^{tr} - \Delta\bar{y}^{nt}$. Any 
time-invariant factors, the potential confounders, are eliminated when we 
take a difference between the before and after outcomes. The assumption that 
the treatment effect and the confounding factors enter additively and 
separably makes it easier to eliminate the latter. 
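
The cancellation can be made explicit. The following sketch writes out the 
first difference for the additive model with individual effect $\phi_i$, 
time effect $\delta_t$, treatment effect $\alpha$, and error 
$\varepsilon_{it}$, under the additional assumption (implicit in the 
before/after setup) that nobody is treated in the before period, so 
$D_{ib} = 0$ for all $i$: 

```latex
\begin{align*}
y_{ia} - y_{ib}
  &= (\phi_i + \delta_a + \alpha D_{ia} + \varepsilon_{ia})
   - (\phi_i + \delta_b + \alpha D_{ib} + \varepsilon_{ib}) \\
  &= (\delta_a - \delta_b) + \alpha D_{ia}
   + (\varepsilon_{ia} - \varepsilon_{ib}).
\end{align*}
```

The individual effect $\phi_i$ drops out for every individual, treated or 
not. If the differenced error has mean zero given treatment status, then 
comparing the mean change of the treated with that of the untreated removes 
the common trend $\delta_a - \delta_b$ and leaves $\alpha$. 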


In this section, we consider the simplest case of treatment at the 
individual level and two time periods (before and after). Extensions to richer 
settings are presented in section 25.6.1. 


4.8.1 Example: Estimating the effect of training on earnings 


As an illustration, we revisit the data from the National Supported Work 
(NSW) demonstration project, conducted in the 1970s. The target parameter, 
the impact of training on earnings, can be estimated by a randomized 
experiment that assigns some individuals to receive training (a treatment 
group) and others to receive no training (a control group). The effect of 
training could then be measured by direct comparison of sample means of 
posttreatment earnings for the treatment and control (reference) groups. 


Those who receive treatment may differ from those who do not in terms 
of traits that are observable. Moreover, such observable factors may also 
directly affect the average outcome. Hence, a comparison of the treated with 
the nontreated must control for differences in observed characteristics and 
possibly in unobserved characteristics. 


We use data from Dehejia and Wahba (1999) to illustrate the DD method. 
These data are a subset of data used in LaLonde (1986), who studied the 
outcomes of the NSW treated group and compared them with those of 
synthetic control groups drawn from two national surveys. 


The treated sample is one of 185 males who received training during 
1976-1977. The control group is one of 2,490 male household heads under 
the age of 55 who are not retired, drawn from the Panel Survey of Income 
Dynamics. Dehejia and Wahba (1999, 2002) call these two samples the re74 
subsample (of the NSW treated) and the Panel Survey of Income Dynamics 
sample (of nontreated). 


The treatment indicator variable D is defined as D = 1 if training was 
received (so the observation is in the treated sample) and D = 0 if no 
training was received (and the observation is in the control sample). The key 
variables are 


. qui use mus204nswpsid, clear 

. * DID: Outcome, treatment, and other variables 
. describe re78 re75 treat age educ nodegree black hisp marr u74 u75 re74 
>     agesq educsq re74sq re75sq u74black u74hisp 

Variable      Storage   Display    Value 
    name         type    format    label      Variable label 

re78            float   %9.0g                 Real annual earnings in 1978 
                                                (posttreatment) 
re75            float   %9.0g                 Real annual earnings in 1975 
                                                (pretreatment) 
treat           float   %9.0g                 Treated 
age             float   %9.0g                 Age in years 
educ            float   %9.0g                 Years of education 
nodegree        float   %9.0g                 Years of education > 12 
black           float   %9.0g                 Black 
hisp            float   %9.0g                 Hispanic 
marr            float   %9.0g                 Married 
u74             float   %9.0g                 Unemployed in 1974 
u75             float   %9.0g                 Unemployed in 1975 
re74            float   %9.0g                 Real annual earnings in 1974 
                                                (pretreatment) 
agesq           float   %9.0g                 Age in years squared 
educsq          float   %9.0g                 Education in years squared 
re74sq          float   %9.0g                 Real annual earnings in 1974 squared 
re75sq          float   %9.0g                 Real annual earnings in 1975 squared 
u74black        float   %9.0g                 Unemployed and black 
u74hisp         float   %9.0g                 Unemployed and hispanic 


The outcome of interest is posttreatment earnings, re78, the treatment 
variable is treat, and pretreatment earnings are re75. 


The table command provides descriptive statistics sorted by treatment 
status. 


. * Descriptive statistics for treatment and control samples 
. table () (treat), stat(mean re78 re75 treat age educ nodegree black hisp 
>     marr u74 u75 re74 agesq educsq re74sq re75sq u74black u74hisp) 
>     stat(count re75) nototals 

                                                         Treated 
                                                         0           1 
Mean 
  Real annual earnings in 1978 (posttreatment)    21553.92    6349.145 
  Real annual earnings in 1975 (pretreatment)     19063.34    1532.056 
  Treated                                                0           1 
  Age in years                                     34.8506    25.81622 
  Years of education                              12.11687    10.34595 
  Years of education > 12                         .3052209    .7081081 
  Black                                           .2506024    .8432432 
  Hispanic                                        .0325301    .0594595 
  Married                                         .8662651    .1891892 
  Unemployed in 1974                              .0863454    .7081081 
  Unemployed in 1975                                    .1          .6 
  Real annual earnings in 1974 (pretreatment)     19428.75    2095.574 
  Age in years squared                             1323.53    717.3946 
  Education in years squared                      156.3161    111.0595 
  Real annual earnings in 1974 squared            5.57e+08    2.81e+07 
  Real annual earnings in 1975 squared            5.48e+08    1.27e+07 
  Unemployed and black                            .0144578          .6 
  Unemployed and hispanic                         .0036145    .0324324 
Number of nonmissing values 
  Real annual earnings in 1975 (pretreatment)        2,490         185 


The treated group, the second group listed, differs considerably from the 
control group, being disproportionately black (84%), without a high 
school degree (71%), and unemployed in the pretreatment year 1974 (71%). 
Estimates of the effect of training should control for these differences. 


4.8.2 Before-after comparison 


One approach is to simply compute the change over time in the earnings for 
those receiving the treatment of training. This estimate is the mean 
difference in re78 and re75 for those with treat=1. From previous output, 
this equals $6349 - 1532 = 4817$. 


We have 


. * Before--after comparison for the treated 
. generate badiff = re78 - re75 

. mean badiff if treat==1 

Mean estimation                            Number of obs = 185 

                     Mean   Std. err.     [95% conf. interval] 
      badiff      4817.09   608.4203      3616.713    6017.467 


There is a highly statistically significant effect of treatment. 


The limitation of this approach is that earnings in general may also have 
increased over these three years for the nontreated. 


4.8.3 Treatment-control comparison 


The outcome of interest is posttreatment earnings, re78, which were on 
average 6,349 in the treatment group and 21,554 in the control group. A 
simple estimate of the treatment effect is to compare average posttreatment 
earnings of treatment and control, leading to an estimate of 
$6349 - 21554 = -15205$. 


This estimate can be computed using the difference-in-means-test 
command ttest re78, by(treat) unequal. Equivalently, it can be 
computed as the coefficient of the treatment indicator treat in OLS 
regression of re78 on an intercept and treat. This is called a 
treatment-control comparison estimator because it mimics the analysis in an 
experimental setting. We have 


. * Treatment-control comparison (difference in means) 
. regress re78 treat, vce(robust) 

Linear regression                               Number of obs     =      2,675 
                                                F(1, 2673)        =     537.36 
                                                Prob > F          =     0.0000 
                                                R-squared         =     0.0609 
                                                Root MSE          =      15152 

                              Robust 
        re78   Coefficient  std. err.       t    P>|t|    [95% conf. interval] 

       treat    -15204.78   655.9143   -23.18   0.000    -16490.93   -13918.63 
       _cons     21553.92    311.785    69.13   0.000     20942.56    22165.29 


The estimated treatment effect of $-15204.78$ seems large and has a perverse 
sign. We next consider the DID estimator of the treatment effect. 
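
The equivalence between the two-sample test and the OLS coefficient can be 
checked directly. The following is a minimal sketch that assumes 
mus204nswpsid is in memory in its original wide form; the scalar names are 
hypothetical. It uses the stored group means from ttest rather than the 
displayed difference, because ttest reports the mean of the first group 
(treat equal to 0) minus the mean of the second, which has the opposite sign 
to the regression coefficient. 

```stata
* Sketch: difference in means via ttest and via OLS
* (assumes mus204nswpsid in wide form; scalar names are illustrative)
quietly ttest re78, by(treat) unequal
scalar diff_ttest = r(mu_2) - r(mu_1)    // treated mean minus control mean
quietly regress re78 treat, vce(robust)
scalar diff_ols = _b[treat]
display "ttest: " diff_ttest "   OLS: " diff_ols
```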


4.8.4 Difference-in-differences estimate 


The DID estimator of the average treatment effect on the treated (ATET) uses 
the difference across treatment and control groups of the change over time in 
average earnings. Average earnings in the treatment group grew by 
$6349 - 1532 = 4817$. Average earnings in the control group grew by 
$21554 - 19063 = 2491$. The DID estimate is then $4817 - 2491 = 2326$, a 
much more plausible estimate than using the simple comparison of 
posttreatment earnings. 


Table 4.1 gives these computations. It includes an alternative ordering of 
the computation, first computing the difference in earnings between treated 
and control groups in each year and then computing the difference over 
years. The same result is obtained. 


Table 4.1. DID computation example 


                      Treated    Not treated    Diff over treatment 
Post (year = 2)         6,349         21,554                -15,205 
Pre (year = 1)          1,532         19,063                -17,531 
Change over time        4,817          2,491    DID = 4817 - 2491 = 2326 
                                             or DID = -15205 - (-17531) = 2326 
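
The computation in table 4.1 can also be obtained in one step from the 
wide-form data by regressing the individual change in earnings on the 
treatment indicator. This is a minimal sketch, assuming mus204nswpsid is 
still in wide form; dearn is a hypothetical variable name. The slope is the 
difference in mean changes between the treated and the untreated, so it 
should reproduce the DID estimate of about 2326, and the intercept is the 
mean change for the untreated. 

```stata
* Sketch: DID as a difference in mean changes (first-difference regression)
* (assumes wide-form data; dearn is an illustrative variable name)
generate dearn = re78 - re75
regress dearn treat, vce(robust)
display "DID estimate = " _b[treat]
```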


The DID estimator can be shown to be equivalent to the estimate of $\alpha$ 
in the OLS regression 


$$re_{it} = \phi + \delta \times dpost_t + \gamma \times dtreat_i + \alpha \times (dpost_t \times dtreat_i) + u_{it}, \quad i = 1, \ldots, 2675, \quad t = 75, 78$$ 


Here we have two periods of data, so there are now $2 \times 2675 = 5350$ 
earnings observations. $re_{it}$ denotes earnings in the two periods, so it 
equals $re_{i,75}$ in the pretreatment period and $re_{i,78}$ in the 
posttreatment period. The indicator variable $dpost_t$ equals one in the 
posttreatment period (1978) and equals zero in the pretreatment period 
(1975). The indicator variable $dtreat_i$ equals one if the individual is in 
the treated sample and equals zero for the untreated. The interaction term 
$dpost_t \times dtreat_i$ equals one for treated individuals in the 
posttreatment period; otherwise, it equals zero. 


The advantage of running this regression is that it provides a standard 
error for the treatment effect, enabling statistical inference. Furthermore, 
by using the vce(cluster clustvar) option, we can control for clustering due 
to the regression having two observations per individual. 


To implement this regression, we need to convert the current dataset, 
which is in wide form, with a single observation having data for both years, 
to long form, with separate observations for each year. This is done using the 
reshape command (see section 8.10.2) as follows. 


. * DID - first stack into panel format 
. generate id = _n 

. generate earns1 = re75 

. generate earns2 = re78 

. qui reshape long earns, i(id) j(year) 

. generate dpost = 0 

. replace dpost = 1 if year==2 
(2,675 real changes made) 

. rename treat dtreat 

. qui generate dpostbydtreat = dpost*dtreat 

. summarize id year dtreat earns dpost dpostbydtreat age, sep(0) 

    Variable         Obs        Mean    Std. dev.       Min        Max 

          id       5,350        1338    772.2781          1       2675 
        year       5,350         1.5    .5000467          1          2 
      dtreat       5,350    .0691589    .2537478          0          1 
       earns       5,350    19176.63    14839.18          0     156653 
       dpost       5,350          .5    .5000467          0          1 
dpostbydtr~t       5,350    .0345794    .1827292          0          1 
         age       5,350    34.22579    10.49886         17         55 


There are twice as many observations; half are in the postperiod (variable 
dpost has mean 0.5); $185/2675 = 0.06916$ are treated; and 
$0.5 \times 0.06916 = 0.03458$ are both treated and in the postperiod. 


We then perform the DID regression: 


. * DID estimation - no controls 
. regress earns dpost dtreat dpostbydtreat, vce(cluster id) 

Linear regression                               Number of obs     =      5,350 
                                                F(3, 2674)        =     913.63 
                                                Prob > F          =     0.0000 
                                                R-squared         =     0.0867 
                                                Root MSE          =      14185 

                                  (Std. err. adjusted for 2,675 clusters in id) 

                              Robust 
        earns   Coefficient  std. err.       t    P>|t|    [95% conf. interval] 

        dpost      2490.585   217.2256    11.47   0.000     2064.637    2916.532 
       dtreat     -17531.28   360.6329   -48.61   0.000    -18238.43   -16824.13 
dpostbydtreat      2326.505   644.7524     3.61   0.000     1062.242    3590.769 
        _cons      19063.34   272.5572    69.94   0.000     18528.89    19597.78 


This produces an estimated treatment effect (the coefficient of 
dpostbydtreat) of 2326.51. The treatment effect on the treated is highly 
statistically significant, with $p = 0.000$. 
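
The same regression can be specified with factor-variable notation, which 
avoids creating the interaction by hand. This is a minimal sketch that 
assumes the long-form data with dpost, dtreat, and id as created above; the 
coefficient on the interaction term 1.dpost#1.dtreat should match the 
coefficient on dpostbydtreat. 

```stata
* Sketch: DID regression using factor-variable notation
* (assumes long-form data with dpost, dtreat, and id already defined)
regress earns i.dpost##i.dtreat, vce(cluster id)
* Report the DID estimate (the interaction coefficient) on its own
lincom 1.dpost#1.dtreat
```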


The regression formulation of DID allows the addition of control 
variables, such as age, to the regression. Thus, the intercept $\phi$ is 
replaced by $\mathbf{x}_i'\boldsymbol{\beta}$. We have 


. * DID estimation - with time-invariant controls 
. regress earns dpost dtreat dpostbydtreat age educ nodegree black hisp marr 
> u74 u75 re74 agesq educsq re74sq re75sq u74black u74hisp, vce(cluster id) 


Linear regression                               Number of obs     =      5,350 
                                                F(18, 2674)       =     904.53 
                                                Prob > F          =     0.0000 
                                                R-squared         =     0.7250 
                                                Root MSE          =     7795.2 

                                  (Std. err. adjusted for 2,675 clusters in id) 

                              Robust 
        earns   Coefficient  std. err.       t    P>|t|    [95% conf. interval] 

        dpost      2490.585    217.531    11.45   0.000     2064.039     2917.13 
       dtreat     -2458.445   490.5804    -5.01   0.000      -3420.4   -1496.489 
dpostbydtreat      2326.505   645.6588     3.60   0.000     1060.464    3592.546 
          age      31.63905   89.52798     0.35   0.724     -143.912    207.1901 
         educ     -411.9222   195.8797    -2.10   0.036    -796.0133    -27.8311 
     nodegree      209.4768   372.9206     0.56   0.574     -521.765    940.7186 
        black     -493.3401   268.5493    -1.84   0.066    -1019.925    33.24521 
         hisp     -15.52322   703.3069    -0.02   0.982    -1394.604    1363.557 
         marr      657.2537     302.36     2.17   0.030      64.3705    1250.137 
          u74      7273.687   1271.792     5.72   0.000     4779.893    9767.482 
          u75     -7980.635   809.0581    -9.86   0.000    -9567.078   -6394.192 
         re74      .8554493   .0853382    10.02   0.000     .6881138    1.022785 
        agesq     -.9252285   1.240772    -0.75   0.456    -3.358197     1.50774 
       educsq      36.02474   9.589573     3.76   0.000     17.22101    54.82846 
       re74sq     -9.28e-06   1.42e-06    -6.5    0.000    -.0000121   -6.50e-06 
       re75sq      9.37e-06   9.54e-07     9.8    0.000     7.49e-06    .0000112 
     u74black      1055.096   747.4911     1.41   0.158    -410.6231    2520.815 
      u74hisp     -273.6814   1479.979    -0.18   0.853      -3175.7    2628.337 
        _cons      1606.092   1764.901     0.91   0.363    -1854.617    5066.801 


In this illustrative example, there is no change in the coefficients of 
dpostbydtreat and dpost because the additional control variables are all 
time invariant so that $\mathbf{x}_{it} = \mathbf{x}_i$. But the model fit is 
much better, with $R^2$ increasing from 0.087 to 0.725. For similar reasons, 
there is little change in the standard error for dpostbydtreat. 
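
Why these two coefficients are exactly unchanged can be sketched with the 
Frisch-Waugh-Lovell theorem, assuming a balanced two-period panel as in this 
example. The notation $z_i$ for an arbitrary time-invariant regressor is 
introduced here for illustration only. Decompose the two time-varying 
regressors as 

```latex
\begin{align*}
dpost_t &= \tfrac{1}{2} + \left(dpost_t - \tfrac{1}{2}\right), \\
dpost_t \times dtreat_i &= \tfrac{1}{2}\, dtreat_i
   + dtreat_i \left(dpost_t - \tfrac{1}{2}\right), \\
\sum_{i} \sum_{t} z_i \left(dpost_t - \tfrac{1}{2}\right)
   &= \sum_{i} z_i \left(-\tfrac{1}{2} + \tfrac{1}{2}\right) = 0 .
\end{align*}
```

The last line holds for any time-invariant $z_i$, including the constant, 
$dtreat_i$, and products such as $z_i \times dtreat_i$. So after partialling 
out the time-invariant regressors, the residuals of the two time-varying 
regressors are $dpost_t - 1/2$ and $dtreat_i (dpost_t - 1/2)$ whether or not 
further time-invariant controls are added, and their coefficients cannot 
change. The coefficient on dtreat, by contrast, does change. 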


Stata 17 introduced the didregress command for repeated cross- 
sections and the xtdidregress command for panel data. These commands, 
which can accommodate richer DID settings, are presented in section 25.6. 


For the current panel-data example we apply xtdidregress and obtain 


. * xtdidregress command for DID - no controls 
. xtset id year 

Panel variable: id (strongly balanced) 
 Time variable: year, 1 to 2 
         Delta: 1 unit 

. xtdidregress (earns) (dpostbydtreat), group(id) time(year) 

Number of groups and treatment time 

Time variable: year 
Control:       dpostbydtreat = 0 
Treatment:     dpostbydtreat = 1 

                 Control   Treatment 
Group 
          id        2490         185 
Time 
     Minimum           1           2 
     Maximum           1           2 

Difference-in-differences regression                    Number of obs = 5,350 
Data type: Longitudinal 

                                (Std. err. adjusted for 2,675 clusters in id) 

                              Robust 
        earns   Coefficient  std. err.       t    P>|t|   [95% conf. interval] 

ATET 
dpostbydtreat 
    (1 vs 0)       2326.505   644.6921     3.61   0.000     1062.36    3590.651 

Note: ATET estimate adjusted for panel effects and time effects. 


Note: ATET estimate adjusted for panel effects and time effects. 


The variables provided in the command are, in order, 1) the outcome 
(and any control variables); 2) a treatment indicator equal to 1 if treated 
(here for treated individuals in the posttreatment period) and 0 otherwise; 
3) the group level at which the treatment is received (here the individual); 
and 4) the time variable. 


We obtain the same estimate of 2326.505. The default of the 
xtdidregress command is to obtain standard errors that cluster on the group 
variable, here id. The standard error of 644.6921 differs slightly from 
644.7524, obtained earlier using the regress command, because of slightly 
different degrees-of-freedom correction. 


The output also includes a table that gives the minimum and maximum 
of the first time that control individuals are observed (here time 1 for all 
2,490 controls) and the minimum and maximum of the first time that 
treatment occurs (here time 2 for all 185 treated). 


Further details on DID methods are presented in section 25.6. The 
didregress command adds group-specific effects, and the 
xtdidregress command adds individual-specific effects. The initial OLS 
regression of the current example excluded individual-specific effects yet led 
to the same estimate of the ATET because in the special case of a two-period 
binary treatment with no control variables, the individual-specific effects can 
be shown to be absorbed by inclusion of dpost and dtreat. The 
nogteffects option drops both group and time effects, so if this option is 
used, one needs to add in time effects as a control. The following yields an 
ATET of 2326.505. 


. * xtdidregress command for difference in differences - no controls 
. xtdidregress (earns year) (dpostbydtreat), group(id) time(year) nogteffects 


(output omitted ) 


An important complication considered in section 25.6 is that treatment is 
often at a grouped level rather than the individual level, as was the case in 
the current application. For example, suppose some villages received 
treatment and some did not. In that case, even if individual-level data are 
available, standard errors of estimates need to be computed by clustering at 
the village level. 


4.9 Additional resources 


The margins command is very powerful, and many more examples are 
provided in the Stata documentation. The main linear regression commands 
regress and regress postestimation are covered in [R]. The regression 
decomposition commands oaxaca and rego are community contributed. 
The policy prediction example in section 4.2.4 and the DID example in 
section 4.8 are examples of treatment-effects estimators that are covered in 
detail in chapters 24 and 25. 


4.10 Exercises 


1. Fit the model in section 4.2 on levels, except use all observations 
rather than those with just positive expenditures, and report robust 
standard errors. Predict medical expenditures. Use correlate to obtain 
the correlation coefficient between the actual and fitted value, and 
show that, upon squaring, this equals $R^2$. Show that for the linear 
model margins with the dydx(*) option reproduces the OLS 
coefficients. Now, use margins with an appropriate option to obtain 
the average income elasticity of medical expenditures. 

2. Consider the example of section 4.2.4 and the third method that 
computes predictions at suppins==1 for all observations and at 
suppins==0 and then compares the average predictions. The text did 
this following OLS regression in levels and obtained an estimated 
treatment effect of $725. Repeat this example but with predictions 
from the log-linear model that correct for retransformation bias using 
Duan’s method. You should obtain an estimate of $2,048. 

3. Fit the model in section 4.2 on levels, using the first 2,000 
observations. Use these estimates to predict medical expenditures for 
the remaining 1,064 observations, and compare these with the actual 
values. Note that the model predicts very poorly in part because the 
data were ordered by totexp. 

4. Reconsider the out-of-sample prediction analysis of section 4.3. 
Modify the DGP such that the regression intercept for the last 10 
observations is 5, whereas it is fixed at 1 for the first 90 observations. 
Generate 10 out-of-sample predictions, and compare them with the 
case in which no such intercept shift occurs. 

5. Repeat the analysis of the previous question but under the assumption 
that the intercept remains unchanged in the prediction period but the 
slope parameter increases or decreases by 20%. Compare the results 
with the no-change scenario. 

6. Refer to the empirical example of section 4.3. Define two 
group-specific indicator variables, d1 for the female group and d2 for the 
male group. Sort the sample observations by group; females are in 
group 1. Using the full sample, run the earnings regression given in the 
section; include d1 and d2 group indicators, but exclude the intercept 
term. Test the hypothesis that the coefficients of d1 and d2 are equal. 

7. This is a continuation of the previous question. Replace regressors in 
the regression by new ones generated by multiplying the original 
regressors by d1 and d2. The new regressors are group specific. 
Regress the earnings variable on the new generated group-specific 
regressors. Use the test command to test for the equality of the two 
sets of group-specific coefficients. 


Chapter 5 
Simulation 


5.1 Introduction 


Simulation by Monte Carlo experimentation is a useful and powerful 
methodology for investigating the properties of econometric estimators and 
tests. The power of the methodology derives from being able to define and 
control the statistical environment in which the investigator specifies the 
data-generating process (DGP) and generates data used in subsequent 
experiments. 


Monte Carlo experiments can be used to verify that valid methods of 
statistical inference are being used. An obvious example is checking a new 
computer program or algorithm. Another example is investigating the 
robustness of an established estimation or test procedure to deviations from 
settings where the properties of the procedure are known. 


Even when valid methods are used, they often rely on asymptotic 
results. We may want to check whether these provide a good approximation 
in samples of the size typically available to the investigators. Also, 
asymptotically equivalent procedures may have different properties in finite 
samples. Monte Carlo experiments enable finite-sample comparisons. 


This chapter deals with the basic elements common to Monte Carlo 
experiments: computer generation of random numbers that mimic the 
theoretical properties of realizations of random variables; commands for 
repeated execution of a set of instructions; and machinery for saving, 
storing, and processing the simulation output generated in an experiment to 
obtain the summary measures that are used to evaluate the properties of the 
procedures under study. We provide a series of examples to illustrate 
various aspects of Monte Carlo analyses. Further examples are given 
throughout the book. 


The chapter appears early in the book. Simulation is a powerful 
pedagogic tool for exposition and illustration of statistical concepts. At the 
simplest level, we can use pseudorandom samples to illustrate distributional 
features of artificial data. The goal of this chapter is to use simulation to 
study the distributional and moment properties of statistics in certain 
idealized statistical environments. Another possible use of the Monte Carlo 
methodology is to check the correctness of computer code. Many applied 
studies use methods complex enough that it is easy to make mistakes. 
Often, these mistakes could be detected by an appropriate simulation 
exercise. We believe that simulation is greatly underutilized, even though 
Monte Carlo experimentation is relatively straightforward in Stata. 


5.2 Pseudorandom-number generators 


Suppose we want to use simulation to study the properties of the ordinary 
least-squares (OLS) estimator in the linear regression model with normal 
errors. Then, at the minimum, we need to make draws from a specified 
normal distribution. The literature on (pseudo)-random-number generation 
contains many methods of generating such sequences of numbers. When we 
use packaged functions, we usually do not need to know the details of the 
method. Yet the match between the theoretical and the sample properties of 
the draws does depend upon such details. 


Stata has a suite of fast and easy-to-use random-number functions 
(generators). The suite includes the uniform, normal, binomial, gamma, and 
Poisson functions that we will use in this chapter, as well as several others 
that we do not use. The functions for generating pseudorandom numbers are 
summarized in help random number functions. 


Underlying generation of any random variable is a uniform random- 
number generator. In a very major change, Stata 14 introduced a new 
uniform random-number generator that will lead to different random 
numbers being generated. 


Thus, any results obtained using versions of Stata before version 14, and 
that used random numbers in any way, will change, unless appropriate 
commands are added to ensure random numbers are based on the original 
random-number generator. 


5.2.1 Uniform random-number generation 


The term “random-number generation” is an oxymoron. It is more accurate 
to use the term “pseudorandom numbers”. Pseudorandom-number generators 
use deterministic devices to produce long chains of numbers that mimic the 
realizations from some target distribution. For uniform random numbers, the 
target distribution is the uniform distribution from 0 to 1, for which any 
value between 0 and 1 is equally likely. Given such a sequence, methods 
exist for mapping these into sequences of nonuniform draws from desired 
distributions such as the normal. 


A standard simple generator for uniform draws uses the deterministic 
rule $X_j = (kX_{j-1} + c) \bmod m$, $j = 1, \ldots, J$, where the modulus 
operator $a \bmod b$ forms the remainder when $a$ is divided by $b$, to 
produce a sequence of $J$ integers between 0 and $m - 1$. Then 
$R_j = X_j/m$ is a sequence of $J$ numbers between 0 and 1. If computation 
is done using 32-bit integer arithmetic, then $m = 2^{31} - 1$, and the 
maximum periodicity is $2^{31} - 1 \simeq 2.1 \times 10^{9}$, but it is easy 
to select poor values of $k$, $c$, and $X_0$ so that the cycle repeats much 
more often than that. For computation using 64-bit integer arithmetic, the 
maximum periodicity is $2^{64} - 1 \simeq 1.8 \times 10^{19}$. 
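
The rule is simple enough to code directly. The following is a minimal 
sketch for illustration only; it is not the generator that Stata uses. The 
parameter values ($k = 16807$, $c = 0$, $m = 2^{31} - 1$) and the seed 12345 
are assumptions chosen so that every product stays small enough to be held 
exactly in a double. 

```stata
* Sketch: a linear congruential generator coded by hand
* (k, c = 0, m, and the seed 12345 are illustrative assumptions)
clear
set obs 1000
scalar m = 2^31 - 1
scalar k = 16807
* First value from the seed, then apply the recursion observation by observation
generate double X = mod(k*12345, m) in 1
replace X = mod(k*X[_n-1], m) in 2/L
* Scale the integers to the unit interval
generate double R = X/m
summarize R
```

Because replace works through the observations in order, X[_n-1] always 
refers to the value just computed, so the recursion is applied sequentially. 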


Before version 14, the Stata function runiform() used the 32-bit KISS 
generator. Version 14 replaced this with the 64-bit Mersenne Twister 
generator. Users of version 14 and later can continue to use the older KISS 
generator, if desired, by giving the command set rng kiss32. 


The initial value for the cycle, $X_0$, is called the seed. The default is to 
have this set by Stata. For reproducibility of results, however, it is best to 
actually set the initial seed by using the set seed command or using the 
seed() option in, for example, vce(bootstrap). Then, if the program is 
rerun at a later time or by a different researcher, the same results will be 
obtained; see exercise 1 in section 5.8. 
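
As a quick check of this point, the following minimal sketch resets the seed 
and confirms that the same draw is obtained; the scalar names are 
hypothetical. 

```stata
* Sketch: resetting the seed reproduces the same draw
* (u1 and u2 are illustrative scalar names)
set seed 10101
scalar u1 = runiform()
set seed 10101
scalar u2 = runiform()
display "u1 = " u1 "   u2 = " u2
assert u1 == u2
```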


To obtain and display one draw from the uniform, type 


. * Single draw of a uniform number 
. set seed 10101 


. scalar u = runiform() 


. display u 
.30422325 


This number is internally stored at much greater precision than the eight 
displayed digits. 


The following code obtains 1,000 draws from the uniform distribution 
and then provides some details on these draws: 


. * 1,000 draws of uniform numbers
. qui set obs 1000

. set seed 10101

. generate x = runiform()

. list x in 1/5, clean

              x
  1.   .3042232
  2.   .5540206
  3.   .2794988
  4.   .2006274
  5.   .1246266

. summarize x

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           x |      1,000    .5031233    .2922443   .0005628   .9996441


The 1,000 draws have a mean of 0.503 and a standard deviation of 0.292,
close to the theoretical values of 0.5 and $\sqrt{1/12} = 0.289$. A histogram, not
given, has 10 equal-width bins with heights that range from 0.8 to 1.2, close
to the theoretical equal heights of 1.0.


The draws should be serially uncorrelated, despite a deterministic rule 
being used to generate the draws. To verify this, we create a time-identifier 
variable, t, equal to the observation number (_n), and we use tsset to 
declare the data to be time series with time-identifier t. We could then use 
the corrgram, ac, and pac commands to test whether autocorrelations and 
partial autocorrelations are zero. We more simply use pwcorr to produce the 
first three autocorrelations, where L2.x is the x variable lagged twice and the 
star(0.05) option puts a star on correlations that are statistically
significantly different from 0 at level 0.05. 


. * First three autocorrelations for the uniform draws
. generate t = _n

. tsset t

Time variable: t, 1 to 1000
        Delta: 1 unit

. pwcorr x L.x L2.x L3.x, star(0.05)

             |        x      L.x     L2.x     L3.x
-------------+------------------------------------
           x |   1.0000
         L.x |   0.0107   1.0000
        L2.x |  -0.0004   0.0109   1.0000
        L3.x |  -0.0210   0.0000   0.0108   1.0000


The autocorrelations are low, and none are statistically different from 0 at the 
0.05 level. Uniform random-number generators used by packages such as 
Stata are, of course, subjected to much more stringent tests than these. 


5.2.2 Draws from normal 


For simulations of standard estimators such as OLS, nonlinear least squares, 
and instrumental variables (IV), all that is needed are draws from the uniform
and normal distributions because normal errors are a natural starting point 
and the most common choices of distribution for generated regressors are 
normal and uniform. 


The command for making draws from the standard normal has the 
following simple syntax: 


generate varname = rnormal()


To make draws from $N(m, s^2)$, type the corresponding command


generate varname = rnormal(m,s)


Note that s > 0 is the standard deviation. The arguments m and s can be 
numbers or variables. 


Draws from the standard normal distribution also can be obtained as a 
transformation of draws from the uniform by using the inverse-probability 
transformation method explained in section 5.4.1, that is, by using 


generate varname = invnormal(runiform())


This alternative method is computationally slower than using the rnormal()
function.


The following code generates and summarizes three pseudorandom 
variables with 1,000 observations each. The pseudorandom variables have 
distributions uniform(0, 1), standard normal, and normal with a mean of 5 
and a standard deviation of 2. 


. * Normal and uniform draws
. clear

. qui set obs 1000

. set seed 10101                       // Set the seed

. generate uniform = runiform()        // uniform(0,1)

. generate stnormal = rnormal()        // N(0,1)

. generate norm5and2 = rnormal(5,2)

. tabstat uniform stnormal norm5and2, stat(mean sd skew kurt min max) col(stat)

    Variable |      Mean        SD  Skewness  Kurtosis       Min       Max
-------------+------------------------------------------------------------
     uniform |  .5031233  .2922443 -.0264117  1.809826  .0005628  .9996441
    stnormal |  .0230095  1.023687  -.193405   3.12781 -4.119125  2.615001
   norm5and2 |  5.090733  1.983719 -.1014361  2.986503 -1.140269  11.52456


The sample mean and other sample statistics are random variables; therefore, 
their values will, in general, differ from the true population values. As the 
number of observations grows, each sample statistic will converge to the 
population parameter because each sample statistic is a consistent estimator 
for the population parameter. 


For norm5and2, the sample mean and standard deviation are very close to 
the theoretical values of 5 and 2. Output from tabstat gives a skewness 
statistic of -0.101 and a kurtosis statistic of 2.987, close to 0 and 3,
respectively. 


For draws from the truncated normal, see section 5.4.4, and for draws 
from the multivariate normal, see section 5.4.5. 


5.2.3 Draws from t, chi-squared, F, gamma, and beta 


Stata’s library of functions contains a number of generators that allow the 
user to draw directly from a number of common continuous distributions. 
The function formats are similar to that of the rnormal() function, and the
arguments can be a number or a variable.


Let $t(n)$ denote Student’s t distribution with $n$ degrees of freedom,
$\chi^2(m)$ denote the chi-squared distribution with $m$ degrees of freedom, and
$F(h, n)$ denote the F distribution with $h$ and $n$ degrees of freedom. Draws
from $t(n)$ and $\chi^2(h)$ can be made directly by using the rt(df) and
rchi2(df) functions. We then generate $F(h, n)$ draws by transformation because a
function for drawing directly from the F distribution is not available.


The following example generates draws from $t(10)$, $\chi^2(10)$, and
$F(10, 5)$.


. * Student’s t, chi-squared, and F draws with constant degrees of freedom
. clear

. qui set obs 2000

. set seed 10101

. generate xt = rt(10)            // Result xt ~ t(10)

. generate xc = rchi2(10)         // Result xc ~ chisquared(10)

. generate xfn = rchi2(10)/10     // Result numerator of F(10,5)

. generate xfd = rchi2(5)/5       // Result denominator of F(10,5)

. generate xf = xfn/xfd           // Result xf ~ F(10,5)

. summarize xt xc xf

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          xt |      2,000    .0141702    1.144066  -5.294379   5.451671
          xc |      2,000    10.10297    4.585638   1.403877   32.81162
          xf |      2,000    1.687583    2.339381   .0818754   45.02815


The $t(10)$ draws have a sample mean and a standard deviation close to the
theoretical values of 0 and $\sqrt{10/(10 - 2)} = 1.118$; the $\chi^2(10)$ draws have a
sample mean and a standard deviation close to the theoretical values of 10
and $\sqrt{20} = 4.472$; and the $F(10, 5)$ draws have a sample mean close to the
theoretical value of $5/(5 - 2) = 1.67$. The sample standard deviation of
2.339 differs from the theoretical standard deviation of
$\sqrt{2 \times 5^2 \times 13/(10 \times 3^2 \times 1)} = 2.687$. This is because of randomness, and
a much larger number of draws eliminates this divergence.


Using rbeta(a,b), we can draw from a two-parameter beta with the
shape parameters $a, b > 0$, mean $a/(a + b)$, and variance
$ab/\{(a + b)^2(a + b + 1)\}$. Using rgamma(a,b), we can draw from a
two-parameter gamma with the shape parameter $a > 0$, scale parameter $b > 0$,
mean $ab$, and variance $ab^2$.
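The moment formulas above can be checked by simulation. The sketch below uses the illustrative parameter values $(a, b) = (2, 3)$ and the hypothetical variable names xbeta and xgamma; it should be run separately from the session in the text because clear drops the data in memory.

```stata
* Compare sample moments with theory for beta(2,3) and gamma(2,3) draws
clear
set obs 10000
set seed 10101
generate xbeta  = rbeta(2,3)     // mean 2/5 = 0.4; variance 6/(25*6) = 0.04
generate xgamma = rgamma(2,3)    // mean 2*3 = 6;   variance 2*3^2 = 18
summarize xbeta xgamma
```

The reported standard deviations should be compared with $\sqrt{0.04} = 0.2$ and $\sqrt{18} = 4.243$.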


5.2.4 Draws from binomial, Poisson, and negative binomial 


Stata functions also generate draws from some leading discrete distributions. 
Again, the arguments can be a number or a variable. 


Let Bin$(n, p)$ denote the binomial distribution with a positive integer number
of trials $n$ and success probability $p$, $0 < p < 1$, and let Poisson$(m)$ denote
the Poisson distribution with the mean or rate parameter $m$. The
rbinomial(n,p) function generates random draws from the binomial distribution, and
the rpoisson(m) function makes draws from the Poisson distribution.
Additionally, we use the runiformint(a,b) function, which generates
equally likely integer values between integer a and integer b.


We demonstrate these functions with an argument that is a variable so 
that the parameters differ across draws. 


Independent (but not identically distributed) draws from the binomial 


As an illustration, we consider draws from the binomial distribution when both
the probability p and the number of trials n may vary over i. 


. * Discrete random variables: Binomial draws with n and p varying over trials
. set seed 10101

. generate p1 = runiform()                // Here p1~uniform(0,1)

. generate trials = runiformint(1,10)     // Here # trials varies btwn 1 & 10

. generate xbin = rbinomial(trials,p1)    // Draws from binomial(n,p1)

. summarize p1 trials xbin

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          p1 |      2,000    .4974725    .2909053   .0005628   .9996441
      trials |      2,000       5.497    2.850097          1         10
        xbin |      2,000      2.7325    2.479116          0         10


The DGP setup implies that the number of trials $n$ is a random variable with
an expected value of 5.5 and that the probability $p$ is a random variable with
an expected value of 0.5. Because $n$ and $p$ are drawn independently, we expect
that xbin has a mean of $5.5 \times 0.5 = 2.75$, and this is approximately the
case here.


Independent (but not identically distributed) draws from Poisson 


For simulating a Poisson regression DGP, denoted $y \sim \text{Poisson}(\mu)$, we need
to make draws that are independent but not identically distributed, with the
mean $\mu$ varying across draws because of regressors.


We do so in two ways. First, let $\mu_i$ equal xb=4+2*x with x=runiform().
Then $4 < \mu_i < 6$. Second, let $\mu_i$ equal xb times xg where xg=rgamma(1,1),
which yields a draw from the gamma distribution with a mean of $1 \times 1 = 1$
and a variance of $1 \times 1^2 = 1$. Then $\mu_i > 0$. In both cases, the setup can be
shown to be such that the ultimate draw has a mean of 5, but the variance
differs from the value of 5 for the independent and identically distributed
(i.i.d.) Poisson because in neither case are the draws from an identical
distribution. We obtain


. * Discrete random variables: Poisson and negative binomial draws i.i.d.
. set seed 10101

. generate xb = 4 + 2*runiform()

. generate xg = rgamma(1,1)       // Draw from gamma;E(v)=1

. generate xbh = xb*xg            // Apply multiplicative heterogeneity

. generate xp = rpoisson(5)       // Result xp ~ Poisson(5)

. generate xp1 = rpoisson(xb)     // Result xp1 ~ Poisson(xb)

. generate xp2 = rpoisson(xbh)    // Result xp2 ~ NB(xb)

. summarize xg xb xp xp1 xp2

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          xg |      2,000    1.000795    1.011366   .0016444   10.49622
          xb |      2,000    4.994945    .5818107   4.001126   5.999288
          xp |      2,000       5.001    2.210077          0         16
         xp1 |      2,000       4.977    2.353559          0         14
         xp2 |      2,000       4.928    5.509991          0         50


The xb variable lies between 4 and 6, as expected, and the xg gamma
variable has a mean and variance close to 1, as expected. For a benchmark
comparison, we make draws of xp from Poisson(5), which has a sample
mean close to 5 and a sample standard deviation close to $\sqrt{5} = 2.236$. Both
xp1 and xp2 have means close to 5. In the case of xp2, the model has the
multiplicative unobserved heterogeneity term xg, which is itself drawn from
a gamma distribution with shape and scale parameter both set to 1.
Introducing this type of heterogeneity means that xp2 is drawn from a
distribution with the same mean as that of xp1, but the variance of the
distribution is larger. More specifically, Var(xp2|xb) = xb*(1+xb), using
results in section 20.2.2, leading to the much larger standard deviation for
xp2.


The second example makes a draw from the Poisson-gamma mixture,
yielding the negative binomial distribution. The rnbinomial() function
draws from a different parameterization of the negative binomial
distribution. Thus, we draw from the Poisson-gamma mixture here and in
section 20.2.
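For readers who want to use rnbinomial() directly, the sketch below shows one way the two parameterizations can be reconciled. It rests on an assumption that is not taken from the text: that rnbinomial(n,p) returns the number of failures before the nth success, so that setting $n = 1/\alpha$ and $p = 1/(1 + \alpha\mu)$ gives a mean of $\mu$ and a variance of $\mu(1 + \alpha\mu)$. The names alpha and xnb are hypothetical, and the code should be run separately because clear drops the data in memory.

```stata
* Sketch: negative binomial draws via rnbinomial() in mean-dispersion form
* (assumed mapping: n = 1/alpha, p = 1/(1 + alpha*mu))
clear
set obs 2000
set seed 10101
scalar alpha = 1                                       // dispersion, as with rgamma(1,1)
generate xb  = 4 + 2*runiform()                        // conditional mean mu
generate xnb = rnbinomial(1/alpha, 1/(1 + alpha*xb))   // mean xb, variance xb*(1+alpha*xb)
summarize xb xnb
```

The draws differ from those of xp2 because different random numbers are used, but the sample mean and standard deviation should be of similar size.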


Histograms and density plots 


For a visual depiction, it is often useful to plot a histogram or kernel density 
estimate of the generated random numbers. Here we do this for the draws xc 
from $\chi^2(10)$ and xp from Poisson(5). The results are shown in figure 5.1.


. * Example of histogram and kernel density plus graph combine
. qui twoway (histogram xc, width(1)) (kdensity xc, lwidth(thick)),
>     legend(off) title("Draws from chisquared(10)") saving(graph1.gph, replace)

. qui twoway (histogram xp, discrete) (kdensity xp, lwidth(thick) w(1)),
>     legend(off) title("Draws from Poisson(mu=5)") saving(graph2.gph, replace)

. graph combine graph1.gph graph2.gph, iscale(1.2) ysize(2.5) xsize(6.0)


[Figure: two panels titled "Draws from chisquared(10)" and "Draws from Poisson(mu=5)", each a histogram overlaid with a kernel density estimate]


Figure 5.1. $\chi^2(10)$ and Poisson(5) draws


5.3 Distribution of the sample mean 


As an introductory example of simulation, we demonstrate the central limit
theorem result, $\sqrt{N}(\bar{x}_N - \mu)/\sigma \to N(0, 1)$; that is, the sample mean is
approximately normally distributed as $N \to \infty$. We consider a random
variable that has the uniform distribution and a sample size of 30.


We begin by drawing a single sample of size 30 of the random variable
$X$ that is uniformly distributed on (0, 1), using the runiform() random-number
function. To ensure the same results are obtained in future runs of
the same code or on other machines, we use set seed. We have


. * Draw 1 sample of size 30 from uniform distribution
. clear all

. qui set obs 30

. set seed 10101

. generate x = runiform()


To see the results, we use summarize and histogram. We have 


. * Summarize x and produce a histogram
. summarize x

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           x |         30     .380269    .2706183    .014999    .886691

. histogram x, width(0.1) xtitle("x's from one sample") scale(1.2)
(bin=9, start=.01499897, width=.1)


[Figure: histogram of the 30 draws; vertical axis "Density", horizontal axis "x's from one sample"]


Figure 5.2. Histogram for one sample of size 30


The summary statistics show that 30 observations were generated and that 
for this sample $\bar{x} = 0.380$.


The histogram for this single sample of 30 uniform draws, given in 
figure 5.2, looks nothing like the bell-shaped curve of a normal because we 
are sampling from the uniform distribution. For very large samples, this 
histogram approaches a horizontal line with a density value of 1. 


To obtain the distribution of the sample mean by simulation, we redo the
preceding 10,000 times, obtaining 10,000 samples of size 30 and 10,000
sample means $\bar{x}$. These 10,000 sample means are draws from the distribution
of the sample-mean estimator. By the central limit theorem, the distribution
of the sample-mean estimator has approximately a normal distribution.
Because the mean of a uniform(0, 1) distribution is 0.5, the mean of the
distribution of the sample-mean estimator is 0.5. Because the standard
deviation of a uniform(0, 1) distribution is $\sqrt{1/12}$ and each of the 10,000
samples is of size 30, the standard deviation of the distribution of the
sample-mean estimator is $\sqrt{(1/12)/30} = 0.0527$.


5.3.1 Stata program 


A mechanism for repeating the same statistical procedure 10,000 times is to 
write a program (see appendix A.2 for more details) that does the procedure 
once and use the simulate prefix to run the program 10,000 times. 


We name the program onesample and define it to be r-class, meaning
that the ultimate result, the sample mean for one sample, is returned in r().
Because we name this result meanforonesample, it will be returned in
r(meanforonesample). The program has no inputs, so there is no need for
program arguments. The program drops any existing data on variable x, sets
the sample size to 30, draws 30 uniform variates, and obtains the sample
mean with summarize. The summarize command is itself an r-class
command that stores the sample mean in r(mean); see section 1.6.1. The last
line of the program returns r(mean) as the result meanforonesample.


The program is 


. * Program to draw 1 sample of size 30 from uniform and return sample mean
. program onesample, rclass
        drop _all
        qui set obs 30
        generate x = runiform()
        summarize x
        return scalar meanforonesample = r(mean)
  end


To check the program, we run it once, using the same seed as earlier. We 
obtain 


. * Run program onesample once as a check
. set seed 10101

. onesample

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           x |         30     .380269    .2706183    .014999    .886691

. return list

scalars:
   r(meanforonesample) =  .3802689793519676


The results for one sample are exactly the same as those given earlier. 


5.3.2 The simulate prefix 


The simulate prefix runs a specified command # times, where the user 
specifies #. The basic syntax is 


simulate [exp_list], reps(#) [options]: command


where command is the name of the command, often a user-written program,
and # is the number of simulations or replications. The quantities to be
calculated and stored from command are given in exp_list. We provide
additional details on simulate in section 5.6.1.


After simulate is run, the Stata dataset currently in memory is replaced
by a dataset that has # observations, with a separate variable for each of the
quantities given in exp_list.


5.3.3 Central limit theorem simulation 


The simulate prefix can be used to run the onesample program 10,000
times, yielding 10,000 sample means from samples of size 30 of uniform
variates. We additionally use options that set the seed and suppress the
output of a dot for each of the 10,000 simulations. We have


. * Run program onesample 10,000 times to get 10,000 sample means
. simulate xbar = r(meanforonesample), seed(10101) reps(10000) nodots: onesample

      Command: onesample
         xbar: r(meanforonesample)


The result from each sample, r(meanforonesample), is stored as the
variable xbar.


The simulate prefix overwrites any existing data with a dataset of
10,000 “observations” on $\bar{x}$. We summarize these values, expecting them to
have a mean close to 0.5 and a standard deviation close to 0.0527. We also
present a histogram overlaid by a normal density curve whose mean and
standard deviation are those of the 10,000 values of $\bar{x}$. We have


. * Summarize the 10,000 sample means and draw histogram
. summarize xbar

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        xbar |     10,000     .499825    .0521404    .306084   .6868161

. histogram xbar, normal xtitle("Sample mean xbar from many samples") scale(1.2)
(bin=40, start=.30608404, width=.0095183)


[Figure: histogram of xbar overlaid with a normal density; vertical axis "Density", horizontal axis "Sample mean xbar from many samples"]


Figure 5.3. Histogram of the 10,000 sample means, each from a
sample of size 30


The histogram given in figure 5.3 is very close to the bell-shaped curve 
of the normal. 


There are several possible variations on this example. Different
distributions for x can be used with different random-number functions in
the generate command for x. As the sample size (set obs) increases, the
distribution of the sample mean becomes closer to a normal distribution, and
as the number of simulations (reps) increases, the histogram measures that
distribution more precisely.
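One such variation can be sketched as follows. The program name onesamplechi2 and its two arguments, n and df, are hypothetical choices for this sketch; program arguments are covered in appendix A.2. Each replication draws n observations from a chi-squared distribution with df degrees of freedom, so with n = 100 and df = 2, the sample means should be centered near 2 with a standard deviation near $\sqrt{4/100} = 0.2$.

```stata
* Variation: sample mean of n draws from chisquared(df)
* (program name and arguments are illustrative assumptions)
capture program drop onesamplechi2
program onesamplechi2, rclass
    args n df
    drop _all
    quietly set obs `n'
    generate x = rchi2(`df')
    quietly summarize x
    return scalar meanforonesample = r(mean)
end

* 1,000 replications with samples of size 100 from chisquared(2)
simulate xbar = r(meanforonesample), seed(10101) reps(1000) nodots: onesamplechi2 100 2
summarize xbar
```

Changing the two numbers after the program name changes the sample size and the degrees of freedom without editing the program.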


5.3.4 The postfile command 


In this book, we generally use simulate to perform simulations. An
alternative method is to use a looping command, such as forvalues, and,
within each iteration of the loop, use post to write (or post) key results to a
file that is declared in the postfile command. After the loop ends, we then
analyze the data in the posted file.


The postfile command has the following basic syntax,


postfile postname newvarlist using filename [, every(#) replace]


where postname is an internal filename, newvarlist contains the names of the
variables to be put in the dataset, and filename is the external filename.


The post postname (exp1) (exp2) ... command is used to write exp1,
exp2, ... to the file. Each expression needs to be enclosed in parentheses.


The postclose postname command ends the posting of observations. 


The postfile command offers more flexibility than simulate and, 
unlike simulate, does not lead to the dataset in memory being overwritten. 
For the examples in this book, simulate is adequate. 


5.3.5 Alternative central limit theorem simulation 


We illustrate the use of postfile for the central limit theorem example. We
have


. * Simulation using postfile
. set seed 10101

. postfile sim_mem xmean using simresults, replace
(file simresults.dta not found)

. forvalues i = 1/10000 {
        drop _all
        qui set obs 30
        tempvar x
        generate `x' = runiform()
        qui summarize `x'
        post sim_mem (r(mean))
  }

. postclose sim_mem


The postfile command declares the memory object in which the results are
stored, the names of variables in the results dataset, and the name of the
results dataset file. In this example, the memory object will be named
sim_mem, xmean will be the only variable in the results dataset file, and
simresults.dta will be the results dataset file. (The replace option causes
any existing simresults.dta to be replaced.) The forvalues loop (see
section 1.8) is used to perform 10,000 repetitions. At each repetition, the
sample mean, result r(mean), is posted and will be included as an
observation in the new xmean variable in simresults.dta.


To see the results, we need to open simresults.dta and summarize. 


. * See the results stored in simresults
. use simresults, clear

. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       xmean |     10,000     .499825    .0521404    .306084   .6868161


The results are identical to those in section 5.3.3 with simulate due to using 
the same seed and same sequence of evaluation of random-number 
functions. 


The simulate prefix suppresses all output within the simulations. This is 
not the case for the forvalues loop, so the quietly prefix was used in two 
places in the code above. It can be more convenient to instead apply the 
quietly prefix to all commands in the entire forvalues loop. 
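The last point can be sketched as follows. The code reuses the names sim_mem, xmean, and simresults from the example above and wraps the loop in a quietly block, which is one way, though not the only way, to silence every command at once.

```stata
* Sketch: the postfile simulation with one quietly block around the loop
set seed 10101
postfile sim_mem xmean using simresults, replace
quietly {
    forvalues i = 1/10000 {
        drop _all
        set obs 30
        tempvar x
        generate `x' = runiform()
        summarize `x'
        post sim_mem (r(mean))
    }
}
postclose sim_mem
```

A command inside the block whose output is wanted can be prefixed with noisily.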


5.4 Pseudorandom-number generators: Further details 


In this section, we present further details on random-number generation that 
explain the methods used in section 5.2 and are useful for making draws 
from additional distributions. 


Commonly used methods for generating pseudorandom samples include 
inverse-probability transforms, direct transformations, accept-reject
methods, mixing and compounding, and Markov chains. In what follows, we 
emphasize application and refer the interested reader to Cameron and 
Trivedi (2005, chap. 12) or numerous other texts for additional details. 


5.4.1 Inverse-probability transformation 


Let $F(x) = \Pr(X \le x)$ denote the cumulative distribution function (c.d.f.)
of a random variable $X$. Given a draw of a uniform variate $r$, $0 < r < 1$, the
inverse transformation $x = F^{-1}(r)$ gives a unique value of $x$ because $F(x)$
is nondecreasing in $x$. If $r$ approximates well a random draw from the
uniform, then $x = F^{-1}(r)$ will approximate well a random draw from $F(x)$.


A leading application is to the standard normal distribution. Then the
c.d.f.,

$$F(x) = \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz$$

has no closed-form solution, and there is consequently no analytical
expression for $\Phi^{-1}(x)$. Nonetheless, the inverse-transformation method is
easy to implement because numerical analysis provides functions that
calculate a very good approximation to $\Phi^{-1}(x)$. In Stata, the function is
invnormal(). Combining the two steps of drawing a random uniform variate
and evaluating the inverse c.d.f., we have


. * Inverse-probability transformation example: Standard normal
. clear all

. qui set obs 2000

. set seed 10101

. generate xstn = invnormal(runiform())


This was presented in section 5.2.2 but is now superseded by the faster 
rnormal() function. 


As another application, consider drawing from the unit exponential, with
c.d.f. $F(x) = 1 - e^{-x}$. Solving $r = 1 - e^{-x}$ yields $x = -\ln(1 - r)$. If the
uniform draw is, say, 0.640, then $x = -\ln(1 - 0.640) = 1.022$. With a
continuous monotonically increasing c.d.f., the inverse transformation yields
a unique value of $x$, given $r$. The Stata code for generating a draw from the
unit exponential illustrates the method:


. * Inverse-probability transformation example: Unit exponential 
. generate xue = -ln(1-runiform()) 


For discrete random variables, the c.d.f. is a step function. Then the 
inverse is not unique, but it can be uniquely determined by a convention for 
choosing a value on the flat portion of the c.d.f., for example, the left limit of 
the segment. 


In the simplest case, we consider a Bernoulli random variable taking a
value of 1 with a probability of $p$ and a value of 0 with a probability of
$1 - p$. Then we take a uniform draw, $u$, and set $y = 1$ if $u \le p$ and $y = 0$
if $u > p$. Thus, if $p = 0.6$, we obtain the following:


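. * Inverse-probability transformation example: Bernoulli (p = 0.6)
. generate xbernoulli = runiform() > 0.6    // Bernoulli(0.6)

. summarize xstn xue xbernoulli

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        xstn |      2,000   -.0114795    1.012011  -3.257087   3.384994
         xue |      2,000    .9920891    .9782134   .0000276   6.411174
  xbernoulli |      2,000       .3885    .4875311          0          1


This code uses a logical operator that sets $y = 1$ if the condition is met and
$y = 0$ otherwise; see section 2.4.7. Note that the condition as coded,
runiform() > 0.6, holds with a probability of 0.4, which is why the sample mean
of xbernoulli is 0.3885 rather than close to 0.6. Coding the condition as
runiform() < 0.6 follows the rule stated before the code and yields draws that
equal 1 with a probability of 0.6.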


A more complicated discrete example is the Poisson distribution because
then the random variable can potentially take an infinite number of values.
The method is to sequentially calculate the c.d.f. $\Pr(Y \le k)$ for
$k = 0, 1, 2, \ldots$. Then, stop at the first $k$ for which $\Pr(Y \le k) > u$, where
$u$ is the uniform draw, and set $y = k$. For example, consider the Poisson with
a mean of 2 and a uniform draw of 0.701. We first calculate
$\Pr(Y \le 0) = 0.135 < u$, then calculate $\Pr(Y \le 1) = 0.406 < u$, then calculate
$\Pr(Y \le 2) = 0.677 < u$, and finally calculate $\Pr(Y \le 3) = 0.857$. This last
calculation exceeds the uniform draw of 0.701, so stop and set $y = 3$.
$\Pr(Y \le k)$ is computed by using the recursion
$\Pr(Y \le k) = \Pr(Y \le k - 1) + \Pr(Y = k)$.
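The worked example above can be traced in a few lines of code. This is a sketch for a single draw, with the hypothetical scalar names u0, mu0, pk, and cdf; it assumes no variable with one of these names is in memory because a variable would take precedence over a scalar of the same name. It uses the standard Poisson relation $\Pr(Y = k) = \Pr(Y = k - 1) \times \mu/k$ to update the probabilities.

```stata
* Inverse-c.d.f. draw from Poisson(2) for the uniform draw 0.701
scalar u0  = 0.701
scalar mu0 = 2
local  k   = 0
scalar pk  = exp(-mu0)          // Pr(Y = 0)
scalar cdf = pk                 // Pr(Y <= 0)
while cdf <= u0 {
    local ++k
    scalar pk  = pk*mu0/`k'     // Pr(Y = k) = Pr(Y = k-1)*mu/k
    scalar cdf = cdf + pk       // Pr(Y <= k) = Pr(Y <= k-1) + Pr(Y = k)
}
display "y = `k'"               // displays y = 3
```

The loop passes through cumulative probabilities of 0.135, 0.406, and 0.677 before stopping at 0.857, matching the steps in the text.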


5.4.2 Direct transformation 


Suppose we want to make draws from the random variable Y, and from 
probability theory, it is known that Y is a transformation of the random 
variable X, say, Y = g(X). 


In this situation, the direct transformation method obtains draws of $Y$ by
drawing $X$ and then applying the transformation $g(\cdot)$. The method is clearly
attractive when it is easy to draw $X$ and evaluate $g(\cdot)$.


Direct transformation is particularly easy to illustrate for well-known
transforms of a standard normally distributed random variable. A $\chi^2(1)$
draw can be obtained as the square of a draw from the standard normal; a
$\chi^2(m)$ draw is the sum of $m$ independent draws from $\chi^2(1)$; an $F(m_1, m_2)$
draw is $(v_1/m_1)/(v_2/m_2)$, where $v_1$ and $v_2$ are independent draws from
$\chi^2(m_1)$ and $\chi^2(m_2)$; and a $t(m)$ draw is $u/\sqrt{v/m}$, where $u$ and $v$ are
independent draws from $N(0, 1)$ and $\chi^2(m)$.
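These transformations can be sketched using only rnormal(). The degrees of freedom $m = 5$ and the variable names xc1, xc5, and xt5 are illustrative assumptions; the code should be run separately from the session in the text because clear drops the data in memory.

```stata
* Direct transformation: chi-squared and t draws built from N(0,1) draws
clear
set obs 2000
set seed 10101
generate xc1 = rnormal()^2              // chisquared(1): square of a N(0,1) draw
generate xc5 = 0
forvalues j = 1/5 {
    replace xc5 = xc5 + rnormal()^2     // chisquared(5): sum of 5 chisquared(1) draws
}
generate xt5 = rnormal()/sqrt(xc5/5)    // t(5): u/sqrt(v/m), with u independent of v
summarize xc1 xc5 xt5
```

The sample means can be compared with the theoretical values of 1, 5, and 0. In practice, rchi2() and rt() do the same job in one step.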


5.4.3 Other methods 


In some cases, a distribution can be obtained as a mixture of distributions. A
leading example is the negative binomial, which can be obtained as a
Poisson-gamma mixture (see section 5.2.4). Specifically, if $y|\lambda$ is
Poisson($\lambda$) and $\lambda|\mu, \alpha$ is gamma with a mean of $\mu$ and a variance of
$\alpha\mu^2$, then $y|\mu, \alpha$ is negative binomial distributed with a mean of $\mu$
and a variance of $\mu + \alpha\mu^2$. This implies that we can draw from the negative
binomial by using a two-step method in which we first draw (say, $\nu$) from
the gamma distribution with a mean equal to 1 and then, conditional on $\nu$,
draw from Poisson($\mu\nu$). This example, using mixing, is used again in
chapter 20.


More advanced methods include accept-reject algorithms and
importance sampling. Many of Stata’s pseudorandom-number generators use
accept-reject algorithms. Type help random number functions for more
information on the methods used by Stata.


5.4.4 Draws from truncated normal 


In simulation-based estimation for latent normal models with censoring or 
selection, it is often necessary to generate draws from a truncated normal 
distribution. The inverse-probability transformation can be extended to 
obtain draws in this case. 


Consider making draws from a truncated normal. Then
$X \sim TN_{(a,b)}(\mu, \sigma^2)$, where without truncation $X \sim N(\mu, \sigma^2)$. With
truncation, realizations of $X$ are restricted to lie between left truncation
point $a$ and right truncation point $b$.


For simplicity, first consider the standard normal case ($\mu = 0$, $\sigma = 1$),
and let $Z \sim N(0, 1)$. Given the draw $u$ from the uniform distribution, $x$ is
defined by the solution of the inverse-probability transformation equation

$$u = \frac{\Pr(a < Z \le x)}{\Pr(a < Z \le b)} = \frac{\Phi(x) - \Phi(a)}{\Phi(b) - \Phi(a)}$$

Rearranging, $\Phi(x) = \Phi(a) + \{\Phi(b) - \Phi(a)\}u$ so that solving for $x$, we
obtain

$$x = \Phi^{-1}[\Phi(a) + \{\Phi(b) - \Phi(a)\}u]$$

To extend this to the general case, note that if $Z \sim N(\mu, \sigma^2)$ then
$Z^* = (Z - \mu)/\sigma \sim N(0, 1)$, and the truncation points for $Z^*$, rather than $Z$,
are $a^* = (a - \mu)/\sigma$ and $b^* = (b - \mu)/\sigma$. Then,

$$x = \mu + \sigma\Phi^{-1}[\Phi(a^*) + \{\Phi(b^*) - \Phi(a^*)\}u]$$

As an example, we consider draws from $N(5, 4^2)$ for a random variable
truncated to the range [0, 12].


. * Draws from truncated normal x ~ N(mu,sigma^2) in [a,b]
. qui set obs 2000

. set seed 10101

. scalar a = 0          // Lower truncation point

. scalar b = 12         // Upper truncation point

. scalar mu = 5         // Mean

. scalar sigma = 4      // Standard deviation

. generate u = runiform()

. generate w = normal((a-mu)/sigma) + u*(normal((b-mu)/sigma) - normal((a-mu)/sigma))

. generate xtrunc = mu + sigma*invnormal(w)

. summarize xtrunc

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      xtrunc |      2,000    5.421001    2.973021   .0105123   11.98595


Here there is more truncation from below because $a$ is $1.25\sigma$ from $\mu$,
whereas $b$ is $1.75\sigma$ from $\mu$, so we expect the truncated mean to exceed the
untruncated mean. Accordingly, the sample mean is 5.421 compared with
the untruncated mean of 5. Truncation reduces the range and, for most but
not all distributions, will reduce the variability. The sample standard
deviation of 2.973 is less than the untruncated standard deviation of 4.


An alternative way to draw $X \sim TN_{(a,b)}(\mu, \sigma^2)$ is to keep drawing from
untruncated $N(\mu, \sigma^2)$ until the realization lies in $(a, b)$. This method will be
very inefficient if, for example, $(a, b) = (-0.01, 0.01)$. A Poisson example
is given in section 20.4.
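This alternative can be sketched for the same $N(5, 4^2)$ example truncated to [0, 12]. The variable name xar and the local macro nout are hypothetical, and the code should be run separately from the session in the text because clear drops the data in memory. Draws that fall outside the bounds are redrawn until none remain.

```stata
* Sketch: truncated normal draws by redrawing until all lie in [a,b]
clear
set obs 2000
set seed 10101
scalar a = 0
scalar b = 12
scalar mu = 5
scalar sigma = 4
generate xar = rnormal(mu, sigma)
quietly count if !inrange(xar, a, b)
local nout = r(N)                        // number of draws outside [a,b]
while `nout' > 0 {
    quietly replace xar = rnormal(mu, sigma) if !inrange(xar, a, b)
    quietly count if !inrange(xar, a, b)
    local nout = r(N)
}
summarize xar
```

With these truncation points, most draws are accepted on the first pass, so the loop ends after a few rounds. With narrow bounds, the number of rounds grows quickly, which is the inefficiency noted above.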


5.4.5 Draws from multivariate normal 


Making draws from multivariate distributions is generally more complicated. 
The method depends on the specific case under consideration, and inverse- 
transformation methods and transformation methods that work in the 
univariate case may no longer apply. 


However, making draws from the multivariate normal is relatively 
straightforward because, unlike most other distributions, linear combinations 
of normals are also normal. 


Direct draws from multivariate normal 


The drawnorm command generates draws from $N(\mu, \Sigma)$ for the user-specified
vector $\mu$ and matrix $\Sigma$. For example, consider making 1,000 draws
from a bivariate normal distribution with means of 10 and 20,
variances of 4 and 9, and a correlation of 0.5 (so the covariance is 3).


. * Bivariate normal example: Means 10, 20; variances 4, 9; and correlation 0.5
. clear

. qui set obs 1000

. set seed 10101

. matrix MU = (10,20)                       // MU is 1 x 2

. scalar sig12 = 0.5*sqrt(4*9)

. matrix SIGMA = (4, sig12 \ sig12, 9)      // SIGMA is 2 x 2

. drawnorm y1 y2, means(MU) cov(SIGMA)

. summarize y1 y2

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          y1 |      1,000    10.00834    2.026031    1.76175   16.18743
          y2 |      1,000    20.11815    3.012645   10.63777   30.54572

. correlate y1 y2
(obs=1,000)

             |       y1       y2
-------------+------------------
          y1 |   1.0000
          y2 |   0.4990   1.0000


The sample means are close to 10 and 20, and the standard deviations are
close to $\sqrt{4} = 2$ and $\sqrt{9} = 3$. The sample correlation of 0.499 differs very
slightly from 0.50.


Transformation using Cholesky decomposition 


The method uses the result that if $z \sim N(0, I)$, then
$x = \mu + Lz \sim N(\mu, LL')$. It is easy to draw $z \sim N(0, I)$ because $z$ is just
a column vector of univariate normal draws. The transformation method to
make draws of $x \sim N(\mu, \Sigma)$ evaluates $x = \mu + Lz$, where the matrix $L$
satisfies $LL' = \Sigma$. More than one matrix $L$ satisfies $LL' = \Sigma$, the matrix
analog of the square root of $\Sigma$. Standard practice is to use the Cholesky
decomposition that restricts $L$ to be a lower triangular matrix. Specifically,
for the trivariate normal distribution, let $\mathrm{Var}(x) = \Sigma = E(Lzz'L') = LL'$, where
$z \sim N(0, I_3)$ and

$$L = \begin{pmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{pmatrix}$$

Then the following transformations of $z' = (z_1\ z_2\ z_3)$ yield the desired
multivariate normal vector $x \sim N(\mu, \Sigma)$:

$$\begin{aligned}
x_1 &= \mu_1 + l_{11} z_1 \\
x_2 &= \mu_2 + l_{21} z_1 + l_{22} z_2 \\
x_3 &= \mu_3 + l_{31} z_1 + l_{32} z_2 + l_{33} z_3
\end{aligned}$$
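The same transformation can be sketched for the bivariate case with means of 10 and 20, variances of 4 and 9, and a covariance of 3. The sketch relies on Stata's cholesky() matrix function to obtain $L$; the variable names z1, z2, x1, and x2 are hypothetical, and the code should be run separately from the session in the text because clear drops the data in memory.

```stata
* Sketch: bivariate normal draws via the Cholesky transformation x = mu + L*z
clear
set obs 1000
set seed 10101
matrix MU = (10, 20)
matrix SIGMA = (4, 3 \ 3, 9)
matrix L = cholesky(SIGMA)                          // lower triangular, L*L' = SIGMA
matrix list L
generate z1 = rnormal()                             // independent N(0,1) draws
generate z2 = rnormal()
generate x1 = MU[1,1] + L[1,1]*z1                   // x1 = mu1 + l11*z1
generate x2 = MU[1,2] + L[2,1]*z1 + L[2,2]*z2       // x2 = mu2 + l21*z1 + l22*z2
summarize x1 x2
correlate x1 x2
```

Here $l_{11} = 2$, $l_{21} = 1.5$, and $l_{22} = \sqrt{9 - 1.5^2} \approx 2.598$, so the sample correlation of x1 and x2 should be close to 0.5.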
5.4.6 Draws using Markov chain Monte Carlo method 


In some cases, making direct draws from a target joint (multivariate) 
distribution is difficult, so the objective must be achieved in a different way. 
However, if it is possible to make draws from the distribution of a subset
of the variables, conditional on the rest, then one can create a Markov chain
of draws.
If one recursively makes draws from the conditional distribution and if a 
sufficiently long chain is constructed, then the distribution of the draws will, 
under some conditions, converge to the distribution of independent draws 
from the stationary joint distribution. This so-called Markov chain Monte 
Carlo method is now standard in modern Bayesian inference, presented in 
chapters 29 and 30. 


To be concrete, let $Y = (Y_1, Y_2)$ have a bivariate density of
$f(Y) = f(Y_1, Y_2)$, and suppose that the two conditional densities
$f(Y_1|Y_2)$ and $f(Y_2|Y_1)$ are known and that it is possible to make draws
from these. Then it can be shown that alternating sequential draws from
$f(Y_1|Y_2)$ and $f(Y_2|Y_1)$ converge in the limit to draws from
$f(Y_1, Y_2)$, even though in general $f(Y_1, Y_2) \neq f(Y_1|Y_2) f(Y_2|Y_1)$
[recall that $f(Y_1, Y_2) = f(Y_1|Y_2) f(Y_2)$]. The repeated recursive
sampling from $f(Y_1|Y_2)$ and $f(Y_2|Y_1)$ is called the Gibbs sampler.


We illustrate the Markov chain Monte Carlo approach by making draws 
from a bivariate normal distribution, f(Y1, Y2). Of course, when we use the 
drawnorm command, it is quite straightforward to draw samples from the 
bivariate normal. So the application presented is illustrative rather than 
practical. The relative simplicity of this method comes from the fact that the 
conditional distributions $f(Y_1|Y_2)$ and $f(Y_2|Y_1)$ derived from a bivariate
normal are also normal. 


We draw bivariate normal data with means of 0, variances of 1, and a
correlation of $\rho = 0.9$. Then $Y_1|Y_2 = y_2 \sim N\{\rho y_2, (1 - \rho^2)\}$
and $Y_2|Y_1 = y_1 \sim N\{\rho y_1, (1 - \rho^2)\}$. Implementation requires
looping that is
much easier using matrix programming language commands. The following 
Mata code implements this algorithm by using commands explained in 
appendix B.2. 


. * MCMC example: Gibbs for bivariate normal mu's=0 v's=1 corr=rho=0.9
. set seed 10101

. clear all

. set obs 1000
Number of observations (_N) was 0, now 1,000.

. generate double y1 =.
(1,000 missing values generated)

. generate double y2 =.
(1,000 missing values generated)

. mata:
------------------------------------------------- mata (type end to exit) -----
: s0 = 10000            // Burn-in for the Gibbs sampler (to be discarded)

: s1 = 1000             // Actual draws used from the Gibbs sampler

: y1 = J(s0+s1,1,0)     // Initialize y1

: y2 = J(s0+s1,1,0)     // Initialize y2

: rho = 0.90            // Correlation parameter

: for(i=2; i<=s0+s1; i++) {
>     y1[i,1] = ((1-rho^2)^0.5)*(rnormal(1, 1, 0, 1)) + rho*y2[i-1,1]
>     y2[i,1] = ((1-rho^2)^0.5)*(rnormal(1, 1, 0, 1)) + rho*y1[i,1]
> }

: y = y1,y2

: y = y[|(s0+1),1 \ (s0+s1),.|]    // Drop the burn-ins

: mean(y)                          // Means of y1, y2
                 1             2
    +-----------------------------+
  1 |  .0629698832   .0577031161  |
    +-----------------------------+

: variance(y)                      // Variance matrix of y1, y2
[symmetric]
                 1             2
    +-----------------------------+
  1 |  .8735151552                |
  2 |  .7789225722   .8794946316  |
    +-----------------------------+

: correlation(y)                   // Correlation matrix of y1, y2
[symmetric]
                 1             2
    +-----------------------------+
  1 |            1                |
  2 |  .8886739931             1  |
    +-----------------------------+

: end


Many draws may be needed before the chain converges. Here we assume
that a burn-in of 10,000 draws is sufficient, so we discard the first 10,000
draws and keep the remaining 1,000 draws. In a real application, one should
run careful checks to ensure that the chain has indeed converged to the
desired bivariate normal. For the example here, the sample means of $Y_1$ and
$Y_2$ are 0.0630 and 0.0577, differing quite a bit from 0. Similarly, the
sample variances of 0.874 and 0.879 differ from 1. The correlation of 0.889
is close to the desired 0.9. A longer Markov chain or longer burn-in may be
needed to generate numbers with desired properties for this example with
relatively high $\rho$.


Even given convergence of the Markov chain, the sequential draws of 
any random variable will be correlated. The output below shows that for the 
example here, the first-order correlation of sequential draws of y2 is 0.786. 


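. mata:
------------------------------------------------- mata (type end to exit) -----
: y2 = y[|2,2 \ s1,2|]

: y2lag1 = y[|1,2 \ (s1-1),2|]

: y2andlag1 = y2,y2lag1

: correlation(y2andlag1,1)   // Correlation between y2 and y2 lag 1
[symmetric]
                 1             2
    +-----------------------------+
  1 |            1                |
  2 |  .7864381826             1  |
    +-----------------------------+

: end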
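
The size of this serial correlation can be anticipated analytically. The following derivation is a sketch for this particular example (standard normal margins and correlation $\rho$); the innovations $\varepsilon_{1i}$ and $\varepsilon_{2i}$ (independent standard normal) and the effective number of draws $S_{\mathrm{eff}}$ are symbols introduced here for illustration.

```latex
\begin{aligned}
Y_{1}^{(i)} &= \rho\,Y_{2}^{(i-1)} + \sqrt{1-\rho^{2}}\,\varepsilon_{1i} \\
Y_{2}^{(i)} &= \rho\,Y_{1}^{(i)} + \sqrt{1-\rho^{2}}\,\varepsilon_{2i}
             = \rho^{2}\,Y_{2}^{(i-1)} + \rho\sqrt{1-\rho^{2}}\,\varepsilon_{1i}
               + \sqrt{1-\rho^{2}}\,\varepsilon_{2i} \\
\mathrm{Corr}\bigl(Y_{2}^{(i)},\,Y_{2}^{(i-1)}\bigr) &= \rho^{2} = 0.81 \\
S_{\mathrm{eff}} &\approx S\,\frac{1-\rho^{2}}{1+\rho^{2}}
                  = 1000 \times \frac{0.19}{1.81} \approx 105
\end{aligned}
```

The draws of y2 therefore follow a first-order autoregression with coefficient $\rho^2 = 0.81$, in line with the sample value of 0.786. With roughly 105 effective draws, the simulation standard error of each sample mean is about $1/\sqrt{105} \approx 0.10$, so sample means near 0.06 are within the noise expected from 1,000 serially correlated draws.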


5.5 Computing integrals 


Some estimation problems involve integrals, over bounded or unbounded
ranges, that cannot be evaluated analytically. In such cases, the integral
may be numerically calculated.


5.5.1 Quadrature 


For one-dimensional integrals of the form $\int_a^b f(y)\,dy$, where possibly
$a = -\infty$, $b = \infty$, or both, Gaussian quadrature is the standard
method. This approximates the integral by a weighted sum of m terms, where a
larger m gives a better approximation and often even m = 20 can give a good
approximation. The formulas for the weights are quite complicated but are
given in standard numerical analysis books.


Gauss–Hermite quadrature applies to the special case of unbounded
integrals of form $\int_{-\infty}^{\infty} e^{-y^2} r(y)\,dy$. Gauss–Hermite
quadrature approximates this integral by the sum $\sum_{j=1}^{m} w_j r(y_j)$,
where the user provides the number of evaluation points m, while the weights
$w_j$ and evaluation points $y_j$ are determined by the first m orthonormal
Hermite polynomials.
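
To make the weighted sum concrete, the following Mata sketch computes Gauss–Hermite evaluation points and weights by the eigenvalue (Golub–Welsch) method and applies them to $E[\exp\{-\exp(Y)\}]$ for $Y \sim N(0, 1)$. This is an illustration only and is not claimed to be how Stata's estimation commands compute quadrature; the names A, V, x, w, and gh, and the Stata scalar gh_est, are chosen here. The change of variable $y = \sqrt{2}\,x$ converts the normal density into the kernel $e^{-x^2}$.

```stata
* Sketch: Gauss-Hermite nodes and weights via a tridiagonal eigenproblem
mata:
m = 20                                  // Number of evaluation points
A = J(m, m, 0)                          // Symmetric tridiagonal Jacobi matrix
for (k=1; k<m; k++) {
    A[k, k+1] = sqrt(k/2)
    A[k+1, k] = sqrt(k/2)
}
V = .
x = .
symeigensystem(A, V, x)                 // x: evaluation points; V: eigenvectors
w = sqrt(pi()) :* (V[1,.]:^2)           // Weights for the kernel exp(-x^2)
// E[g(Y)], Y ~ N(0,1), with y = sqrt(2)*x and g(y) = exp(-exp(y))
gh = sum(w :* exp(-exp(sqrt(2):*x))) / sqrt(pi())
gh
st_numscalar("gh_est", gh)
end
display "Gauss-Hermite estimate with 20 points: " scalar(gh_est)
```

The weights sum to $\sqrt{\pi}$, the integral of $e^{-x^2}$, which gives a quick check on the construction.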


One-dimensional integrals often appear in regression models with a
random intercept or random effect. In many nonlinear models, this random
effect does not integrate out analytically. Most often, the random effect is
normal so that integration is over $(-\infty, \infty)$, and Gauss–Hermite
quadrature is used. A leading example is the random-effects estimator for
nonlinear panel models fit using various xt commands.


In higher dimensions, Gauss–Hermite quadrature does not always
provide an adequate approximation, and adaptive Gauss–Hermite quadrature
may provide a better approximation. The Stata me and gsem commands for
nonlinear models offer as possible methods mean-variance adaptive
Gauss–Hermite quadrature, mode-curvature adaptive Gauss–Hermite quadrature,
and nonadaptive Gauss–Hermite quadrature. The quadrature methods use a
Cholesky decomposition to reduce a multidimensional problem to a series of
one-dimensional Gauss–Hermite quadratures. See [ME] meglm for a detailed
discussion.


For normal integrals, an alternative is to use a Laplacian approximation.
For simplicity, consider the scalar case. Suppose $f(y)$ has a maximum at
$y_0$, so that $f'(y_0) = 0$. Then $f(y) \approx f(y_0) + f''(y_0)(y - y_0)^2/2$
by a second-order Taylor expansion. The integral
$\int_{-\infty}^{\infty} e^{f(y)}\,dy$ is then approximated by

$$e^{f(y_0)} \int_{-\infty}^{\infty} e^{f''(y_0)(y - y_0)^2/2}\,dy = \sqrt{2\pi/|f''(y_0)|}\; e^{f(y_0)}$$

The me and gsem commands provide Laplacian approximation as an option; it is
not as accurate as Gauss–Hermite quadrature but is faster.


5.5.2 Monte Carlo integration 


Suppose the integral is of the form

$$E\{h(Y)\} = \int_a^b h(y) g(y)\,dy$$

where $g(y)$ is a density function. This can be estimated by the direct Monte
Carlo integral estimate

$$\widehat{E}\{h(Y)\} = S^{-1} \sum_{s=1}^{S} h(y^s)$$

where $y^1, \ldots, y^S$ are S independent pseudorandom numbers from the
density $g(y)$, obtained by using methods described earlier. This method
works if $E\{h(Y)\}$ exists and $S \to \infty$.


This method can be applied to integrals over both bounded and unbounded
ranges. It has the added advantage of being immediately applicable to
multidimensional integrals, provided we can draw from the appropriate
multivariate distribution. It has the disadvantage that it will always provide
an estimate, even if the integral does not exist. For example, to obtain E(Y)
for the Cauchy distribution, we could average S draws from the Cauchy. But
this would be wrong because the mean of the Cauchy does not exist.
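
The Cauchy caveat is easy to see numerically. The sketch below is illustrative only: it compares running means of Cauchy draws with running means of standard normal draws. The Cauchy draws use the inverse-cdf transformation $\tan\{\pi(U - 0.5)\}$ of a uniform draw U, and all variable names are chosen here for illustration.

```stata
* Sketch: running means of Cauchy draws do not settle down
clear
set obs 100000
set seed 10101
generate double ycauchy = tan(_pi*(runiform() - 0.5))   // Cauchy draw
generate double ynormal = rnormal()                     // N(0,1) draw
generate long s = _n
generate double runmean_c = sum(ycauchy)/s              // Running mean, Cauchy
generate double runmean_n = sum(ynormal)/s              // Running mean, normal
list s runmean_c runmean_n if inlist(s, 100, 1000, 10000, 100000), clean
```

The running mean of the normal draws should shrink toward 0 as s grows, whereas the running mean of the Cauchy draws can jump whenever an extreme draw arrives, no matter how large s is. A Monte Carlo average is therefore informative only when the expectation is known to exist.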


As an example, we consider the computation of $E[\exp\{-\exp(Y)\}]$
when $Y \sim N(0, 1)$. This is the integral:

$$E[\exp\{-\exp(Y)\}] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\{-\exp(y)\} \exp(-y^2/2)\,dy$$

It has no closed-form solution but can be proved to exist. We use the
estimate

$$\widehat{E}[\exp\{-\exp(Y)\}] = \frac{1}{S} \sum_{s=1}^{S} \exp\{-\exp(y^s)\}$$

where $y^s$ is the sth draw of S draws from the N(0, 1) distribution.


This approximation task can be accomplished for a specified value of S, 
say, 100, by using the following code. 


. * Integral evaluation by Monte Carlo simulation with S=100
. clear all

. qui set obs 100

. set seed 10101

. generate double y = invnormal(runiform())    // N(0,1) draws

. generate double gy = exp(-exp(y))

. qui summarize gy, meanonly

. scalar Egy = r(mean)

. display "After 100 draws the MC estimate of E[exp{-exp(x)}] is " Egy
After 100 draws the MC estimate of E[exp{-exp(x)}] is .42950987


The Monte Carlo estimate of the integral is 0.430, based on 100 draws. 
5.5.3 Monte Carlo integration using different S 


It is not known in advance what value of S will yield a good Monte Carlo
approximation to the integral. We can compare the outcome for several
different values of S (including S = 100), stopping when the estimates
stabilize.


To investigate this, we replace the preceding code by a Stata program 
that has as an argument S, the number of simulations. The program can then 
be called and run several times with different values of S. 


The program is named mcintegration. The number of simulations is
passed to the program as a positional argument, which the args command
stores in the local macro numsims. Because numsims is a local macro within
the program, it is referenced using single quotes, as `numsims'. The call to
the program needs to include a value for numsims. Appendix A.2 provides the
details on writing a Stata program. The program is r-class and returns
results for a single scalar, $E\{g(y)\}$, where $g(y) = \exp\{-\exp(y)\}$.


. * Program mcintegration to compute E{g(y)} using numsims draws
. program mcintegration, rclass
        version 17
        args numsims      // Call to program will include value for numsims
        drop _all
        qui set obs `numsims'
        set seed 10101
        generate double y = rnormal(0)
        generate double gy = exp(-exp(y))
        qui summarize gy
        scalar Egy = r(mean)
        scalar seEgy = r(sd)/sqrt(`numsims')
        display "#sims:" %7.0f `numsims' " MC estimate is " Egy
>           " Standard error is " seEgy
  end


The program is then run several times, for S = 10, 100, 1000, 10000, 
and 100000. 


. * Run program mcintegration with S = 10, 100, ..., 100000 draws
. mcintegration 10
#sims:     10 MC estimate is .26308196 Standard error is .08293082

. mcintegration 100
#sims:    100 MC estimate is .36566172 Standard error is .02711514

. mcintegration 1000
#sims:   1000 MC estimate is .37957876 Standard error is .00837261

. mcintegration 10000
#sims:  10000 MC estimate is .38065078 Standard error is .00267433

. mcintegration 100000
#sims: 100000 MC estimate is .38162497 Standard error is .00084048


The estimates of $E\{g(y)\}$ stabilize as $S \to \infty$, but even with
$S = 10^5$, the estimate changes in the third decimal place. The standard
error of the estimate declines with the number of simulations, but even with
100,000 draws, the standard error of 0.00084048 equals 0.22% of the estimated
value of 0.38162497.
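
The rate at which the standard error falls can be read off the output. The arithmetic below is a back-of-the-envelope sketch; $s_g$ denotes the standard deviation of $g(y)$ across draws and $\mathrm{se}^{*}$ a target standard error, both symbols introduced here for illustration.

```latex
\begin{aligned}
\mathrm{se}\bigl(\widehat{E}\bigr) &= s_g/\sqrt{S}
  \;\Longrightarrow\; s_g \approx 0.00084048 \times \sqrt{100000} \approx 0.266 \\
S &\approx \bigl(s_g/\mathrm{se}^{*}\bigr)^{2}
  = (0.266/0.0001)^{2} \approx 7.1 \times 10^{6}
\end{aligned}
```

Each hundredfold increase in S cuts the standard error by a factor of 10 (compare 0.00837 at S = 1000 with 0.00084 at S = 100000), so pinning down the fourth decimal place by direct Monte Carlo would take on the order of 7 million draws.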


5.5.4 Halton and Hammersley sequences 


It can be better to base uniform draws on deterministic sequences such as 
Halton and Hammersley sequences. These lead to draws with fairly even 
coverage over the domain of the sampling distribution. Then simulated 
probabilities vary less over observations relative to those calculated with 
random draws. This is similar to deterministic evaluation of an integral over 
a specified grid. 


Furthermore, the sequential draws are negatively correlated. This has the 
advantage of reducing the variance of simulation-based estimates because 
Var(X + Y) equals Var(X) + Var(Y) if X and Y are independent but is 
less than Var(X) + Var(Y) if X and Y are negatively correlated. 


The Halton sequence based on the prime number 2 is constructed as
follows. Divide the unit interval (0, 1) into two parts. The dividing point 1/2
becomes the first element of the Halton sequence. Next, divide each part into
two more parts. The dividing points, 1/4 and 3/4, become the next two
elements of the sequence. Divide each of the four parts into two parts each,
and continue to obtain the sequence {1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8, ...}.
For a p-dimensional Halton sequence, the starting bases are the first p
primes, specifically 2, 3, 5, 7, 11, ....
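
The construction just described is equivalent to writing the index i in base b, reversing its digits, and placing them after the decimal point. The Mata sketch below illustrates this; radinv() is a hypothetical helper name defined here, not a built-in Mata function.

```stata
* Sketch: radical-inverse construction of a Halton sequence
capture mata: mata drop radinv()
mata:
// radinv(i, b): element i of the Halton sequence with base b
real scalar radinv(real scalar i, real scalar b)
{
    real scalar f, r, k
    f = 1
    r = 0
    k = i
    while (k > 0) {
        f = f/b                 // Next digit weight: 1/b, 1/b^2, ...
        r = r + f*mod(k, b)     // Add the current base-b digit of i
        k = floor(k/b)
    }
    return(r)
}
h2 = J(8, 1, .)
for (i=1; i<=8; i++) h2[i] = radinv(i, 2)
h2'
end
```

For base 2, the first eight elements are 1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8, and 1/16; calling radinv(i, 3) gives the base-3 sequence 1/3, 2/3, 1/9, ....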


Halton sequences can be obtained using the Mata function halton(). 
The following example obtains sequences of length 200 for two variables 
and lists the first 8 values. 


. * Generate 2-dimensional Halton sequences of length 200
. clear

. qui set obs 200

. mata
------------------------------------------------- mata (type end to exit) -----
: h = halton(200, 2)

: st_addvar("float", ("halton1", "halton2"))
       1   2
    +---------+
  1 |  1   2  |
    +---------+

: st_store(., ("halton1", "halton2"), h[.,.])

: end

. list halton1 halton2 in 1/8, clean

       halton1    halton2
  1.        .5   .3333333
  2.       .25   .6666667
  3.       .75   .1111111
  4.      .125   .4444444
  5.      .625   .7777778
  6.      .375   .2222222
  7.      .875   .5555556
  8.     .0625   .8888889


The two sequences have means and standard deviations close to 0.5 and
$\sqrt{1/12} = 0.289$ for the uniform distribution. But the sequential draws
are highly negatively correlated, as is illustrated for halton1.


. * Summary statistics and serial correlation for Halton sequences
. summarize halton1 halton2

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     halton1 |        200    .4945898    .2888488   .0039063   .9921875
     halton2 |        200    .4945679    .2879916   .0041152   .9917696

. qui generate laghalt1 = halton1[_n-1]

. correlate halton1 laghalt1
(obs=199)

             |  halton1 laghalt1
-------------+------------------
     halton1 |   1.0000
    laghalt1 |  -0.7185   1.0000


The attraction of the Halton sequences is that they provide more even 
coverage of the sampling distribution than do the pseudorandom uniform 
draws. We have 


. * Compare deterministic Halton draws to pseudo-random uniform draws
. set seed 10101

. gen uniform1 = runiform()

. gen uniform2 = runiform()

. qui graph twoway (scatter halton1 halton2), title("Halton sequences")
>     saving(graph1.gph, replace)

. qui graph twoway (scatter uniform1 uniform2), title("Uniform draws")
>     saving(graph2.gph, replace)

. qui graph combine graph1.gph graph2.gph, ysize(2.5) xsize(6) iscale(1.5)


Figure 5.4 shows that there is more even coverage for the Halton sequences 
(panel 1) than for the draws obtained using the usual uniform generator 
(panel 2). The coverage for the Halton sequences becomes even more 
uniform as the length of the sequence increases. 

Figure 5.4. Halton sequence compared with uniform draws


Halton sequences are best used for low-dimension sequences, say, fewer
than 10 variables. A variation is the Hammersley sequence, which for
sequences of length n sets the first variable to
{1/(2n), 3/(2n), ..., (2n - 1)/(2n)} and sets the remaining variables to the
Halton sequence values already described.
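
A two-variable Hammersley set following this description can be assembled by hand in Mata. This is a sketch of the definition given above rather than a statement about any built-in option; the names n, h1, and hamm are chosen here for illustration.

```stata
* Sketch: 2-dimensional Hammersley set of length 200, built by hand
mata:
n = 200
h1 = (2:*(1::n) :- 1) :/ (2*n)     // {1/(2n), 3/(2n), ..., (2n-1)/(2n)}
hamm = h1, halton(n, 1)            // Second variable: base-2 Halton sequence
hamm[|1,1 \ 4,2|]                  // First four rows
end
```

The first column is an evenly spaced grid on (0, 1), so only the remaining columns rely on the Halton construction.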


5.6 Simulation for regression: Introduction 


The simplest use of simulation methods is to generate a single dataset and
estimate the DGP parameter $\theta$. Under some assumptions, if the estimated
parameter $\hat\theta$ differs from $\theta$ for a large sample size, the
estimator is probably inconsistent. We defer an example of this simpler
simulation to section 5.6.4.


More often, $\theta$ is estimated from each of S generated datasets, and the
estimates are stored and summarized to learn about the distribution of
$\hat\theta$ for a given DGP. For example, this approach is necessary if one
wants to check the validity of a standard error estimator or the
finite-sample size of a test. This approach requires the ability to perform
the same analysis S times and to store the results from each simulation. The
simplest approach is to write a Stata program for the analysis of one
simulation and then use simulate to run this program many times.


5.6.1 Simulation example: OLS with $\chi^2$ errors


In this section, we use simulation methods to investigate the finite-sample
properties of the OLS estimator with random regressors and skewed errors. If
the errors are i.i.d., the fact that they are skewed has no effect on the
large-sample properties of the OLS estimator. However, when the errors are
skewed, we will need a larger sample size for the asymptotic distribution to
better approximate the finite-sample distribution of the OLS estimator than
when the errors are normal. This example also highlights an important
modeling decision: when y is skewed, we sometimes choose to model
$E(\ln y|x)$ instead of $E(y|x)$ because we believe the disturbances enter
multiplicatively instead of additively. This choice is driven by the
multiplicative way the error affects the outcome and is independent of the
functional form of its distribution. As illustrated in this simulation, the
asymptotic theory for the OLS estimator works well when the errors are
i.i.d. from a skewed distribution.


We consider the following DGP,

$$y = \beta_1 + \beta_2 x + u;\quad u \sim \chi^2(1) - 1;\quad x \sim \chi^2(1)$$

where $\beta_1 = 1$, $\beta_2 = 2$, and the sample size N = 150. For this DGP,
the error u is independent of the regressor x (ensuring consistency of OLS)
and has a mean of 0, variance of 2, skewness of $\sqrt{8}$, and kurtosis of
15. By contrast, a normal error has a skewness of 0 and a kurtosis of 3.


We wish to perform 1,000 simulations, where in each simulation we
obtain parameter estimates, standard errors, t-values for the t test of
$H_0$: $\beta_2 = 2$, and the outcome of a two-sided test of $H_0$ at level
0.05. Inference throughout this example is based on default OLS standard
errors.


Two of the most frequently changed parameters in a simulation study are 
the sample size and the number of simulations. Thus, these two parameters 
are almost always stored in something that can easily be changed. We use 
global macros. In the output below, we store the number of observations in 
the global macro numobs and the number of repetitions in the global macro 
numsims. We use these global macros in the examples in this section. 


. * Define global macros for sample size and number of simulations 
. global numobs 150 // Sample size N 


. global numsims "1000" // Number of simulations 


We first write the chi2data program, which generates data on y,
performs OLS, and returns $\hat\beta_2$, its standard error
$\mathrm{se}(\hat\beta_2)$, $t_2 = (\hat\beta_2 - 2)/\mathrm{se}(\hat\beta_2)$, a
rejection indicator $r_2 = 1$ if $|t_2| > t_{0.025}(148)$, and the p-value for
the two-sided t test. The chi2data program is an r-class program, so these
results are returned in r() using the return command.


. * Program for finite-sample properties of OLS
. program chi2data, rclass
  1.     version 17
  2.     drop _all
  3.     set obs $numobs
  4.     generate double x = rchi2(1)
  5.     generate y = 1 + 2*x + rchi2(1)-1     // Demeaned chi^2 error
  6.     regress y x
  7.     return scalar b2 = _b[x]
  8.     return scalar se2 = _se[x]
  9.     return scalar t2 = (_b[x]-2)/_se[x]
 10.     return scalar r2 = abs(return(t2))>invttail($numobs-2,.025)
 11.     return scalar p2 = 2*ttail($numobs-2,abs(return(t2)))
 12. end


This code is used to produce all the statistics analyzed below, including
the t statistic plotted in figure 5.5. Monte Carlo studies of tests typically
consider test rejection rates and p-values, rather than the test statistic
itself. In that case, we can instead use the test command, which computes an
F statistic with the same p-value, and reject at 5% if p < 0.05. Subsequent
Monte Carlos in this section and, for example, section 11.7 use the test
command.


The following output illustrates that test and the manual calculations 
yield the same p-value. 


. * An F test gives same p-value as the manual t test in program chi2data
. set seed 1010

. qui chi2data

. return list

scalars:
                 r(p2) =  .3076722413709826
                 r(r2) =  0
                 r(t2) =  1.023647108363717
                r(se2) =  .0741344926219221
                 r(b2) =  2.075887559002442

. qui test x=2

. return list

scalars:
               r(drop) =  0
               r(df_r) =  148
                  r(F) =  1.0478534024614
                 r(df) =  1
                  r(p) =  .3076722413709826


We use simulate to call chi2data $numsims times and to store the
results; here $numsims = 1000. The current dataset is replaced by one with
the results from each simulation. These results can be displayed by using
summarize, where Obs in the output refers to the number of simulations and
not the sample size in each simulation.


. * Simulation for finite-sample properties of OLS
. simulate b2f=r(b2) se2f=r(se2) t2f=r(t2) reject2f=r(r2) p2f=r(p2),
>     seed(10101) reps($numsims) nolegend nodots: chi2data

. summarize b2f se2f reject2f

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         b2f |      1,000    2.003871    .0878981   1.788395   2.504547
        se2f |      1,000    .0841427    .0180844   .0388639   .1792164
    reject2f |      1,000         .05     .218054          0          1


The summarize output indicates that 1) the mean of the point estimates is 
very close to the true value of 2, 2) the standard deviation of the point 
estimates is close to the mean of the standard errors, and 3) the rejection rate 
of 0.05 coincides with the size of 0.05. The exact result for the rejection rate 
is coincidental—with a different seed, we do not get exactly 0.05. 


Further information on the distribution of the results can be obtained by 
using the mean, the summarize, detail, and the kdensity commands. 


5.6.2 Interpreting simulation output 


We consider in turn unbiasedness of $\hat\beta_2$, correctness of the standard
error formula for $\hat\beta_2$, distribution of the t statistic, and test
size.


Because interest lies in the averages over the simulations of the various 
statistics, we use the mean command. 


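. * Report results for simulation averages
. mean b2f se2f reject2f

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
         b2f |   2.003871   .0027796      1.998416    2.009325
        se2f |   .0841427   .0005719      .0830204    .0852649
    reject2f |        .05   .0068955      .0364687    .0635313
--------------------------------------------------------------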


Unbiasedness of estimator 


The average of $\hat\beta_2$ over the 1,000 estimates,
$\bar{\hat\beta}_2 = (1/1000) \sum_{s=1}^{1000} \hat\beta_2^{(s)}$, is the
simulation estimate of $E(\hat\beta_2)$. Here $\bar{\hat\beta}_2 = 2.004$ (see
the mean of b2f) is very close to the DGP value $\beta_2 = 2.0$, suggesting
that the estimator is unbiased.


However, this comparison should account for simulation error. From the
mean command, the simulation yields a 95% confidence interval for
$E(\hat\beta_2)$ of [1.998, 2.009]. This interval is quite narrow and includes
2.0, so we find no evidence that $\hat\beta_2$ is biased. Note, however, that
if $\hat\beta_2$ is unbiased, then in 5% of such simulation exercises, this 95%
confidence interval will not include the DGP value of $\beta_2$.


Many estimators, particularly nonlinear estimators, are biased in finite 
samples. Then exercises such as this can be used to estimate the magnitude 
of the bias in typical sample sizes. If the estimator is consistent, then any 
bias should disappear as the sample size N goes to infinity. 


Standard errors 


The variance of $\hat\beta_2$ over the 1,000 estimates,
$s^2_{\hat\beta_2} = (1/999) \sum_{s=1}^{1000} (\hat\beta_2^{(s)} - \bar{\hat\beta}_2)^2$,
is the simulation estimate of $\sigma^2_{\hat\beta_2} = \mathrm{Var}(\hat\beta_2)$,
the variance of $\hat\beta_2$. Here the average of the standard errors across
the simulations is $\overline{\mathrm{se}}(\hat\beta_2) = 0.084$ (see the mean
of se2f), with 95% confidence interval for the mean of
$\mathrm{se}(\hat\beta_2)$ of [0.083, 0.085]. Because this interval does not
include $s_{\hat\beta_2} = 0.088$ (see the standard deviation of b2f), there
is evidence that $\mathrm{se}(\hat\beta_2)$ is slightly biased for
$\sigma_{\hat\beta_2}$, so the asymptotic distribution is not perfectly
approximating the finite-sample distribution. Note that for the current DGP,
$\{\mathrm{se}(\hat\beta_2)\}^2$ is unbiased for $\sigma^2_{\hat\beta_2}$, but
this does not imply that upon taking the square root,
$\mathrm{se}(\hat\beta_2)$ is unbiased for $\sigma_{\hat\beta_2}$.


t statistic 


Because we impose looser restrictions on the DGP, t statistics are not exactly
t distributed, and z statistics are not exactly z distributed. However, the
extent to which they diverge from the reference distribution disappears as
the sample size increases. The output below generates the graph in
figure 5.5, which compares the density of the t statistics with the t(148)
distribution.

. * Plot the density of the t statistics and compare with theoretical t(148) 


. kdensity t2f, student(148) legend(off) scale(1.2) 
> title("Density of the {it:t} statistics versus {it:t}(148)") 


[Figure: kernel density estimate of the t statistics, r(t2), with the t(148)
density overlaid; title "Density of the t statistics versus t(148)";
kernel = epanechnikov, bandwidth = 0.2166]


Figure 5.5. t statistic density compared with theoretical t(148)


Though the graph highlights some differences between the finite-sample and
the asymptotic distributions, the divergence between the two does not appear
to be great. From output not given, centile t2f, centile(2.5 97.5)
yields values -1.732 and 2.041 compared with -1.976 and 1.976 for the
t(148) distribution.


Rather than focus on the distribution of the t statistics, we instead focus
on the size of tests or coverage of confidence intervals based on these
statistics.


Test size 


The size of the test is the probability of rejecting $H_0$ when $H_0$ is true.
Because the DGP sets $\beta_2 = 2$, we consider a two-sided test of
$H_0$: $\beta_2 = 2$ against $H_a$: $\beta_2 \neq 2$. The level or nominal size
of the test is set to 0.05, and the t test is used. The proportion of
simulations that leads to a rejection of $H_0$ is known as the rejection
rate, and this proportion is the simulation estimate of the true test size.


Here the estimated rejection rate is 0.050 (see the mean of reject2f).
The associated 95% confidence interval (from mean reject2f) is [0.036,
0.064], which is quite wide but includes 0.05. The width of this confidence
interval is partially a result of having run only 1,000 repetitions and partially
an indication that, with 150 observations, the true size of the test can differ
from the nominal size. When this simulation is rerun with 10,000 repetitions,
the estimated rejection rate is 0.048, and the confidence interval is [0.044,
0.052].


The simulation results also include the variable p2f, which stores the
p-values of each test. If the t(148) distribution is the correct distribution
for the t test, then p2f should be uniformly distributed on (0, 1). A
histogram, not shown, reveals this to be the case.


More simulations are needed to accurately measure test size (and power)
than are needed for bias and standard error calculations. For a test with
estimated size $\hat\alpha$ based on S simulations, a 95% confidence interval
for the true size is $\hat\alpha \pm 1.96 \times \sqrt{\hat\alpha(1 - \hat\alpha)/S}$.
For example, if $\hat\alpha = 0.06$ and S = 10000, then the 95% confidence
interval is [0.055, 0.065]. A more detailed Monte Carlo experiment for test
size and power is given in section 11.7.
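
The interval formula is simple to evaluate directly. The following is a minimal sketch with the values from the example; the scalar names a, S, and half are chosen here for illustration.

```stata
* Sketch: 95% simulation confidence interval for test size
scalar a = 0.06                        // Estimated size (rejection rate)
scalar S = 10000                       // Number of simulations
scalar half = 1.96*sqrt(a*(1-a)/S)     // Half-width of the interval
display "95% CI: [" %6.4f (a-half) ", " %6.4f (a+half) "]"
```

Setting a = 0.05 and S = 1000 instead gives approximately [0.036, 0.064], the interval reported by mean for reject2f.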


Number of simulations 


Ideally, 10,000 simulations or more would be run in reported results, but this 
can be computationally expensive. With only 1,000 simulations, there can be 
considerable simulation noise, especially for estimates of test size (and 
power). 


5.6.3 Variations 


The preceding code is easily adapted to other problems of interest. 


Different sample size and number of simulations 


Sample size can be changed by changing the global macro numobs. Many
simulation studies focus on finite-sample deviations from asymptotic theory.
For some estimators, most notably IV with weak instruments, such deviations
can occur even with samples of many thousands of observations.


Changing the global macro numsims can increase the number of 
simulations to yield more precise simulation results. 


Test power 


A type II error occurs if a test fails to reject $H_0$ when $H_0$ is false.
The power is one minus the probability of making this error, so
power = Pr(reject $H_0$ | $H_0$ false). The larger the difference between the
tested value and the true value, the greater the power and the rejection rate.
The example below modifies chi2data to estimate the power of a test
against the false null hypothesis that $\beta_2 = 2$; instead,
$\beta_2 = 2.1$.


. * Program for finite-sample properties of OLS: power
. program chi2datab, rclass
        version 17
        drop _all
        set obs $numobs
        generate double x = rchi2(1)
        generate y = 1 + 2.1*x + rchi2(1)-1    // Demeaned chi^2 error
        regress y x
        return scalar b2 = _b[x]
        return scalar se2 = _se[x]
        test x=2
        return scalar r2 = (r(p)<.05)
  end


Below, we use simulate to run the simulation 1,000 times, and then we 
summarize the results. 


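. * Power simulation for finite-sample properties of OLS
. simulate b2f=r(b2) se2f=r(se2) reject2f=r(r2),
>     seed(10101) reps($numsims) nolegend nodots: chi2datab

. mean b2f se2f reject2f

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
         b2f |   2.103871   .0027796      2.098416    2.109325
        se2f |   .0841427   .0005719      .0830204    .0852649
    reject2f |       .237   .0134541      .2105985    .2634015
--------------------------------------------------------------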


The sample mean of reject2f provides an estimate of the power. The 
estimated power is 0.237, which is not high. Increasing the sample size or 
the distance between the tested value and the true value will increase the 
power of the test. 
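
As a rough cross-check on the simulated power, one can compute the power implied by a normal approximation to the distribution of the estimator. This is only a sketch under that approximation: it takes 0.0878981, the simulation standard deviation of b2f reported in section 5.6.1, as the standard deviation of the estimator, and the scalar names are chosen here for illustration.

```stata
* Sketch: normal-approximation power of the two-sided 5% test of beta2 = 2
scalar delta = (2.1 - 2)/0.0878981     // True minus tested value, in s.d. units
scalar crit = invnormal(0.975)         // Two-sided 5% critical value
scalar power_approx = normal(-crit + delta) + normal(-crit - delta)
display "Approximate power: " %5.3f power_approx
```

The approximation gives a power of roughly 0.2, the same order as the simulated 0.237; exact agreement is not expected because the standard error is estimated in each sample and the errors are skewed.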


A useful way to incorporate power estimation is to define the
hypothesized value of $\beta_2$ to be an argument of the program chi2datab.
This is demonstrated in the more detailed Monte Carlo experiment in
section 11.7.


Different error distributions 


We can investigate the effect of using other error distributions by changing
the distribution used in chi2data. For linear regression, the t statistic
becomes closer to t distributed as the error distribution becomes closer to
i.i.d. normal. For nonlinear models, the exact finite-sample distribution of
estimators and test statistics is unknown even if the errors are i.i.d. normal.


The example in section 5.6.2 used draws of both regressors and errors
that differed in each simulation. This corresponds to simple random
sampling where we jointly sample the pair (y, x), especially relevant to
survey data where individuals are sampled, and we use data (y, x) for the
sampled individuals. An alternative approach is that of fixed regressors in
repeated trials, especially relevant to designed experiments. Then we draw a
sample of x only once, and we use the same sample of x in each simulation
while redrawing only the error u (and hence y). In that case, we create
fixedx.dta, which has 150 observations on a variable, x, that is drawn from
the $\chi^2(1)$ distribution, and we replace lines 2-4 of chi2data by typing
use fixedx, clear.
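
The fixed-regressor variant can be sketched as follows. The program name chi2datafx and the choice of 1,000 repetitions are assumptions made for illustration; the file fixedx.dta is written to the current working directory.

```stata
* Sketch: fixed regressors in repeated trials
clear
set seed 10101
set obs 150
generate double x = rchi2(1)
save fixedx, replace                        // x is drawn once and then reused

capture program drop chi2datafx
program chi2datafx, rclass
    version 17
    use fixedx, clear                       // Replaces lines 2-4 of chi2data
    generate y = 1 + 2*x + rchi2(1)-1       // Only the error is redrawn
    regress y x
    return scalar b2 = _b[x]
    return scalar se2 = _se[x]
end

simulate b2f=r(b2) se2f=r(se2), seed(10101) reps(1000) nodots: chi2datafx
summarize b2f se2f
```

With x held fixed, the variation in b2f across simulations comes only from the redrawn errors, so se2f targets the variance of the estimator conditional on this particular sample of x.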
5.6.4 Estimator consistency or inconsistency 


Establishing estimator consistency or inconsistency requires less coding
because we need to generate data and obtain estimates only once, with a
large N, and then compare the estimates with the DGP values.


For the preceding DGP and estimator with N = 10000, we obtain


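. * Consistency of OLS in preceding simulation setup
. clear

. qui set obs 10000

. set seed 10101

. generate double x = rchi2(1)

. generate y = 1 + 2*x + rchi2(1)-1     // Demeaned chi^2 error

. regress y x, noheader

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |   2.003907   .0099924   200.54   0.000      1.98432    2.023494
       _cons |   .9716907   .0169879    57.20   0.000      .938391     1.00499
------------------------------------------------------------------------------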


The intercept and slope coefficients are very close to their DGP values of 1.0 
and 2.0. As expected, the OLS estimator appears to be consistent. 


Next we consider a classical errors-in-variables model of measurement
error. Here both the inconsistency of the OLS estimator and its magnitude
are known, so we have a benchmark for comparison.


The DGP considered is 


$$y = \beta x^* + u;\quad x^* \sim N(0, 9);\quad u \sim N(0, 1)$$
$$x = x^* + v;\quad v \sim N(0, 1)$$


OLS regression of y on $x^*$ consistently estimates $\beta$. However, only data
on x rather than $x^*$ are available, so we instead obtain $\hat\beta$ from an
OLS regression of y on x.


It is a well-known result that then $\hat\beta$ is inconsistent, with a
downward bias, $s\beta$, where $s = \sigma_v^2/(\sigma_v^2 + \sigma_{x^*}^2)$ is
the noise-signal ratio; see Cameron and Trivedi (2005, 903). For the DGP under
consideration, this ratio is 1/(1 + 9) = 0.1, so
plim $\hat\beta = \beta - s\beta = 1 - 0.1 \times 1 = 0.9$.


The following simulation checks this theoretical prediction, with sample
size set to 10,000. We use drawnorm to jointly draw $(x^*, u, v)$, though we
could have more simply made three separate normal draws. We set $\beta = 1$.


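. * Inconsistency of OLS in errors-in-variables model (measurement error)
. clear

. qui set obs 10000

. set seed 10101

. matrix mu = (0,0,0)

. matrix sigmasq = (9,0,0\0,1,0\0,0,1)

. drawnorm xstar u v, means(mu) cov(sigmasq)

. generate y = 1*xstar + u     // DGP for y depends on xstar

. generate x = xstar + v       // x is mismeasured xstar

. regress y x, noconstant noheader

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |    .899366   .0043202   208.17   0.000     .8908974    .9078346
------------------------------------------------------------------------------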


The OLS estimate is very precisely estimated, given the large sample size. 
The estimate of 0.8994 clearly differs from the DGP value of 1.0, so OLS is 
inconsistent. Furthermore, the simulation estimate essentially equals the 
theoretical value of 0.9. 
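
The value 0.9 can be traced to the second moments of the DGP. The following short derivation is a sketch that relies only on the zero means and mutual independence of $x^*$, $u$, and $v$ assumed above, for the regression without an intercept:

```latex
\begin{align*}
\mathrm{plim}\,\widehat{\beta}
  &= \frac{E(xy)}{E(x^2)}
   = \frac{E\{(x^* + v)(\beta x^* + u)\}}{E\{(x^* + v)^2\}} \\
  &= \beta\,\frac{\sigma_{x^*}^2}{\sigma_{x^*}^2 + \sigma_v^2}
   = \beta(1 - s)
   = 1 \times \frac{9}{9 + 1} = 0.9
\end{align*}
```

The cross terms $E(x^* u)$, $E(x^* v)$, and $E(uv)$ drop out because of independence, which is why only the two variances appear in the final ratio.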


5.6.5 Simulation with endogenous regressors 


Endogeneity is one of the most frequent causes of estimator inconsistency. A 
simple method to generate an endogenous regressor is to first generate the 
error u and then generate the regressor x to be the sum of a multiple of u and 
an independent component. 


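We adapt the previous DGP as follows: 

$$y = \beta_1 + \beta_2 x + u; \quad u \sim N(0, 1)$$
$$x = z + u + v; \quad z \sim N(0, 1); \quad v \sim N(0, 1)$$


We set $\beta_1 = 10$ and $\beta_2 = 2$. For this DGP, $\mathrm{Cov}(x, u) = 1$ 
and $\mathrm{Var}(x) = 3$, so the correlation between $x$ and $u$ equals 
$1/\sqrt{3} \approx 0.58$. We let $N = 150$. 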


The following program generates the data and estimates: 


. * Program for OLS with endogenous regressor 
. clear 

. program endogreg, rclass 
        version 17 
        drop _all 
        set obs $numobs 
        generate u = rnormal(0) 
        generate z = rnormal(0) 
        generate v = rnormal(0) 
        generate x = z + u + v          // Endogenous regressor 
        generate y = 10 + 2*x + u 
        regress y x 
        return scalar b2 = _b[x] 
        return scalar se2 = _se[x] 
        test x=2                        // Test of H0: beta2 = 2 
        return scalar r2 = (r(p)<.05) 
  end 


We run the simulations and summarize the results. 


. * Simulation for OLS with endogenous regressor 
. simulate b2r=r(b2) se2r=r(se2) reject2r=r(r2), 
>     seed(10101) reps($numsims) nolegend nodots: endogreg 

. mean b2r se2r reject2r 

Mean estimation                          Number of obs = 1,000

             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
         b2r |   2.334661   .0012874      2.332134    2.337187
        se2r |   .0386734   .0000976       .038482    .0388649
    reject2r |          1          0


These 1,000 repetitions indicate that for $N = 150$, the OLS estimator is 
biased by about 17%, the reported standard error is about 5% too small, 
$0.03867/(0.001287 \times \sqrt{1000}) = 0.9502$, and we always reject the 
true null hypothesis that $\beta_2 = 2$. 


By setting N large, we could also show that the OLS estimator is 
inconsistent with a single repetition. As a variation, we could instead 
estimate by IV, with z a valid instrument for x given this DGP, and verify that 
the IV estimator is consistent. 
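
The variation just described might be sketched as follows. This is an illustrative example rather than code from the text: the sample size of 100,000 is an arbitrary choice, and the probability limit quoted in the comment follows from $\mathrm{Cov}(x,u)/\mathrm{Var}(x) = 1/3$ under the DGP with $x = z + u + v$.

```stata
* Sketch: one large-N draw from the endogenous-regressor DGP
clear
set seed 10101
quietly set obs 100000               // Large N (arbitrary choice)
generate double u = rnormal(0)
generate double z = rnormal(0)
generate double v = rnormal(0)
generate double x = z + u + v        // Endogenous: x depends on u
generate double y = 10 + 2*x + u
regress y x                          // OLS slope should be near 2 + 1/3
ivregress 2sls y (x = z)             // IV slope should be near 2
```

Here z qualifies as an instrument because it enters x but is generated independently of u.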


5.7 Additional resources 


The key reference for random-number functions is help random number 
functions. This covers most of the generators illustrated in this chapter and 
several other standard ones that have not been used. Note, however, that the 
rnbinomial(n, p) function for making draws from the negative binomial 
distribution has a different parameterization from that used in this book. 
The key Stata commands for simulation are simulate and postfile (see 
[R] simulate and [P] postfile). The simulate prefix requires first collecting 
commands into a program; see [P] program. 


A standard book that presents algorithms for random-number generation 
is Press et al. (2007). Cameron and Trivedi (2005) discuss random-number 
generation and present a Monte Carlo study. Simulations are used 
throughout the current book; see especially the chapters on Bayesian 
methods, testing methods, and bootstrap methods. 


5.8 Exercises 


1. Give the commands set obs 100 and generate u=runiform() and 
   summarize. Then, give command clear and repeat the initial 
   commands. Did you get the same result? If not, explain why, and adapt 
   your code to get the same results. 


2. Using the normal generator, generate a random draw from a 50-50 
   scale mixture of $N(1, 1)$ and $N(1, 3^2)$ distributions. Repeat the 
   exercise with the $N(1, 3^2)$ component replaced by $N(3, 1)$. For both 
   cases, display the features of the generated data by using a kernel 
   density plot. 


3. Generate 1,000 observations from the $F(5, 10)$ distribution. Use 
   rchi2() to obtain draws from the $\chi^2(5)$ and the $\chi^2(10)$ 
   distributions. Compare the sample moments with their theoretical 
   counterparts. 


4. Make 1,000 draws from the $N(6, 2^2)$ distribution by making draws 
   $Z$ from $N(0, 1)$ and then making the transformation 
   $Y = \mu + \sigma Z$. 


5. Generate 1,000 draws from the $t(6)$ distribution, which has a mean of 
   0 and a variance of $6/4$. Compare your results with those from 
   exercise 4. 


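6. Generate a large sample from the $N(\mu = 1, \sigma^2 = 1)$ distribution 
   and estimate $\sigma/\mu$, the coefficient of variation (CV). Verify that 
   the sample estimate is a consistent estimate. 


7. Generate a draw from a multivariate normal distribution, 
   $N(\boldsymbol{\mu}, \boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}')$, with 
   $\boldsymbol{\mu}' = [0 \;\; 0 \;\; 0]$ and 

   $$\mathbf{L} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & \sqrt{3} & 0 \\ 0 & \sqrt{3} & \sqrt{6} \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 4 & 3 \\ 0 & 3 & 9 \end{bmatrix}$$

   using transformations based on this Cholesky decomposition. Compare 
   your results with those based on using the drawnorm command. 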


8. Let $s$ denote the sample estimate of $\sigma$ and $\bar{x}$ denote the 
   sample estimate of $\mu$. The CV $\sigma/\mu$, which is the ratio of the 
   standard deviation to the mean, is a dimensionless measure of dispersion. 
   The asymptotic distribution of the sample CV $s/\bar{x}$ is 
   $N[\sigma/\mu, (N - 2)^{-1/2}(\sigma/\mu)^2\{0.5 + (\sigma/\mu)^2\}]$; see 
   Miller (1991). For $N = 25$, using either simulate or postfile, compare 
   the Monte Carlo and asymptotic variance of the sample CV with the 
   following specification of the DGP: $y \sim N(\mu, \sigma^2)$ with three 
   different values of CV = 0.1, 0.33, and 0.67. 


9. It is suspected that making draws from the truncated normal using the 
   method given in section 5.4.4 may not work well when sampling from 
   the extreme tails of the normal. Using different truncation points, 
   check this suggestion. 


10. Repeat the example of section 5.6.1 (OLS with $\chi^2$ errors), now using 
    the postfile command. Use postfile to save the estimated slope 
    coefficient, standard error, the $t$ statistic for $H_0$: $\beta_2 = 2$, 
    and an indicator for whether $H_0$ is rejected at the 0.05 level in a 
    Stata file named simresults. The template program is as follows: 


* Postfile and post example: repeat OLS with chi-squared errors example 
clear 
set seed 10101 
program simbypost 
    version 17 
    tempname simfile 
    postfile `simfile' b2 se2 t2 reject2 p2 using simresults, replace 
    quietly { 
        forvalues i = 1/1000 { 
            drop _all 
            set obs 150 
            generate x = rchi2(1) 
            generate y = 1 + 2*x + rchi2(1) - 1   // demeaned chi^2 error 
            regress y x 
            scalar b2 = _b[x] 
            scalar se2 = _se[x] 
            scalar t2 = (_b[x]-2)/_se[x] 
            scalar reject2 = abs(t2) > invttail(1000-2,.025) 
            scalar p2 = 2*ttail(1000-2,abs(t2)) 
            post `simfile' (b2) (se2) (t2) (reject2) (p2) 
        } 
    } 
    postclose `simfile' 
end 
simbypost 
use simresults, clear 
summarize 


Chapter 6 
Linear regression with correlated errors 


6.1 Introduction 


Chapter 3 presented ordinary least-squares (OLS) estimation with inference 
based on either heteroskedasticity-robust or cluster-robust standard errors. 
Now we go further and introduce feasible generalized least-squares (FGLS) 
estimation based on a richer specification of the model for the error. This 
can lead to more efficient estimation than OLS estimation, leading to smaller 
standard errors, narrower confidence intervals, and larger t statistics. 


Much of the chapter focuses on the case of model errors that are 
clustered, with correlation for observations in the same cluster and 
independence for observations in different clusters. This is a common 
complication in microeconometric studies with cross-sectional data, the 
focus of this chapter, and with panel data, presented in chapter 8. Failure to 
appropriately control for clustered errors can lead to very erroneous 
statistical inference. 


There are several ways to proceed when model errors are clustered. A 
common approach is to simply use the OLS estimator and forgo any of the 
potential efficiency gains of alternative estimators. The random-effects (RE) 
estimator introduces a purely random cluster-specific effect and estimates 
by FGLS. The fixed-effects (FE) estimator also introduces cluster-specific 
effects but relaxes the assumption that these are purely random. Mixed 
models or hierarchical models permit richer within-cluster correlation 
structures than the RE model. For all of these approaches, one can obtain 
cluster-robust standard errors that guard against misspecification of the 
model for error correlation. The biggest distinction is that the FE estimator 
enables consistent parameter estimation under assumptions weaker than 
those needed for consistency of OLS and RE estimators. 


The chapter begins with a review of FGLS and application to cross- 
sectional data with heteroskedastic errors. Four sections of the chapter then 
detail the different models and inference methods available when model 
errors are clustered. Even though we consider a cross-sectional data 
example, the RE and FE estimators are most easily estimated using the xt 
commands developed for panel data. The chapter then considers 
multiequation seemingly unrelated regressions (SUR), a different example of 
correlated errors. 


Other linear model examples of FGLS include the population-averaged 
estimator and the RE estimator for panel data (sections 8.4 and 8.7) and the 
three-stage least-squares estimator for simultaneous-equations systems 
(section 7.10). Many of the methods carry over to nonlinear regression 
models, where efficient estimation is by ML rather than FGLS and stronger 
distributional assumptions may be required for estimator consistency. 
Extension to systems of nonlinear equations is given in section 18.11.2. 


This chapter concludes with a stand-alone presentation of a quite 
distinct topic: survey estimation methods that explicitly control for the three 
complications of data from complex surveys—sampling that is weighted, 
clustered, and stratified. 


6.2 Generalized least-squares and FGLS regression 


We provide an overview of theory for generalized least-squares (GLS) and 
FGLS estimation. 


6.2.1 GLS for heteroskedastic errors 


A simple example is the single-equation linear regression model with 
heteroskedastic independent errors, where a specific model for 
heteroskedasticity is given. Specifically, 


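$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i, \quad i = 1, \ldots, N \tag{6.1}$$
$$u_i = \sigma(\mathbf{z}_i)\varepsilon_i$$


where $\varepsilon_i$ satisfies 
$E(\varepsilon_i|\mathbf{x}_i, \mathbf{z}_i) = 0$, 
$E(\varepsilon_i\varepsilon_j|\mathbf{x}_i, \mathbf{z}_i, \mathbf{x}_j, \mathbf{z}_j) = 0$, 
$i \neq j$, and $E(\varepsilon_i^2|\mathbf{x}_i, \mathbf{z}_i) = 1$. The 
function $\sigma(\mathbf{z}_i)$, called a skedasticity function, is a 
specified scalar-valued function of the observable variables $\mathbf{z}_i$. 
The special case of homoskedastic regression arises if 
$\sigma(\mathbf{z}_i) = \sigma$, a constant. The elements of the vectors 
$\mathbf{z}$ and $\mathbf{x}$ may or may not overlap. 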


Under these assumptions, the errors $u_i$ in (6.1) have zero mean and are 
uncorrelated but are heteroskedastic with variance $\sigma^2(\mathbf{z}_i)$. 
Then OLS estimation of (6.1) yields consistent estimates, but more efficient 
estimation is possible if we instead estimate by OLS a transformed model that 
has homoskedastic errors. Transforming the model by multiplying by 
$w_i = 1/\sigma(\mathbf{z}_i)$ yields the homoskedastic regression 


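$$\frac{y_i}{\sigma(\mathbf{z}_i)} = \left\{\frac{\mathbf{x}_i}{\sigma(\mathbf{z}_i)}\right\}'\boldsymbol{\beta} + \varepsilon_i \tag{6.2}$$


because $u_i/\sigma(\mathbf{z}_i) = \{\sigma(\mathbf{z}_i)\varepsilon_i\}/\sigma(\mathbf{z}_i) = \varepsilon_i$ 
and $\varepsilon_i$ is homoskedastic. The GLS estimator is the OLS estimator 
of this transformed model. This regression can also be interpreted as a 
weighted linear regression of $y_i$ on $\mathbf{x}_i$ with the weight 
$w_i = 1/\sigma(\mathbf{z}_i)$ assigned to the $i$th observation. In 
practice, $\sigma(\mathbf{z}_i)$ may depend on unknown parameters, leading to 
the FGLS estimator, which uses the estimated weights 
$1/\widehat{\sigma}(\mathbf{z}_i)$ as explained below. 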


6.2.2 GLS and FGLS 


More generally, we begin with the linear model in matrix notation: 

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$$


By the Gauss—Markov theorem, the OLS estimator is efficient among linear 
unbiased estimators if the linear regression model errors are zero-mean 
independent and homoskedastic. 


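We suppose instead that $E(\mathbf{u}\mathbf{u}'|\mathbf{X}) = \Omega$, where 
$\Omega \neq \sigma^2\mathbf{I}$ for a variety of reasons that may include 
heteroskedasticity or clustering, or both. Then the efficient GLS estimator 
is obtained by OLS estimation of the transformed model 


$$\Omega^{-1/2}\mathbf{y} = \Omega^{-1/2}\mathbf{X}\boldsymbol{\beta} + \Omega^{-1/2}\mathbf{u}$$


where $\Omega^{-1/2}\Omega\,\Omega^{-1/2\prime} = \mathbf{I}$ so that the 
transformed error $\Omega^{-1/2}\mathbf{u} \sim [\mathbf{0}, \mathbf{I}]$ is 
independent and homoskedastic. In the heteroskedastic case, 
$\Omega = \mathrm{Diag}\{\sigma^2(\mathbf{z}_i)\}$, so 
$\Omega^{-1/2} = \mathrm{Diag}\{1/\sigma(\mathbf{z}_i)\}$. 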


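In practice, $\Omega$ is not known. Instead, we specify an error variance 
matrix model, $\Omega = \Omega(\boldsymbol{\gamma})$, that depends on a 
finite-dimensional parameter vector $\boldsymbol{\gamma}$ and, possibly, 
data. Given a consistent estimate $\widehat{\boldsymbol{\gamma}}$ of 
$\boldsymbol{\gamma}$, we form 
$\widehat{\Omega} = \Omega(\widehat{\boldsymbol{\gamma}})$. Different 
situations correspond to different models for $\Omega(\boldsymbol{\gamma})$ 
and different estimates $\widehat{\Omega}$. The FGLS estimator is the OLS 
estimator from the regression of $\widehat{\Omega}^{-1/2}\mathbf{y}$ on 
$\widehat{\Omega}^{-1/2}\mathbf{X}$ and equals 


$$\widehat{\boldsymbol{\beta}}_{\mathrm{FGLS}} = (\mathbf{X}'\widehat{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\widehat{\Omega}^{-1}\mathbf{y}$$


Under the assumption that $\Omega(\boldsymbol{\gamma})$ is correctly 
specified, the asymptotic variance-covariance matrix of the estimator (VCE) 
of $\widehat{\boldsymbol{\beta}}_{\mathrm{FGLS}}$ is 
$(\mathbf{X}'\Omega^{-1}\mathbf{X})^{-1}$; this is a two-step estimator but 
one for which the first-step estimation (of $\Omega$ by $\widehat{\Omega}$) 
makes no difference to the asymptotic distribution of the second-step 
estimator ($\widehat{\boldsymbol{\beta}}_{\mathrm{FGLS}}$). 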
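
To connect the matrix formula with the weighted regression of section 6.2.1, the following sketch specializes the FGLS estimator to the heteroskedastic case. It assumes a diagonal $\widehat{\Omega}$ with $i$th diagonal entry $\widehat{\sigma}_i^2 = \sigma^2(\mathbf{z}_i, \widehat{\boldsymbol{\gamma}})$; the shorthand $\widehat{\sigma}_i^2$ is introduced here for illustration.

```latex
\begin{align*}
\widehat{\boldsymbol{\beta}}_{\mathrm{FGLS}}
  &= (\mathbf{X}'\widehat{\Omega}^{-1}\mathbf{X})^{-1}
     \mathbf{X}'\widehat{\Omega}^{-1}\mathbf{y},
     \qquad \widehat{\Omega} = \mathrm{Diag}(\widehat{\sigma}_i^2) \\
  &= \left( \sum_{i=1}^{N} \frac{\mathbf{x}_i\mathbf{x}_i'}{\widehat{\sigma}_i^2} \right)^{-1}
     \sum_{i=1}^{N} \frac{\mathbf{x}_i y_i}{\widehat{\sigma}_i^2}
\end{align*}
```

This is the OLS estimator from regressing $y_i/\widehat{\sigma}_i$ on $\mathbf{x}_i/\widehat{\sigma}_i$, so observations with a larger estimated error variance receive less weight.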


6.2.3 Robust standard errors for FGLS 


The FGLS estimator requires specification of a model, 
$\Omega(\boldsymbol{\gamma})$, for the error variance matrix. Usually, it is 
clear what general complication is likely to be present. For example, 
heteroskedastic errors are likely with cross-sectional data, but it is not 
clear what specific model for that complication is appropriate. If the model 
for $\Omega(\boldsymbol{\gamma})$ is misspecified, then FGLS is still 
consistent, though it is no longer efficient. More importantly, the usual VCE 
of $\widehat{\boldsymbol{\beta}}_{\mathrm{FGLS}}$ will be incorrect. Instead, 
a robust estimator of the VCE should be used. 


We therefore distinguish between the true error variance matrix, 
$E(\mathbf{u}\mathbf{u}'|\mathbf{X})$, and the specified model for the error 
variance, denoted by $\Omega = \Omega(\boldsymbol{\gamma})$. In the 
statistics literature, especially that for generalized linear models, 
$\Omega(\boldsymbol{\gamma})$ is called a working variance matrix. Form the 
estimate $\widehat{\Omega} = \Omega(\widehat{\boldsymbol{\gamma}})$, where 
$\widehat{\boldsymbol{\gamma}}$ is an estimate of $\boldsymbol{\gamma}$. 
Then, do FGLS with the weighting matrix $\widehat{\Omega}^{-1}$ but obtain a 
robust estimate of the VCE. This estimator is also called a weighted 
least-squares (WLS) estimator to indicate that we no longer maintain that 
$\Omega(\boldsymbol{\gamma}) = E(\mathbf{u}\mathbf{u}'|\mathbf{X})$. 


Table 6.1 presents formulas for the default, heteroskedastic-robust, and 
cluster-robust VCE of the OLS and FGLS estimators. The heteroskedastic-robust 
VCE of the FGLS estimator requires that $\widehat{\Omega}$ be diagonal and 
$N \to \infty$. The cluster-robust VCE of the FGLS estimator requires that 
$\widehat{\Omega} = \mathrm{Diag}(\widehat{\Omega}_g)$ be block-diagonal and 
$G \to \infty$. 


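Table 6.1. OLS and FGLS estimators and their estimated variance 


Estimator   Estimator definition and estimated asymptotic variance 

OLS         $\widehat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ 
            $\widehat{V}_{\mathrm{default}}(\widehat{\boldsymbol{\beta}}) = s^2(\mathbf{X}'\mathbf{X})^{-1}$ 
            $\widehat{V}_{\mathrm{het}}(\widehat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-1}\left(b\sum_i \widehat{u}_i^2\mathbf{x}_i\mathbf{x}_i'\right)(\mathbf{X}'\mathbf{X})^{-1}$ 
            $\widehat{V}_{\mathrm{cluster}}(\widehat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-1}\left(c\sum_g \mathbf{X}_g'\widehat{\mathbf{u}}_g\widehat{\mathbf{u}}_g'\mathbf{X}_g\right)(\mathbf{X}'\mathbf{X})^{-1}$ 

FGLS        $\widehat{\boldsymbol{\beta}} = (\mathbf{X}'\widehat{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\widehat{\Omega}^{-1}\mathbf{y}$ 
            $\widehat{V}_{\mathrm{default}}(\widehat{\boldsymbol{\beta}}) = (\mathbf{X}'\widehat{\Omega}^{-1}\mathbf{X})^{-1}$ 
            $\widehat{V}_{\mathrm{het}}(\widehat{\boldsymbol{\beta}}) = (\mathbf{X}'\widehat{\Omega}^{-1}\mathbf{X})^{-1}\left(b\sum_i \widehat{u}_i^2\mathbf{x}_i\mathbf{x}_i'/\widehat{\omega}_{ii}^2\right)(\mathbf{X}'\widehat{\Omega}^{-1}\mathbf{X})^{-1}$ 
            $\widehat{V}_{\mathrm{cluster}}(\widehat{\boldsymbol{\beta}}) = (\mathbf{X}'\widehat{\Omega}^{-1}\mathbf{X})^{-1}\left(c\sum_g \mathbf{X}_g'\widehat{\Omega}_g^{-1}\widehat{\mathbf{u}}_g\widehat{\mathbf{u}}_g'\widehat{\Omega}_g^{-1}\mathbf{X}_g\right)(\mathbf{X}'\widehat{\Omega}^{-1}\mathbf{X})^{-1}$ 

Notes: The cluster-robust estimate for FGLS assumes that 
$\widehat{\Omega} = \mathrm{Diag}(\widehat{\Omega}_g)$, where $g = 1, \ldots, G$ 
defines the clusters. The degrees-of-freedom corrections are 
$b = N/(N - K)$ and $c = \{N/(N - K)\}\{G/(G - 1)\}$, where $K$ is the 
number of regression parameters and $G$ is the number of clusters. 
$\widehat{\omega}_{ii}$ denotes the $i$th diagonal entry in 
$\widehat{\Omega}$. 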


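Consider the case of clustered errors. The OLS estimator can be expressed 
as $\widehat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}$, 
given $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$. The OLS 
estimator therefore has asymptotic variance 
$(\mathbf{X}'\mathbf{X})^{-1}E(\mathbf{X}'\mathbf{u}\mathbf{u}'\mathbf{X})(\mathbf{X}'\mathbf{X})^{-1}$ 
given $E(\mathbf{u}|\mathbf{X}) = \mathbf{0}$. For the $g$th cluster 
$\mathbf{y}_g = \mathbf{X}_g\boldsymbol{\beta} + \mathbf{u}_g$, we have 
$E(\mathbf{X}'\mathbf{u}\mathbf{u}'\mathbf{X}) = \sum_g\sum_h E(\mathbf{X}_g'\mathbf{u}_g\mathbf{u}_h'\mathbf{X}_h)$. 
Given error independence across clusters, the double sum reduces to 
$\sum_g E(\mathbf{X}_g'\mathbf{u}_g\mathbf{u}_g'\mathbf{X}_g)$, which can be 
estimated by $\sum_g \mathbf{X}_g'\widehat{\mathbf{u}}_g\widehat{\mathbf{u}}_g'\mathbf{X}_g$, 
where $\widehat{\mathbf{u}}_g$ are the OLS residuals and we assume 
$G \to \infty$. This yields the cluster-robust VCE given in the fourth row of 
table 6.1. Heteroskedasticity is just the special case where there is one 
observation per cluster. 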


A similar argument leads to the cluster-robust VCE for FGLS, where we 
need to restrict attention to FGLS estimators with 
$\widehat{\Omega} = \mathrm{Diag}(\widehat{\Omega}_g)$ so that FGLS 
estimation preserves independence across clusters and 
$\widehat{\mathbf{u}}_g$ are the FGLS residuals. 


6.2.4 Limitations of cluster-robust inference 


The practitioner should be aware of several limitations of cluster-robust 
inference: 1) it is not always clear whether there is a need to cluster; 
2) theory assumes that there are many clusters; and 3) the cluster-robust VCE 
can be rank deficient. 


There is no standard formal test for whether there is a need to obtain 
cluster-robust standard errors, rather than heteroskedastic-robust standard 
errors. Instead, one simply uses the vce(cluster clustvar) option if we are 
in a setting where there is likely to be intracluster correlation of both the 
regressor and the error. If there is a meaningful difference between the 
heteroskedastic-robust standard errors and cluster-robust standard errors, 
then cluster-robust standard errors should be used following OLS estimation, 
and efficiency gains may be possible by GLS estimation. 
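
A comparison of the two sets of standard errors might look like the following sketch. The data are simulated purely for illustration; the DGP (50 clusters of 20 observations, a cluster-level error component a, and a cluster-level regressor component xg) and all variable names are assumptions introduced here, not taken from the text.

```stata
* Sketch: heteroskedastic-robust versus cluster-robust standard errors
clear
set seed 10101
quietly set obs 50                   // 50 clusters (hypothetical)
generate id = _n                     // Cluster identifier
generate double a = rnormal(0)       // Cluster-specific error component
generate double xg = rnormal(0)      // Cluster-level regressor component
expand 20                            // 20 observations per cluster
generate double x = xg + rnormal(0)
generate double y = 1 + 2*x + a + rnormal(0)
regress y x, vce(robust)
regress y x, vce(cluster id)
```

Because both the regressor and the error are correlated within cluster in this DGP, the cluster-robust standard error of x would be expected to exceed the heteroskedastic-robust one.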


Cluster-robust standard errors are based on the assumption that the 
number of clusters G is large. If the number of clusters is small, then 
cluster-robust inference leads to underestimation of standard errors, 
confidence intervals that are too narrow, and hypothesis tests that overreject. 
Studies suggest that G > 30 is desirable and that even larger G may be 
needed if the cluster sizes are unbalanced. When G is small, one should at 
the very least base inference on critical values from the t(G - 1), rather than 
the standard normal distribution. The regress command uses the t(G - 1) 
distribution, but many other commands, including xtreg, instead use 
standard normal critical values. Improved inference with few clusters is an 
active area of research; see section 6.4.6 for further discussion and the wild 
cluster bootstrap in section 12.6. 


Finally, the cluster-robust VCE has rank at most the minimum of K (the 
dimension of $\boldsymbol{\beta}$) and G - 1. Thus, if K > G - 1, it is 
possible to test at most G - 1 restrictions. In that case, the usual test of 
joint statistical significance of all regressors is not possible, and 
computer output will report the joint test statistic as missing. This is not 
a cause for concern. Tests of fewer restrictions, including the usual tests 
of significance, and confidence intervals remain valid. 


6.2.5 Leading examples of FGLS 


The GLS framework is relevant whenever $\Omega \neq \sigma^2\mathbf{I}$. We summarize several 
leading cases. 


Heteroskedastic errors have already been discussed at some length and 
can arise in many different ways. In particular, they may reflect specification 
errors associated with the functional form of the model. Examples include 
neglected random or systematic parameter variation, incorrect functional 
form of the conditional mean, incorrect scaling of the variables in the 
regression, and incorrect distributional assumptions regarding the dependent 
variables. A proper treatment of the problem of heteroskedasticity may 
therefore require analysis of the functional form of the regression. For 
example, in chapter 3, a log-linear model was found to be more appropriate 
than a linear model. 


Another common example is that of clustered (or grouped) errors, with 
errors being correlated within clusters but uncorrelated between clusters. A 
cluster consists of a group of observations that share some geographical, 
social, or economic trait that induces within-cluster dependence between 
observations, even after controlling for sources of observable differences. 
Such dependence can also be induced by other latent factors such as shared 
social norms, habits, or influence of a common local environment. In this 
case, $\Omega$ can be partitioned by cluster. If all observations can be 
partitioned into G mutually exclusive and exhaustive groups, then $\Omega$ 
can be partitioned into a block-diagonal matrix with G submatrices, 
simplifying the necessary inversion of the potentially very large 
$N \times N$ matrix $\Omega$. The cluster-robust VCE can be obtained because 
the partitioning of errors is preserved after inversion of the 
block-diagonal matrix $\Omega$. A leading example is the RE 
estimator; see section 6.5. 


For multivariate linear regression, such as the estimation of systems of 
equations, errors can be correlated across the equations for a specific 
individual. In this case, the model consists of $m$ linear regression 
equations $y_{ji} = \mathbf{x}_{ji}'\boldsymbol{\beta}_j + u_{ji}$, 
$j = 1, \ldots, m$, where the errors $u_{ji}$ are correlated over $j$ for a 
given $i$ but are uncorrelated over $i$. Then GLS estimation refers to 
efficient joint estimation of all $m$ regressions. The three-stage 
least-squares estimator is an extension to the case of 
simultaneous-equations systems. 


6.3 Modeling heteroskedastic data 


Heteroskedastic errors are pervasive in microeconometrics. The failure of 
the assumption of homoskedasticity in the standard regression model leads 
to the OLS estimator being inefficient, though it is still a consistent estimator. 
Given heteroskedastic errors, there are two leading approaches. 


The first approach, taken in section 3.4.5, is to obtain robust estimates of 
the standard errors of regression coefficients without assumptions about the 
functional form of heteroskedasticity. Under this option, the form of 
heteroskedasticity is of no interest to the investigator, who wants to report 
only correct standard errors, t statistics, and p-values. This approach is 
easily implemented in Stata, using the vce(robust) option, and is by far the 
most common approach taken in applied economics studies. 


The second approach seeks to model the heteroskedasticity and to obtain 
more efficient FGLS estimates. This enables more precise estimation of 
parameters and marginal effects and more precise prediction of the 
conditional mean. In practice, such gains in efficiency are often small, and 
OLS is commonly used. 


Unlike some other standard settings for FGLS, there is no direct Stata 
command for FGLS estimation given heteroskedastic errors. However, it is 
straightforward to obtain the FGLS estimator manually, as we now 
demonstrate. 


6.3.1 Simulated dataset 


We use a simulated dataset, one where the conditional mean of $y$ depends on 
regressors $x_2$ and $x_3$, while the conditional variance depends only on 
$x_2$. The specific data-generating process (DGP) is 


$$y = 1 + 1 \times x_2 + 1 \times x_3 + u; \quad x_2, x_3 \sim N(0, 25)$$
$$u = \sqrt{\exp(-1 + 0.2 \times x_2)} \times \varepsilon; \quad \varepsilon \sim N(0, 25)$$


Then the error $u$ is heteroskedastic with a conditional variance of 
$25 \times \exp(-1 + 0.2 \times x_2)$ that varies across observations 
according to the value taken by $x_2$. 


We generate a sample of size 500 from this DGP: 


. * Generated data for heteroskedasticity example 
. set seed 10101 

. qui set obs 500 

. generate double x2 = 5*rnormal(0) 

. generate double x3 = 5*rnormal(0) 

. generate double e = 5*rnormal(0) 

. generate double u = sqrt(exp(-1+0.2*x2))*e 

. generate double y = 1 + 1*x2 + 1*x3 + u 

. summarize 

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          x2 |        500   -.0117399    5.073625  -16.30832   15.46858
          x3 |        500    .0534507    5.061389  -20.59562   12.92513
           e |        500    .2355246    5.131809  -15.90812    12.4746
           u |        500    .1644912    4.234811  -22.13982   24.04547
           y |        500    1.206202    8.475936  -24.92206   41.67478


The generated independent and identically distributed (i.i.d.) normal 
variables x2, x3, and, to a lesser extent, e have, approximately, means of 0 
and standard deviations of 5 as expected. 


6.3.2 OLS estimation 


OLS regression with default standard errors yields 


. * OLS regression with default standard errors 
. regress y x2 x3 

      Source |       SS           df       MS      Number of obs   =       500
-------------+----------------------------------   F(2, 497)       =    748.65
       Model |  26914.9701         2   13457.485   Prob > F        =    0.0000
    Residual |  8933.93256       497  17.9757194   R-squared       =    0.7508
-------------+----------------------------------   Adj R-squared   =    0.7498
       Total |  35848.9026       499  71.8414882   Root MSE        =    4.2398

           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x2 |   1.033346   .0374193    27.62   0.000      .959827    1.106866
          x3 |   .9919595   .0375098    26.45   0.000     .9182623    1.065657
       _cons |   1.165312   .1896199     6.15   0.000      .792757    1.537868


The coefficient estimates are close to their true values, and the 95% 
confidence intervals include the DGP values. The estimates are quite precise 
because there are 500 observations, and for this generated dataset, the 
$R^2 = 0.75$ is very high. 


The standard procedure is to obtain heteroskedasticity-robust standard 
errors for the same OLS estimators. We have 


. * OLS regression with heteroskedasticity-robust standard errors 
. regress y x2 x3, vce(robust) 

Linear regression                               Number of obs     =        500
                                                F(2, 497)         =     501.02
                                                Prob > F          =     0.0000
                                                R-squared         =     0.7508
                                                Root MSE          =     4.2398

             |               Robust
           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x2 |   1.033346   .0625617    16.52   0.000     .9104285    1.156265
          x3 |   .9919595   .0342737    28.94   0.000     .9246203    1.059299
       _cons |   1.165312   .1899729     6.13   0.000     .7920633    1.538561


In general, failure to control for heteroskedasticity leads to default standard 
errors being wrong, though a priori it is not known whether they will be too 
large or too small. In our example, we expect the standard errors for the 
coefficient of x2 to be most affected because the heteroskedasticity depends 
on x2. This is indeed the case. For x2, the robust standard error is about 70% 
higher than the incorrect default (0.063 versus 0.037). The original failure to 
control for heteroskedasticity led to wrong standard errors, in this case, 
considerable understatement of the standard error of x2. For x3, there is 
much less change in the standard error. 


6.3.3 Detecting heteroskedasticity 


A simple informal diagnostic procedure is to plot the absolute value of the 
fitted regression residual, $|\widehat{u}_i|$, against a variable assumed to 
be in the skedasticity function. The regressors in the model are natural 
candidates. 


The following code produces separate plots of $|\widehat{u}_i|$ against 
$x_{2i}$ and $|\widehat{u}_i|$ against $x_{3i}$ and then combines these into 
one figure (shown in figure 6.1) by using the graph combine command; see 
section 2.6. Several options for the twoway command are used to improve the 
legibility of the graph. 


. * Heteroskedasticity diagnostic scatterplot 
. qui regress y x2 x3 

. predict double uhat, resid 

. generate double absu = abs(uhat) 

. qui twoway (scatter absu x2) (lowess absu x2, bw(0.4) lw(thick)), 
>     legend(off) ytitle("Absolute value of residual") yscale(titleg(*5)) 
>     xscale(titleg(*5)) plotr(style(none)) name(gls1, replace) 

. qui twoway (scatter absu x3) (lowess absu x3, bw(0.4) lw(thick)), 
>     legend(off) ytitle("Absolute value of residual") yscale(titleg(*5)) 
>     xscale(titleg(*5)) plotr(style(none)) name(gls2, replace) 

. graph combine gls1 gls2, iscale(1.2) ysize(2.5) xsize(6.0) 
Figure 6.1. Absolute residuals graphed against x2 and x3 


It is easy to see that the range of the scatterplot becomes wider as x2 
increases, with a nonlinear relationship, and is unchanging as x3 increases. 
These observations are to be expected given the DGP. 


We can go beyond a visual representation of heteroskedasticity by 
formally testing the null hypothesis of homoskedasticity against the 
alternative that residual variances depend upon a) x2 only, b) x3 only, and c) 
x2 and x3 jointly. Given the previous plot (and our knowledge of the DGP), 
we expect the first test and the third test to reject homoskedasticity, while the 
second test should not reject homoskedasticity. 


These tests can be implemented using the postestimation command 
estat hettest, introduced in section 3.7.4. It is simplest to use the mtest 
option, which performs multiple tests that separately test each component 
and then test all components. We have 


. * Test heteroskedasticity depending on x2, x3, and x2 and x3 
. estat hettest x2 x3, mtest 

Breusch-Pagan/Cook-Weisberg test for heteroskedasticity 
Assumption: Normal error terms 
H0: Constant variance 

    Variable |      chi2    df        p
-------------+--------------------------
          x2 |    390.13     1   0.0000*
          x3 |      3.93     1   0.0474*
-------------+--------------------------
Simultaneous |    392.43     2   0.0000

* Unadjusted p-values 


The p-value for x2 is 0.000, causing us to reject the null hypothesis that the 
skedasticity function does not depend on x2. We conclude that there is 
heteroskedasticity due to x2. The p-value for x3 is 0.0474, so at level 0.05 
we just reject the null hypothesis that the skedasticity function does not 
depend on x3, which is not the correct decision in this case. The p-value of 
0.000 for the joint (simultaneous) hypothesis leads us to conclude that the 
skedasticity function depends on at least one of x2 and x3. 


The mtest option is especially convenient if there are many regressors 
and, hence, many candidates for causing heteroskedasticity. It does, 
however, use the version of estat hettest that assumes that errors are 
normally distributed. To relax this assumption to one of independent and 
identically distributed errors, we need to use the iid option (see 
section 3.7.4) and conduct separate tests. Doing this leads to test statistics 
(not reported) with values lower than those obtained above, and the null 
hypothesis of no heteroskedasticity is rejected for x2 and not rejected for x3. 
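
The separate tests with the iid option could be run along the following lines. This is a sketch that assumes the simulated variables y, x2, and x3 of section 6.3.1 are in memory; the final joint test is an optional addition.

```stata
* Sketch: heteroskedasticity tests that do not assume normal errors
quietly regress y x2 x3
estat hettest x2, iid        // H0: error variance does not depend on x2
estat hettest x3, iid        // H0: error variance does not depend on x3
estat hettest x2 x3, iid     // H0: no dependence on either x2 or x3
```

The regression is refit first so that estat hettest applies to the intended estimation results.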


6.3.4 FGLS estimation with heteroskedastic errors 


For potential gains in efficiency, we can estimate the parameters of the 
model by using the two-step FGLS estimation method presented in 
section 6.2.2. For heteroskedasticity, this is easy: from (6.2), we need to 
1) compute the estimate $\widehat{\sigma}_i^2$ and 2) OLS regress 
$y_i/\widehat{\sigma}_i$ on $\mathbf{x}_i/\widehat{\sigma}_i$. 


At the first step, we estimate the linear regression by OLS, save the 
residuals $\widehat{u}_i = y_i - \mathbf{x}_i'\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}}$, 
estimate the skedasticity function $\sigma^2(\mathbf{z}_i, \boldsymbol{\gamma})$ 
by regressing $\widehat{u}_i^2$ on $\sigma^2(\mathbf{z}_i, \boldsymbol{\gamma})$, 
and get the predicted values $\widehat{\sigma}^2(\mathbf{z}_i, \widehat{\boldsymbol{\gamma}})$. 
Here our tests suggest that the skedasticity function should include only x2. 
We specify the skedasticity function 
$\sigma^2(\mathbf{z}) = \exp(\gamma_1 + \gamma_2 x_2)$ because taking the 
exponential ensures a positive variance. This is a nonlinear model that 
needs to be estimated by nonlinear least squares. We use the nl command, 
which is explained in section 13.3.6. 


The first step of FGLS yields 


. * FGLS: First step get estimate of skedasticity function 
. drop uhat 

. qui regress y x2 x3                       // Get bols 

. predict double uhat, resid 

. generate double uhatsq = uhat^2           // Get squared residual 

. nl (uhatsq = exp({xb: x2}+{b0})), nolog   // NLS of uhatsq on exp(z'a) 

      Source |      SS            df       MS
-------------+----------------------------------    Number of obs =        500
       Model |  890038.92          2  445019.458    R-squared     =     0.5975
    Residual |   599508.9        498  1203.83313    Adj R-squared =     0.5959
-------------+----------------------------------    Root MSE      =    34.6963
       Total |  1489547.8        500  2979.09563    Res. dev.     =   4963.568

      uhatsq | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      /xb_x2 |   .3002613    .012056    24.91   0.000     .2765743    .3239482
         /b0 |   1.586178   .1561775    10.16   0.000      1.27933    1.893026

. predict double varu, yhat                 // Get sigmahat^2 


Note that x2 explains a good deal of the heteroskedasticity ($R^2 = 0.60$) and 
is highly statistically significant. For our DGP, 
$\sigma^2(\mathbf{z}) = 25 \times \exp(-1 + 0.2x_2) = \exp(\ln 25 - 1 + 0.2x_2) = \exp(2.22 + 0.2x_2)$, 
and the estimates are 1.59 and 0.30. 


At the second step, the predictions $\widehat{\sigma}^2(\mathbf{z})$ define 
the weights that are used to obtain the FGLS estimator. Specifically, we 
regress $y_i/\widehat{\sigma}_i$ on $\mathbf{x}_i/\widehat{\sigma}_i$, where 
$\widehat{\sigma}_i^2 = \exp(\widehat{\gamma}_1 + \widehat{\gamma}_2 x_{2i})$. 
This weighting can be done automatically by using aweights in estimation. If 
the aweight is $w_i$, then OLS regression is of $\sqrt{w_i}\,y_i$ on 
$\sqrt{w_i}\,\mathbf{x}_i$. Here we want the aweight to be 
$1/\widehat{\sigma}_i^2$, or 1/varu. Then, 


. * FGLS: Second step is weighted regression using estimated skedasticity function 
. regress y x2 x3 [aweight=1/varu] 
(sum of wgt is 338.0575221018863) 

      Source |       SS           df       MS      Number of obs   =       500
-------------+----------------------------------   F(2, 497)       =   2530.65
       Model |  17266.2105         2  8633.10525   Prob > F        =    0.0000
    Residual |  1695.47363       497  3.41141576   R-squared       =    0.9106
-------------+----------------------------------   Adj R-squared   =    0.9102
       Total |  18961.6841       499   37.999367   Root MSE        =     1.847

           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x2 |   1.037914   .0171689    60.45   0.000     1.004181    1.071646
          x3 |    1.01217   .0174524    58.00   0.000     .9778808     1.04646
       _cons |   1.348954   .1548488     8.71   0.000     1.044715    1.653193


As already noted, FGLS is a special case of a two-step estimator where it is 
not necessary to adjust the standard errors at the second step. Comparison 
with previous results for OLS with the correct robust standard errors shows 
that the estimated confidence intervals are narrower for FGLS. For example, 
for x2 the improvement is from [0.91, 1.16] to [1.00, 1.07]. As predicted by 
theory, FGLS with a correctly specified model for heteroskedasticity is more 
efficient than OLS. 
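The claim that weighting by $w_i$ amounts to OLS of $\sqrt{w_i}\,y_i$ on $\sqrt{w_i}\,\mathbf{x}_i$ can be checked numerically. The following is a minimal Python sketch, not Stata code and not a reproduction of the dataset above: the six observations, the assumed variance function, and the helper names `solve_2x2`, `weighted_ls`, and `ols_two_regressors` are all hypothetical. It uses a single regressor plus an intercept so that the normal equations can be solved by hand.

```python
import math

def solve_2x2(a11, a12, a22, b1, b2):
    """Solve [[a11, a12], [a12, a22]] @ (c0, c1) = (b1, b2) by Cramer's rule."""
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def weighted_ls(y, x, w):
    """Minimize sum_i w_i * (y_i - c0 - c1 * x_i)**2 over (c0, c1)."""
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    s_wy = sum(wi * yi for wi, yi in zip(w, y))
    s_wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    return solve_2x2(s_w, s_wx, s_wxx, s_wy, s_wxy)

def ols_two_regressors(y, z0, z1):
    """OLS of y on z0 and z1 with no additional constant."""
    a11 = sum(a * a for a in z0)
    a12 = sum(a * b for a, b in zip(z0, z1))
    a22 = sum(b * b for b in z1)
    b1 = sum(a * yi for a, yi in zip(z0, y))
    b2 = sum(b * yi for b, yi in zip(z1, y))
    return solve_2x2(a11, a12, a22, b1, b2)

# Hypothetical data: six observations with error variance rising in x
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.4, 3.6, 5.9, 5.1]
sigma2 = [math.exp(0.5 + 0.3 * xi) for xi in x]  # assumed variance function
w = [1.0 / s for s in sigma2]                    # weight = 1 / variance

# Route 1: weighted least squares on the original data
direct = weighted_ls(y, x, w)

# Route 2: OLS after multiplying y, the constant, and x by sqrt(w)
root_w = [math.sqrt(wi) for wi in w]
y_star = [r * yi for r, yi in zip(root_w, y)]
x_star = [r * xi for r, xi in zip(root_w, x)]
transformed = ols_two_regressors(y_star, root_w, x_star)

# Multiplying every weight by a constant leaves the coefficients unchanged
rescaled = weighted_ls(y, x, [10.0 * wi for wi in w])

print(direct)
print(transformed)
print(rescaled)
```

The two routes solve the same normal equations, so the printed coefficient pairs agree up to rounding error. Only the relative sizes of the weights matter for the coefficient estimates.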


The hetregress command with option twostep implements a similar
WLS method, except that the weights are the exponential of the fitted value
obtained from OLS regression of $\ln(\widehat{u}_i^2)$ on $\mathbf{z}_i$.

In practice, the form of heteroskedasticity is not known. Then a similar 
favorable outcome may not occur, and we should create more robust 
standard errors as we next consider. 


6.3.5 Heteroskedastic-robust standard errors for the WLS estimator 


The FGLS standard errors are based on the assumption of a correct model for 
heteroskedasticity. To guard against misspecification of this model, we use 
the WLS estimator presented in section 6.2.3, which is equal to the FGLS 
estimator but uses robust standard errors that do not rely on a model for 
heteroskedasticity. We have 


. * WLS estimator is FGLS with robust estimate of VCE 
. regress y x2 x3 [aweight=1/varu], vce(robust) 
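(sum of wgt is 338.0575221018863)

Linear regression                               Number of obs     =        500
                                                F(2, 497)         =    2048.50
                                                Prob > F          =     0.0000
                                                R-squared         =     0.9106
                                                Root MSE          =      1.847

------------------------------------------------------------------------------
             |               Robust
           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x2 |   1.037914   .0201851    51.42   0.000     .9982548    1.077572
          x3 |    1.01217   .0201042    50.35   0.000     .9726707     1.05167
       _cons |   1.348954   .1654809     8.15   0.000     1.023826    1.674083
------------------------------------------------------------------------------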


The robust standard errors for FGLS are somewhat similar to the default 
standard errors for FGLS, as expected because here FGLS is known to use the 
DGP model for heteroskedasticity. They are substantially smaller than the 
robust standard errors for OLS. 


6.4 OLS for clustered data 


Cluster–robust inference for OLS was introduced in section 3.4.6. Here we
provide greater detail.


Before we proceed, however, note that the topic of inference with 
clustered errors should not be confused with the quite different topic of 
cluster analysis, identifying groupings within the data, which can be 
implemented using the cluster command. 


We consider OLS estimation for data on individual $i$ in cluster $g$ (of $G$
clusters), so

$$y_{gi} = \mathbf{x}_{gi}'\boldsymbol{\beta} + u_{gi}, \quad i = 1, \ldots, N_g, \quad g = 1, \ldots, G$$

where clustering is on the first subscript $g$. Note that we could instead use
the alternative ordering of the subscripts $y_{ig}$. We use $y_{gi}$ here to be consistent
with short panel data $y_{it}$ for which clustering is often on the individual, the
first subscript.


6.4.1 Clustered dataset 


We consider data on use of medical services by individuals, where 
individuals are clustered within household and, additionally, households are 
clustered in villages (called communes in this dataset). The data, from 
Vietnam, are the same as those used in Cameron and Trivedi (2005, 852). 


Before analysis, we drop one household whose members were split 
between two communes and drop one observation with missing data. We 
have 


. * Read in Vietnam clustered data, and delete one household in two communes 
. qui use mus206vlss, clear 


. drop if lnhhexp > 2.579681 & lnhhexp < 2.579683
(12 observations deleted)

. drop if missing(lnhhexp)
(1 observation deleted)


The dependent variable is the number of direct pharmacy visits
(pharvis). The independent variables are the logarithm of household
medical expenditures (lnhhexp) and the number of illnesses (illness). The
data cover 12 months.


The commune variable identifies the 194 separate villages. However, the
dataset does not include a household identifier. We create the household
identifier hh using the fact that the lnhhexp variable takes on a distinct value
for each household. We have


. * Create a unique household identifier and summarize data
. egen hh = group(lnhhexp)

. summarize pharvis lnhhexp illness hh commune

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     pharvis |     27,753    .5110439    1.312606          0         30
     lnhhexp |     27,753     2.60262    .6245493   .0467014   5.405502
     illness |     27,753    .6218427    .8995018          0          9
          hh |     27,753    3097.765    1601.653          1       5739
     commune |     27,753     101.514    56.27264          1        194


There are 5,739 households. 


The pharvis variable is a count that is best modeled by using count 
regression commands such as poisson and xtpoisson; this is done in 
section 13.9. In this chapter, we use the same data to illustrate linear 
regression with clustered data. 


6.4.2 OLS and cluster–robust standard errors


One complication of clustering is that the error is correlated within cluster. If
that is the only complication, then valid inference simply uses standard
cross-sectional estimators along with cluster–robust standard errors that are
given in table 6.1. The overall $F$ statistic and individual $t$ statistics and
confidence intervals use $G - 1$ degrees of freedom.


. * OLS estimation with cluster--robust standard errors clustering on household 
. regress pharvis lnhhexp illness, vce(cluster hh) 


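Linear regression                               Number of obs     =     27,753
                                                F(2, 5738)        =     579.82
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1817
                                                Root MSE          =     1.1874

                                 (Std. err. adjusted for 5,739 clusters in hh)
------------------------------------------------------------------------------
             |               Robust
     pharvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     lnhhexp |   .0247159   .0139563     1.77   0.077    -.0026437    .0520754
     illness |   .6235444   .0183113    34.05   0.000     .5876474    .6594414
       _cons |   .0589713     .03666     1.61   0.108    -.0128961    .1308387
------------------------------------------------------------------------------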


One more illness is associated with 0.624 more pharmacy visits, a very 
large effect. By contrast, an increase in household medical expenditures has 
a very small effect because a one-unit change in the natural logarithm of 
household medical expenditures, a very large change, is associated with only 
0.025 more pharmacy visits. 


Here we contrast the different standard errors obtained with no 
correction for clustering, clustering on household, and clustering on village. 
We have 


. * OLS estimation with various cluster--robust standard errors
. qui regress pharvis lnhhexp illness

. estimates store OLS_iid

. qui regress pharvis lnhhexp illness, vce(robust)

. estimates store OLS_het

. qui regress pharvis lnhhexp illness, vce(cluster hh)

. estimates store OLS_hh

. qui regress pharvis lnhhexp illness,
>     vce(bootstrap, cluster(hh) seed(10101) reps(400))

. estimates store OLS_boot

. qui regress pharvis lnhhexp illness, vce(cluster commune)

. estimates store OLS_comm

. estimates table OLS_iid OLS_het OLS_hh OLS_boot OLS_comm,
>     b(%10.4f) se stats(r2 N)

--------------------------------------------------------------------------
    Variable |     OLS_iid     OLS_het      OLS_hh    OLS_boot    OLS_comm
-------------+------------------------------------------------------------
     lnhhexp |      0.0247      0.0247      0.0247      0.0247      0.0247
             |      0.0115      0.0109      0.0140      0.0135      0.0211
     illness |      0.6235      0.6235      0.6235      0.6235      0.6235
             |      0.0080      0.0141      0.0183      0.0177      0.0342
       _cons |      0.0590      0.0590      0.0590      0.0590      0.0590
             |      0.0316      0.0292      0.0367      0.0361      0.0556
-------------+------------------------------------------------------------
          r2 |      0.1817      0.1817      0.1817      0.1817      0.1817
           N |       27753       27753       27753       27753       27753
--------------------------------------------------------------------------
                                                              Legend: b/se


The effect of correction of standard errors for heteroskedasticity is
unknown a priori. Here there is little effect on the standard errors for the
intercept and lnhhexp, though there is a large increase for illness.


More importantly, controlling for clustering is expected to increase
reported standard errors, especially for regressors highly correlated within
the cluster. Here standard errors increase by around 30% as we move from
heteroskedastic-robust standard errors to clustering on household and by
another 50% or more as we move to clustering on village. In total, clustering
on village leads to a doubling of standard errors compared with
heteroskedastic-robust standard errors.
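These percentage increases follow directly from the standard errors reported in the table above. The following Python sketch, which is not Stata code, recomputes the ratios. The dictionary `se` and the helper `ratio` are hypothetical names, and the numbers are copied from the OLS_het, OLS_hh, and OLS_comm columns.

```python
# Standard errors copied from the estimates table above
se = {
    "lnhhexp": {"het": 0.0109, "hh": 0.0140, "commune": 0.0211},
    "illness": {"het": 0.0141, "hh": 0.0183, "commune": 0.0342},
    "_cons":   {"het": 0.0292, "hh": 0.0367, "commune": 0.0556},
}

def ratio(var, numerator, denominator):
    """Ratio of two standard errors for the same coefficient."""
    return se[var][numerator] / se[var][denominator]

for var in se:
    hh_vs_het = ratio(var, "hh", "het")            # household vs. robust
    comm_vs_hh = ratio(var, "commune", "hh")       # village vs. household
    comm_vs_het = ratio(var, "commune", "het")     # village vs. robust
    print(var, round(hh_vs_het, 2), round(comm_vs_hh, 2), round(comm_vs_het, 2))
```

The first ratio is close to 1.3 for every coefficient, the second is at least 1.5, and the third is roughly 2 or more, which is the pattern described in the text.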


The fourth column provides cluster-pairs bootstrap standard errors based 
on 400 bootstrap replications, where the bootstrap resampling is over 
households. These are close (within 4%) to the cluster-robust standard errors 
given in the third column, as expected given the asymptotic equivalence of 
the two methods. 


6.4.3 Intracluster correlation and inflation of OLS standard errors 


The impact of clustering on default OLS standard errors depends in part on
the intracluster correlation of regressors and the intracluster correlation of
model errors. The intracluster correlation coefficient can be obtained using
the loneway command.


. * Within commune intracluster correlation for lnhhexp
. loneway lnhhexp commune

   One-way analysis of variance for lnhhexp: log of total household expenditu

                                              Number of obs =     27,753
                                                  R-squared =     0.5014

    Source                SS         df      MS            F     Prob > F
-------------------------------------------------------------------------
Between commune        5427.1435    193    28.119914    143.57     0.0000
Within commune         5397.8513 27,559    .19586528
-------------------------------------------------------------------------
Total                  10824.995 27,752    .39006179

         Intraclass       Asy.
         correlation      S.E.       [95% conf. interval]
         ------------------------------------------------
            0.49920     0.02621       0.44782     0.55057

         Estimated SD of commune effect          .4418549
         Estimated SD within commune             .4425667
         Est. reliability of a commune mean       0.99303
              (evaluated at n=143.03)


For variable lnhhexp, the within-commune intracluster correlation is 0.499.


Cluster–robust standard errors are approximately
$\tau = \sqrt{1 + \rho_x \rho_u (M - 1)}$ times the default OLS standard errors, where $\rho_x$ and
$\rho_u$ are intracluster correlations of $x$ and $u$ and $M$ is the average number of
observations per cluster (see section 3.4.6). The following code does this
computation for the standard error of the estimated coefficient of lnhhexp
when clustering is on commune. We have


. * Standard-error inflation factor for b_lnhhexp when clustering on commune
. qui loneway lnhhexp commune

. scalar rho_x = r(rho)

. qui regress pharvis lnhhexp illness

. scalar numobs = e(N)

. predict uhat, resid

. qui loneway uhat commune

. scalar rho_uhat = r(rho)

. qui sum

. scalar M = numobs/194

. display "rho_x = " rho_x "  rho_u = " rho_uhat "  M = " M
>     _n "Standard error inflation = " sqrt(1+rho_x*rho_uhat*(M-1))
rho_x = .49919517  rho_u = .04866825  M = 143.0567
Standard error inflation = 2.1098013


This calculation of standard error inflation of 2.11 is only a guide. In fact,
the standard error increased from 0.0115 to 0.0211, an inflation factor of
0.0211/0.0115 = 1.83. But the computation does provide some intuition.
While the within-commune correlation of residuals was low (0.049), there
were many individuals per cluster (143 on average) and high
within-commune correlation of lnhhexp.
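The arithmetic behind these two inflation factors can be replayed outside Stata. The following Python sketch plugs the reported intracluster correlations and average cluster size into the approximation and compares the result with the ratio of the reported standard errors. The function name `se_inflation` is hypothetical; the inputs are the values displayed above.

```python
import math

def se_inflation(rho_x, rho_u, m_bar):
    """Approximate ratio of cluster-robust to default OLS standard errors:
    sqrt(1 + rho_x * rho_u * (M - 1))."""
    return math.sqrt(1.0 + rho_x * rho_u * (m_bar - 1.0))

rho_x = 0.49919517     # intracluster correlation of lnhhexp (from loneway)
rho_u = 0.04866825     # intracluster correlation of the OLS residuals
m_bar = 27753 / 194    # average number of observations per commune

approx = se_inflation(rho_x, rho_u, m_bar)
actual = 0.0211 / 0.0115   # cluster-on-commune SE over default SE for lnhhexp

print(round(approx, 4), round(actual, 2))
```

Even though `rho_u` is below 0.05, the product `rho_x * rho_u * (m_bar - 1)` exceeds 3 because the clusters are large, which is why the implied inflation is more than a doubling.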


A variation mentioned in section 3.4.6 uses $\tau = \sqrt{1 + \rho_{xu}(M - 1)}$,
where $\rho_{xu}$ is the within-cluster correlation of $x_{gi}u_{gi}$. This yields estimated
standard error inflation of 2.68.


6.4.4 Two-way cluster–robust standard errors for OLS


Two-way clustering arises when clustered errors arise for two distinct
reasons that are nonnested, such as for matched employer–employee data
with potentially multiple observations for each employer and each employee.


Let $\widehat{V}_1$ denote the cluster–robust VCE obtained by clustering on
the first dimension, say, employer; $\widehat{V}_2$ denote the estimate obtained by
clustering on the second dimension, say, employee; and $\widehat{V}_{12}$ denote the
estimate obtained by clustering on the intersection of the two ways, say,
employer–employee pair. Then the two-way cluster–robust estimator of the
VCE of the OLS estimator is

$$\widehat{V}_{\text{2way}} = \widehat{V}_1 + \widehat{V}_2 - \widehat{V}_{12} \tag{6.3}$$

where the subtraction of $\widehat{V}_{12}$ corrects for overcounting of observations that
were clustered in both dimensions. This variance estimator, proposed
independently by Cameron, Gelbach, and Miller (2011), Thompson (2011), 


number of clusters in each dimension goes to infinity. 


The community-contributed command vcemway (Gu and Yoo 2019)
implements the two-way variance estimate not only for OLS but for any
estimation command that stores in e(V) a one-way cluster–robust variance
matrix estimate. The default follows the community-contributed cgmreg
command (Cameron, Gelbach, and Miller 2011) and uses different numbers
of clusters in computing the degrees-of-freedom adjustments for the three
variances in (6.3). The vcemway option vmcfactor(minimum) instead uses
$G_{\min} = \min(G, H)$ in computing each of $\widehat{V}_1$, $\widehat{V}_2$, and $\widehat{V}_{12}$, as do the ivreg2
command (Baum, Schaffer, and Stillman 2007) with option small (presented
in section 7.6.9) and the community-contributed command reg2hdfe
(Guimarães and Portugal 2010).


The two-way cluster–robust VCE is not guaranteed to be positive definite,
because while each of the three component matrices in (6.3) is positive
definite, the subtraction of the third matrix can lead to a VCE that is not
positive definite. Then standard errors cannot be computed. The vcemway
command follows the cgmreg command by setting any negative eigenvalues
of the two-way VCE matrix to zero and then reconstructing the VCE using an
eigenvalue decomposition of the matrix. A conservative alternative instead
drops the third term in (6.3).
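The combination in (6.3) and the eigenvalue repair can be illustrated with a small numerical example. The following Python sketch is not the vcemway or cgmreg implementation; it only shows the idea. The three 2 x 2 matrices are hypothetical, chosen so that the combined matrix has a negative eigenvalue, and the function names are ours. Restricting attention to 2 x 2 symmetric matrices lets the eigendecomposition be written in closed form with the standard library.

```python
import math

def two_way_vce(v1, v2, v12):
    """Elementwise V1 + V2 - V12 for square matrices stored as lists of lists."""
    k = len(v1)
    return [[v1[r][c] + v2[r][c] - v12[r][c] for c in range(k)] for r in range(k)]

def eig_sym_2x2(m):
    """Eigenvalues and orthonormal eigenvectors of a symmetric 2x2 matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - d)   # rotation that diagonalizes m
    c, s = math.cos(theta), math.sin(theta)
    lam1 = a * c * c + 2.0 * b * c * s + d * s * s
    lam2 = a * s * s - 2.0 * b * c * s + d * c * c
    return [(lam1, (c, s)), (lam2, (-s, c))]

def clip_negative_eigenvalues(m):
    """Set negative eigenvalues to zero and rebuild the matrix."""
    out = [[0.0, 0.0], [0.0, 0.0]]
    for lam, vec in eig_sym_2x2(m):
        lam = max(lam, 0.0)
        for r in range(2):
            for c in range(2):
                out[r][c] += lam * vec[r] * vec[c]
    return out

# Hypothetical one-way and intersection VCE estimates (each positive definite)
v1 = [[4.0, 1.0], [1.0, 2.0]]
v2 = [[3.0, -1.0], [-1.0, 1.0]]
v12 = [[2.0, 0.5], [0.5, 3.5]]

v_2way = two_way_vce(v1, v2, v12)
v_fixed = clip_negative_eigenvalues(v_2way)

print(v_2way)
print([lam for lam, _ in eig_sym_2x2(v_2way)])
print(v_fixed)
```

In this example the second diagonal element of the combined matrix is negative, so no standard error could be computed from it. After the negative eigenvalue is set to zero, the rebuilt matrix has nonnegative diagonal elements.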


In the current application, there is no obvious reason for clustering on a 
second dimension. For purely illustrative purposes, we cluster in two 
nonnested dimensions: commune and illdays. Using the vcemway command, 
we obtain 


. * Two-way cluster--robust standard errors using vcemway command 
. vcemway regress pharvis lnhhexp illness, cluster(commune illdays) 


Linear regression                               Number of obs     =     27,753
                                                F(2, 193)         =     171.28
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1817
                                                Root MSE          =     1.1874

                        (Std. err. adjusted for clustering on commune illdays)
------------------------------------------------------------------------------
             |               Robust
     pharvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     lnhhexp |   .0247159   .0312952     0.79   0.436     -.039111    .0885428
     illness |   .6235444   .0661737     9.42   0.000     .4885824    .7585065
       _cons |   .0589713    .061069     0.97   0.342    -.0655797    .1835223
------------------------------------------------------------------------------
Notes:
  Std. Err. adjusted for 2-way clustering on commune illdays
  Number of clusters in commune = 194
  Number of clusters in illdays = 32
  Stata's default small-cluster correction factors have been applied. See
> vcemway for detail.
  Residual degrees of freedom for t tests and F tests = 31
  F(,) and Prob > F above only account for one-way clustering on commune.
  Use test to compute F(,) and Prob > F that account for 2-way clustering.


The two-way cluster–robust standard errors are larger than the earlier
one-way cluster–robust standard errors with clustering on commune. The
$p$-values and confidence intervals are based on the $t(31)$ distribution because
here $\min(G, H) = 32$. For estimation commands such as logit that use the
standard normal distribution, it is better to use the vcemway option vmdfr(#),
where # equals $\min(G, H) - 1$.


Finally, we compare separate one-way clustering on commune or on 
illdays with two-way clustering on both variables. We have 


. * Effect of two-way versus separate one-way clustering
. qui regress pharvis lnhhexp illness, vce(robust)

. estimates store het

. qui regress pharvis lnhhexp illness, cluster(commune)

. estimates store commune

. qui regress pharvis lnhhexp illness, cluster(illdays)

. estimates store illdays

. qui vcemway regress pharvis lnhhexp illness, cluster(commune illdays)

. estimates store twoway

. qui vcemway regress pharvis lnhhexp illness, cluster(commune illdays)
>     vmcfactor(minimum)

. estimates store twowayalt

. estimates table het commune illdays twoway twowayalt, b(%10.4f) se stats(r2
>     N)

--------------------------------------------------------------------------
    Variable |         het     commune     illdays      twoway   twowayalt
-------------+------------------------------------------------------------
     lnhhexp |      0.0247      0.0247      0.0247      0.0247      0.0247
             |      0.0109      0.0211      0.0271      0.0313      0.0314
     illness |      0.6235      0.6235      0.6235      0.6235      0.6235
             |      0.0141      0.0342      0.0599      0.0662      0.0663
       _cons |      0.0590      0.0590      0.0590      0.0590      0.0590
             |      0.0292      0.0556      0.0448      0.0611      0.0614
-------------+------------------------------------------------------------
          r2 |      0.1817      0.1817      0.1817      0.1817      0.1817
           N |       27753       27753       27753       27753       27753
--------------------------------------------------------------------------
                                                              Legend: b/se


The standard errors with two-way clustering exceed those with either
separate one-way clustering.


6.4.5 Dyad-robust standard errors for OLS 


Dyadic data are data with dependent variable $y_{ij}$ that measures the
relationship between individuals $i$ and $j$. Examples include friendship
interactions, interhousehold interactions, and trade flows between countries.


For such data, it is possible that a model error $u_{ij}$ will be correlated with
every other model error $u_{kl}$ if any of the two components of the pair $(k, l)$ is
common with any of the two components of the pair $(i, j)$. Two-way
cluster–robust standard errors control for error correlation when $k = i$ or
$l = j$, or both. Dyadic-robust standard errors additionally control for error
correlation when $k = j$ or $l = i$, or both.
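The difference between the two patterns of allowed error correlation can be stated as a pair of predicates on dyads. The following Python sketch is illustrative only: the function names and the example dyads are hypothetical, and each dyad is written as an ordered pair (first member, second member).

```python
def correlated_two_way(dyad_a, dyad_b):
    """Two-way clustering on first member and on second member: correlation is
    allowed when the first members match or the second members match."""
    (i, j), (k, l) = dyad_a, dyad_b
    return k == i or l == j

def correlated_dyadic(dyad_a, dyad_b):
    """Dyadic clustering: correlation is allowed when the dyads share any member,
    which adds the cases k == j and l == i."""
    (i, j), (k, l) = dyad_a, dyad_b
    return k == i or l == j or k == j or l == i

# Hypothetical dyads compared with the dyad (1, 2)
examples = [(1, 3), (4, 2), (2, 3), (3, 1), (3, 4)]
for other in examples:
    print(other, correlated_two_way((1, 2), other), correlated_dyadic((1, 2), other))
```

The dyads (2, 3) and (3, 1) share a member with (1, 2) but in the opposite position, so only the dyadic rule treats their errors as potentially correlated with that of (1, 2). The dyad (3, 4) shares no member and is treated as independent under both rules.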


For the formula for dyadic-robust standard errors, see Cameron and
Miller (2014) or Aronow, Samii, and Assenova (2015), and for asymptotic
theory, see Tabord-Meehan (2019).


The need to control for dyadic correlation is greater the more links there 
are across observations. Fafchamps and Gubert (2007) proposed this 
correction for social networks across households, where households on 
average interacted with only a few other households, and found that the 
dyadic correction made little difference to standard errors. 


By contrast, Cameron and Miller (2014) considered data on trade flows
between many countries and found that dyadic-robust standard errors could
be several times standard errors that do not make any correction (and are
about 25% larger than two-way cluster–robust standard errors). Furthermore,
if panel data $y_{ij,t}$ are available and dyad-specific fixed effects are included,
there is still great need to control for dyadic error correlation.


6.4.6 Cluster–robust inference with few clusters


The underlying asymptotic theory assumes that the number of clusters
$G \to \infty$. It is not unusual to have applications where there are few clusters,
such as when clustering is on state or province. In that case, the usual
asymptotic theory can lead to tests with considerable size distortion,
typically overrejecting, and confidence intervals that typically undercover.


A similar situation would arise in a cross-sectional setting with, say, 
N = 20 independent observations, but with so few observations, estimation 
is then often so imprecise that there is no reason to go further and use 
improved inferential methods. By contrast, it is not unusual to obtain 
reasonably precise estimates with G = 10 or even lower, given many 
observations per cluster, so there is then a reason to improve on the usual 
asymptotic methods. 


A simple correction is to base inference on the $t$ and $F$ distributions
rather than the standard normal and chi-squared distributions. This can be
done after any estimation command using the test command or testparm
command with option df(#); a common choice is to use degrees of freedom
$G - 1$. A few Stata linear estimation commands, notably regress,
automatically provide output with tests and confidence intervals for
coefficients based on the $t(G - 1)$ and $F(q, G - 1)$ distributions, as does
ivregress with option small. But many commands, especially nonlinear commands
such as poisson, provide output based only on standard normal and
chi-squared distributions. The option df(#) of the test command following
any estimation command uses the $F(q, \#)$ distribution.


Suppose $G = 10$. If a Wald test has $t = 1.96$, then using the $t(G - 1)$
distribution leads to a two-sided p-value of 0.082, computed as 
2*ttail(9,1.96), rather than 0.05 if the standard normal distribution was 
used. Even this correction is not enough in practice, and simulation studies 
find that the effective degrees of freedom can be much less than G — 1. This 
is especially the case if the clusters are unbalanced because of differences in 
cluster sizes or substantial differences in regressor values across clusters. 
Carter, Schnepel, and Steigerwald (2017) present results, and the associated 
community-contributed clusteff command (Lee and Steigerwald 2018), 
illustrated in section 12.6.3, provides a conservative estimate of the effective 
number of clusters. 


With unbalanced clusters, some clusters can greatly influence parameter 
estimates and also lead to test size distortion. MacKinnon, Nielsen, and 
Webb (Forthcoming) provide a discussion, and the community-contributed 
summcluster package (MacKinnon, Nielsen, and Webb 2022) provides 
measures of high-leverage clusters and influential clusters. 


And with few clusters the VCE should use a finite degrees of freedom
correction such as Stata's scaling by $G/(G - 1)$ or, for linear regression,
$N/(N - K)$ times $G/(G - 1)$.


Some specialized methods that are oriented to specific settings of few
clusters have been proposed; Cameron and Miller (2015) and MacKinnon,
Nielsen, and Webb (Forthcoming) provide reviews. These include settings
with many observations per cluster and settings with just one treated cluster.
If the regressor of interest varies within cluster, Ibragimov and
Müller (2010) propose a test that is conservative (provided the test level
$\alpha < 0.083$) based on separate OLS estimation in each cluster. If the regressor
of interest is the same for all units in the cluster, which is often the case in a
treatment evaluation setting, then Donald and Lang (2007) propose use of a
two-step estimator under the assumption of homogeneous clusters.
Hagemann (2020) proposes a rearrangement test when there is only one
treated cluster and provides many related references.


A general method for inference with few clusters applies the percentile-t 
method to a particular bootstrap, the wild cluster bootstrap; and the 
community-contributed boottest command (Roodman et al. 2019) 
implements this bootstrap; see section 12.6. The command can be applied 
following a wide range of estimation commands used in analyzing clustered 
data, including following regress, areg, and xtreg, fe. Improved inference 
with few clusters is a current area of research. 


6.5 FGLS estimators for clustered data 


For cross-sectional datasets, such as data on individuals with clustering on
village, there are two ways to implement FGLS with clustered data. First, one
can adapt panel-data methods, using the xtreg command, because panel
data also have clustered errors under the standard panel assumption of
uncorrelated errors across individuals but error correlation over time for a
given individual. Second, one can directly use multilevel mixed-effects
regression, using the mixed command. In the simplest case of models with
an i.i.d. normally distributed random intercept, the two methods lead to the
same results.


The first approach is the approach usually taken by economists and is 
presented in sections 6.5 and 6.6. The second approach, used in many other 
disciplines, is deferred to section 6.7. 


In sections 6.5 and 6.6, we use the xtset command, the xtreg command 
(with options pa, re, and fe), and the xtdescribe command. These 
commands are presented in great detail in chapter 8; the presentation in the 
current chapter is much briefer. 


6.5.1 Defining the cluster identifier 


While we apply panel-data commands to this cross-sectional example, it is
important to note a conceptual change from the panel-data case.

For panels, there are multiple observations per individual, so clustering is
on the individual ($i$). Here, instead, there are multiple observations per
household or per village, so clustering is on the household or village ($g$).


To use the xt commands, we first need to define the variable that 
identifies the cluster, using the xtset command. In the current cross- 
sectional setting, this is the household or village. 


In the remainder of this chapter, we focus on clustering on the 
household. We have 


. * Define the cluster identifier variable
. xtset hh

Panel variable: hh (unbalanced)


6.5.2 FGLS estimator for population-averaged model 


The FGLS estimator requires specification of a model for within-cluster
correlation of the error $u_{gi}$ in the linear model $y_{gi} = \mathbf{x}_{gi}'\boldsymbol{\beta} + u_{gi}$.


A natural model for data clustered on, for example, village or region or
school is that the error correlation is the same for any pair of individuals in
the same cluster. That is, $\mathrm{Cor}(u_{gi}, u_{gj}) = \rho$ for $i \neq j$, an assumption called
one of equicorrelation or of exchangeable errors.


The consequent FGLS estimator can be obtained using the xtreg, pa
command with option corr(exchangeable). If equicorrelation is the wrong
model for error correlation, the FGLS estimator of $\boldsymbol{\beta}$ remains consistent, but
default standard errors are inconsistent. Instead we use cluster–robust
standard errors. We obtain


. * FGLS with equicorrelated errors using xtreg, pa
. xtreg pharvis lnhhexp illness, pa corr(exchangeable) vce(robust)

Iteration 1: tolerance = .01925861
Iteration 2: tolerance = .00001094
Iteration 3: tolerance = 7.437e-09

GEE population-averaged model                   Number of obs     =     27,753
Group variable: hh                              Number of groups  =      5,739
Family: Gaussian                                Obs per group:
Link:   Identity                                              min =          1
Correlation: exchangeable                                     avg =        4.8
                                                              max =         19
                                                Wald chi2(2)      =    1346.73
Scale parameter = 1.409849                      Prob > chi2       =     0.0000

                                     (Std. err. adjusted for clustering on hh)
------------------------------------------------------------------------------
             |               Robust
     pharvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     lnhhexp |   .0197198   .0146096     1.35   0.177    -.0089145    .0483541
     illness |   .6185181   .0168691    36.67   0.000     .5854552    .6515809
       _cons |   .0793774   .0388975     2.04   0.041     .0031398    .1556151
------------------------------------------------------------------------------


From supplemental output, the FGLS estimates are based on a within-cluster 
error correlation estimate of 0.164. 


6.5.3 Cluster-specific random-effects model 


For individual $i$ in cluster $g$ (of $G$ clusters), the cluster-specific effects model
is

$$y_{gi} = \mathbf{x}_{gi}'\boldsymbol{\beta} + \alpha_g + \varepsilon_{gi}, \quad i = 1, \ldots, N_g, \quad g = 1, \ldots, G \tag{6.4}$$

The change from the usual linear regression model is to allow the intercept
$\alpha_g$ to vary across the $G$ clusters, which will at least partially control for
within-cluster correlation.

In the RE model, $\alpha_g$ is an i.i.d. random variable. (In the FE model
presented in the subsequent section, $\alpha_g$ is not i.i.d. and furthermore is
potentially correlated with the regressors.)


More precisely, the RE model assumes that in (6.4) the cluster-specific
random effect $\alpha_g$ is i.i.d. $(0, \sigma_\alpha^2)$ and the individual-specific effect $\varepsilon_{gi}$ is i.i.d.
$(0, \sigma_\varepsilon^2)$. The additional component $\alpha_g$ in the model error induces
within-cluster error correlation. Specifically, for different individuals in the same
cluster, the error correlation is $\sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2)$. This is an example of
equicorrelation, with error correlation the same for any pair of observations
in the cluster. This assumption of exchangeable errors is quite reasonable
here because there is no natural ordering of household members. We expect
similar results to those from the preceding population-averaged estimator
based on exchangeable errors.


6.5.4 RE estimator 


The RE FGLS estimator can be obtained using the xtreg, re command
defined in section 8.2.4.

If the RE model for the errors is misspecified, then the RE estimator of $\boldsymbol{\beta}$
remains consistent, but default standard errors will be incorrect. Instead, we
report cluster–robust standard errors that are valid provided $G \to \infty$.


. * Random hh effects FGLS using xtreg, re 
. xtreg pharvis lnhhexp illness, re vce(robust) 


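Random-effects GLS regression                   Number of obs     =     27,753
Group variable: hh                              Number of groups  =      5,739

R-squared:                                      Obs per group:
     Within  = 0.1541                                         min =          1
     Between = 0.2071                                         avg =        4.8
     Overall = 0.1817                                         max =         19

                                                Wald chi2(2)      =    1371.37
corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000

                                 (Std. err. adjusted for 5,739 clusters in hh)
------------------------------------------------------------------------------
             |               Robust
     pharvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     lnhhexp |   .0182668   .0150469     1.21   0.225    -.0112245    .0477581
     illness |   .6165335   .0166859    36.95   0.000     .5838298    .6492372
       _cons |   .0858413   .0404213     2.12   0.034      .006617    .1650655
-------------+----------------------------------------------------------------
     sigma_u |  .62622406
     sigma_e |  1.0606587
         rho |  .25848181   (fraction of variance due to u_i)
------------------------------------------------------------------------------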


The slope coefficients are 0.018 and 0.617 compared with 0.025 and 0.624
for OLS. The estimates of $\sigma_\alpha$ and $\sigma_\varepsilon$ are, respectively, 0.626 and 1.061, so the
household-specific random effect is important, leading to within-household
error correlation of 0.258.
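The reported within-household error correlation follows from the two standard deviation estimates through the ratio $\sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2)$. The following Python sketch, which is not Stata code, recomputes it from the sigma_u and sigma_e values in the output above and also builds the implied equicorrelated error correlation matrix for a household of a given size. The function names and the household size of 4 are hypothetical choices for illustration.

```python
def intracluster_correlation(sigma_alpha, sigma_eps):
    """Error correlation implied by the RE model for two members of a cluster."""
    return sigma_alpha ** 2 / (sigma_alpha ** 2 + sigma_eps ** 2)

def exchangeable_corr(n, rho):
    """n x n equicorrelation matrix: ones on the diagonal, rho elsewhere."""
    return [[1.0 if r == c else rho for c in range(n)] for r in range(n)]

sigma_u = 0.62622406   # reported SD of the household effect
sigma_e = 1.0606587    # reported SD of the individual-level error

rho = intracluster_correlation(sigma_u, sigma_e)
print(round(rho, 4))

# Implied error correlation matrix for a hypothetical four-person household
for row in exchangeable_corr(4, round(rho, 3)):
    print(row)
```

Every off-diagonal element of the matrix is the same, which is what the exchangeable-errors assumption means in practice.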


A variation of the RE estimator is to additionally assume that $\alpha_g$ and $\varepsilon_{gi}$
are normally distributed. Then the MLE can be obtained using the xtreg,
mle command. From output given below, this yields very similar slope
coefficients of 0.0192 and 0.6179. The MLE estimates of $\sigma_\alpha$ and $\sigma_\varepsilon$ are,
respectively, 0.524 and 1.071.


6.5.5 Cluster–robust standard errors for the RE estimator


The default standard errors, based on the VCE formula given in the sixth row
of table 6.1, are valid only if the model for the errors is correctly specified,
that is, if $\alpha_g$ and $\varepsilon_{gi}$ are i.i.d. These are strong assumptions that impose
equicorrelation of errors in the same cluster and assume homoskedasticity.

We should instead use cluster–robust standard errors, based on the VCE
formula given in the last row of table 6.1. This is possible because the RE
estimator preserves independence across clusters, though we need to assume
that the number of clusters is large.


We compare VCEs in increasing order of control for clustering: no 
control, household clusters, and commune clusters. We have 


. * Random hh effects FGLS with various standard errors
. qui xtreg pharvis lnhhexp illness, re mle

. estimates store RE_MLE

. qui xtreg pharvis lnhhexp illness, re

. estimates store RE_def

. qui xtreg pharvis lnhhexp illness, re vce(robust)

. estimates store RE_rob

. qui xtreg pharvis lnhhexp illness, re vce(cluster hh)

. estimates store RE_hh

. qui xtreg pharvis lnhhexp illness, re vce(cluster commune)

. estimates store RE_comm

. estimates table RE_MLE RE_def RE_rob RE_hh RE_comm,
>     keep(lnhhexp illness _cons) b(%10.4f) se stats(N) eq(1)

--------------------------------------------------------------------------
    Variable |      RE_MLE      RE_def      RE_rob       RE_hh     RE_comm
-------------+------------------------------------------------------------
     lnhhexp |      0.0192      0.0183      0.0183      0.0183      0.0183
             |      0.0154      0.0168      0.0150      0.0150      0.0217
     illness |      0.6179      0.6165      0.6165      0.6165      0.6165
             |      0.0082      0.0083      0.0167      0.0167      0.0296
       _cons |      0.0816      0.0858      0.0858      0.0858      0.0858
             |      0.0413      0.0448      0.0404      0.0404      0.0583
-------------+------------------------------------------------------------
           N |       27753       27753       27753       27753       27753
--------------------------------------------------------------------------
                                                              Legend: b/se


The first column gives the RE MLE estimates (with default standard errors)
that are close to the subsequent RE estimates. The default standard errors for
RE estimates, given in the second column, substantially understate the
standard error of illness. The third and fourth columns are identical and
demonstrate that vce(robust) is equivalent to the vce(cluster clustvar)
option, provided that clustvar is the variable given in the xtset command.
These are the standard errors that would most often be used. The fifth
column clusters at the broader level of commune. This is possible because
households are nested within villages and leads to even larger standard
errors.


6.6 Fixed-effects estimator for clustered data 


The FE model is less restrictive than the RE model because it allows a 
component of the error, one that is cluster specific, to be correlated with the 
regressors. This reduces the likelihood that estimators are inconsistent. 


6.6.1 Cluster-specific FE model 


The cluster-specific FE model specifies that for individual $i$ in cluster $g$

$$y_{gi} = \mathbf{x}_{gi}'\boldsymbol{\beta} + \alpha_g + \varepsilon_{gi}, \quad i = 1, \ldots, N_g, \quad g = 1, \ldots, G \tag{6.5}$$

where the cluster-specific effect $\alpha_g$ is a fixed effect, one potentially
correlated with the regressors.

Like a random effect, which instead assumes that $\alpha_g$ is purely random,
this can at least partially control for within-cluster correlation.


More substantively, it relaxes the RE assumption that the cluster-specific
effect is uncorrelated with the regressors. If $\alpha_g$ is indeed correlated with $\mathbf{x}_{gi}$,
then both the OLS and RE estimators are inconsistent. The FE estimator,
presented below, is consistent provided only that $E(\varepsilon_{gi} \mid \mathbf{x}_{gi}, \alpha_g) = 0$. That
is, after we control for regressors and a cluster-specific effect that may be
correlated with the regressors, any remaining error is assumed to be
uncorrelated with the regressors. The RE estimator requires the much
stronger conditions that $\varepsilon_{gi}$ and $\alpha_g$ be uncorrelated with the regressors.


Thus, the FE estimator is widely used in microeconometrics studies with 
clustered data. 


6.6.2 FE estimator 


The FE estimator of $\boldsymbol{\beta}$, the FGLS estimator in the FE model
assuming that the error $\varepsilon_{gi}$ is i.i.d., can be shown to be
equivalent to the OLS estimator in the transformed model, called the within
model or mean-differenced model,

$$(y_{gi} - \bar{y}_g) = (\mathbf{x}_{gi} - \bar{\mathbf{x}}_g)'\boldsymbol{\beta} + (\varepsilon_{gi} - \bar{\varepsilon}_g) \qquad (6.6)$$

where, for example, $\bar{\mathbf{x}}_g = N_g^{-1} \sum_{i=1}^{N_g} \mathbf{x}_{gi}$.
Note that variables that are invariant within cluster drop out because
$x_{gi} = x_g$ for all $i$ implies $\bar{x}_g = x_g$. This is a regression of
the within-cluster variation of y on the within-cluster variation of x. Thus,
the FE estimator is also called the within estimator.
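The mean-differencing in (6.6) can be illustrated with a minimal sketch in plain Python. The eight observations, the cluster labels, and the function names are made up for illustration and are not taken from the dataset used in this chapter. The sketch computes the within slope for a single regressor and then checks that adding an arbitrary cluster-specific constant to y leaves the slope unchanged.

```python
# Toy clustered data: (cluster id, x, y). All values are hypothetical.
data = [
    ("a", 1.0, 2.0), ("a", 2.0, 4.5), ("a", 3.0, 6.0),
    ("b", 1.0, 5.0), ("b", 3.0, 9.5),
    ("c", 2.0, 1.0), ("c", 4.0, 4.0), ("c", 6.0, 8.0),
]


def cluster_means(rows):
    """Return {cluster: (mean of x, mean of y)}."""
    sums = {}
    for g, x, y in rows:
        n, sx, sy = sums.get(g, (0, 0.0, 0.0))
        sums[g] = (n + 1, sx + x, sy + y)
    return {g: (sx / n, sy / n) for g, (n, sx, sy) in sums.items()}


def within_slope(rows):
    """OLS slope of (y - ybar_g) on (x - xbar_g), as in the within model."""
    means = cluster_means(rows)
    sxy = sum((x - means[g][0]) * (y - means[g][1]) for g, x, y in rows)
    sxx = sum((x - means[g][0]) ** 2 for g, x, y in rows)
    return sxy / sxx


beta_fe = within_slope(data)

# Add a different constant to y in each cluster (a cluster-specific effect).
# Mean-differencing removes it, so the within slope does not change.
shift = {"a": 10.0, "b": -3.0, "c": 0.5}
shifted = [(g, x, y + shift[g]) for g, x, y in data]
beta_fe_shifted = within_slope(shifted)

print(beta_fe, beta_fe_shifted)
```

With several regressors the same idea applies after demeaning every variable by cluster; the single-regressor case is used here only to keep the arithmetic visible.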


The FE estimator is consistent, regardless of whether the cluster-specific 
effects are fixed or random. But there can be a loss of efficiency because the 
FE estimator uses only within-cluster variation in the data and because 
coefficients of cluster-invariant regressors are not identified. 


There are several ways to obtain the FE estimator of $\boldsymbol{\beta}$. The
preferred method is to use the xtreg, fe command, with cluster-robust standard
errors.


. * FE estimation with hh FE using xtreg, fe
. xtreg pharvis lnhhexp illness, fe vce(robust)
note: lnhhexp omitted because of collinearity.

Fixed-effects (within) regression               Number of obs     =     27,753
Group variable: hh                              Number of groups  =      5,739

R-squared:                                      Obs per group:
     Within  = 0.1541                                         min =          1
     Between = 0.2069                                         avg =        4.8
     Overall = 0.1816                                         max =         19

                                                F(1,5738)         =    1155.53
corr(u_i, Xb) = 0.0160                          Prob > F          =     0.0000

                                 (Std. err. adjusted for 5,739 clusters in hh)
------------------------------------------------------------------------------
             |               Robust
     pharvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     lnhhexp |          0  (omitted)
     illness |   .6089755   .0179147    33.99   0.000     .5738559     .644095
       _cons |   .1323569   .0111401    11.88   0.000     .1105181    .1541957
-------------+----------------------------------------------------------------
     sigma_u |  .82219146
     sigma_e |  1.0606587
         rho |  .37534725   (fraction of variance due to u_i)
------------------------------------------------------------------------------


Note that the coefficient of lnhhexp is not identified. This is because
lnhhexp is the same for every individual in the household, so there is no
within-cluster variation in lnhhexp, so $(x_{gi} - \bar{x}_g) = 0$ for this
regressor. The coefficient of illness is 0.609, similar to the OLS estimate
of 0.624.

The estimated intercept is nonzero because the xtreg, fe command
modifies (6.6) to instead estimate

$$(y_{gi} - \bar{y}_g + \bar{y}) = \alpha + (\mathbf{x}_{gi} - \bar{\mathbf{x}}_g + \bar{\mathbf{x}})'\boldsymbol{\beta} + (\varepsilon_{gi} - \bar{\varepsilon}_g + \bar{\varepsilon}) \qquad (6.7)$$

where $\bar{y}$ and $\bar{\mathbf{x}}$ are the overall means of y and x. This
yields the same slope estimates.
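A short derivation sketch shows where the intercept in this modified within regression comes from. It assumes the overall means are taken over all $N = \sum_g N_g$ observations; the symbol $\bar{\alpha}$ for the observation-weighted average of the cluster effects is notation introduced here for illustration. Averaging the FE model (6.5) over all observations and adding the result to the mean-differenced model (6.6) gives

```latex
\begin{align*}
\bar{y} &= \bar{\mathbf{x}}'\boldsymbol{\beta} + \bar{\alpha} + \bar{\varepsilon},
  \qquad \bar{\alpha} = N^{-1} \sum_{g=1}^{G} N_g \alpha_g , \\
(y_{gi} - \bar{y}_g) + \bar{y}
  &= (\mathbf{x}_{gi} - \bar{\mathbf{x}}_g)'\boldsymbol{\beta}
     + (\varepsilon_{gi} - \bar{\varepsilon}_g)
     + \bar{\mathbf{x}}'\boldsymbol{\beta} + \bar{\alpha} + \bar{\varepsilon} \\
  &= \bar{\alpha}
     + (\mathbf{x}_{gi} - \bar{\mathbf{x}}_g + \bar{\mathbf{x}})'\boldsymbol{\beta}
     + (\varepsilon_{gi} - \bar{\varepsilon}_g + \bar{\varepsilon}) .
\end{align*}
```

Under this reading, the reported intercept estimates the average cluster effect. Adding the constants $\bar{y}$ and $\bar{\mathbf{x}}$ to the regressand and regressors shifts only the intercept, which is why the slope estimates are unchanged.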


If the coefficient of lnhhexp is of intrinsic interest, then clearly the FE
estimator is of no use in this application. But if the coefficient of illness
is of intrinsic interest, then the FE estimator provides consistent estimates
even if illness is correlated with the household effect $\alpha_g$.
Consistency requires only that, after we control for other regressors and the
cluster-specific effect $\alpha_g$, illness is uncorrelated with the remaining
error $\varepsilon_{gi}$.


6.6.3 Cluster-robust standard errors for the FE estimator


Similar to the discussion for the RE estimator, even after inclusion of the
cluster-specific fixed effects $\alpha_g$, there is usually remaining
heteroskedasticity and within-cluster error correlation, so inference should
be based on cluster-robust standard errors.


We compare VCEs for the FE estimator, estimated using xtreg, fe, in 
increasing order of control for clustering: no control, household clusters, and 
village clusters. 


. * FE estimation with hh FE and various standard errors
. qui xtreg pharvis lnhhexp illness, fe
. estimates store FE_def
. qui xtreg pharvis lnhhexp illness, fe vce(robust)
. estimates store FE_rob
. qui xtreg pharvis lnhhexp illness, fe vce(cluster hh)
. estimates store FE_hh
. qui xtreg pharvis lnhhexp illness, fe vce(cluster commune)
. estimates store FE_comm
. qui regress pharvis lnhhexp illness, vce(cluster hh)
. estimates store OLS_hh
. estimates table FE_def FE_rob FE_hh FE_comm OLS_hh, b(%10.4f) se stats(N)


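    Variable |    FE_def      FE_rob       FE_hh     FE_comm      OLS_hh
-------------+------------------------------------------------------------
     lnhhexp | (omitted)   (omitted)   (omitted)   (omitted)      0.0247
             |                                                    0.0140
     illness |    0.6090      0.6090      0.6090      0.6090      0.6235
             |    0.0096      0.0179      0.0179      0.0276      0.0183
       _cons |    0.1324      0.1324      0.1324      0.1324      0.0590
             |    0.0087      0.0111      0.0111      0.0172      0.0367
-------------+------------------------------------------------------------
           N |     27753       27753       27753       27753       27753
--------------------------------------------------------------------------
                                                            Legend: b/se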


The default standard errors, given in the first column, substantially
understate the standard error of illness. The second and third columns are
identical and demonstrate that the vce(robust) option is equivalent to the
vce(cluster clustvar) option, provided that clustvar is the variable given
in the xtset command. These are the standard errors that would most often
be used. The fourth column clusters at the broader level of commune. This is
possible because households are nested within communes and leads to even
larger standard errors.


The FE estimator is often less efficient than the OLS estimator, sometimes 
substantially so, because it uses only within variation of the data. In this 
application, however, the third and fifth columns, which both cluster on 
household, display little difference in the standard error of illness. 


6.6.4 Alternative methods for computing the FE estimator 


There are alternative methods for computing the FE estimator in the model
(6.5). Even though these lead to exactly the same estimates of
$\boldsymbol{\beta}$ and similar default standard errors, we strongly
recommend that the xtreg, fe command be used. First, xtreg, fe (and areg) are
computationally faster than using regress with an indicator variable for each
cluster. Second, if there are few observations per cluster, then commands
other than xtreg, fe overstate cluster-robust standard errors because of the
way degrees of freedom are calculated.


One approach is to directly estimate both $\alpha_g$, $g = 1, \ldots, G$, and
$\boldsymbol{\beta}$ in the model (6.5). This is the least-squares
dummy-variable (LSDV) estimator, obtained by the OLS regression

$$y_{gi} = \Big( \sum_{h=1}^{G} \alpha_h d_{h,gi} \Big) + \mathbf{x}_{gi}'\boldsymbol{\beta} + \varepsilon_{gi} \qquad (6.8)$$

where there are $G$ cluster-specific indicator variables $d_{h,gi}$,
$h = 1, \ldots, G$, with $d_{h,gi} = 1$ for the $gi$th observation if $h = g$
and $d_{h,gi} = 0$ otherwise.

This brute-force method involves inversion of a $(G + K) \times (G + K)$
matrix, problematic in the current example with $G + K = 4168$. Thus, we
illustrate the method below using only a subset of the households.

The large matrix inversion can be avoided using an analytical result,
obtained through partitioned matrix inversion, that is applicable when the
regressors include a set of dummy variables, such as the $G$ dummies in (6.8).
The areg command (and xtreg, fe) implements this simplification.


To contrast these various methods, we create a dataset with just two 
observations per cluster, here household. We first create a variable for the 
number of persons in each household. 


. * Generate the number of persons in each household
. sort hh
. by hh: generate numpersonsinhh = _N

We then restrict analysis to the 441 households with exactly 2 members.
The FE model is fit using 1) the regress command applied to the LSDV
model, 2) the areg command, and 3) the xtreg, fe command.


For the regress command, the 441 dummy variables are entered as 
i.hh, leading to a large number of regressors and hence a large value for 
matsize. For the current example and using Stata 17, there is no problem. In 
versions of Stata before version 17, one may need to use the set matsize 
command to increase matsize from program defaults. In Stata 17, the set 
matsize command is redundant, but matrix sizes are limited by the edition 
of Stata; the help limits command gives these limits. 


Both default and cluster-robust standard errors are obtained. We have

. * Various standard errors for hh FE estimation using xtreg, reg, and areg
. preserve
. keep if numpersonsinhh == 2
(26,871 observations deleted)
. qui regress pharvis lnhhexp illness i.hh
. estimates store LSDV_def
. qui areg pharvis lnhhexp illness, absorb(hh)
. estimates store AREG_def
. qui xtreg pharvis lnhhexp illness, fe
. estimates store XTREG_def
. qui regress pharvis lnhhexp illness i.hh, vce(cluster hh)
. estimates store LSDV_clu
. qui areg pharvis lnhhexp illness, absorb(hh) vce(cluster hh)
. display "AREG se times square-root(2) = " _se[illness]/sqrt(2)
AREG se times square-root(2) = .12817079
. estimates store AREG_clu
. qui xtreg pharvis lnhhexp illness, fe vce(cluster hh)
. estimates store XTREG_clu
. estimates table LSDV_def AREG_def XTREG_def LSDV_clu AREG_clu XTREG_clu,
> keep(illness lnhhexp _cons) b(%8.4f) se stats(N r2 df_m) varwidth(8)

Variable |  LSDV_def   AREG_def  XTREG_def   LSDV_clu   AREG_clu  XTREG_clu
---------+------------------------------------------------------------------
 illness |    0.8402     0.8402     0.8402     0.8402     0.8402     0.8402
         |    0.0949     0.0949     0.0949     0.1813     0.1813     0.1282
 lnhhexp |   -0.3722  (omitted)  (omitted)    -0.3722  (omitted)  (omitted)
         |    0.4035                           0.0214
   _cons |    2.3636     0.0705     0.0705     2.3636     0.0705     0.0705
         |    1.3794     0.1095     0.1095     0.1930     0.1778     0.1257
---------+------------------------------------------------------------------
       N |       882        882        882        882        882        882
      r2 |    0.6188     0.6188     0.1512     0.6188     0.6188     0.1512
    df_m |  441.0000     1.0000   441.0000     0.0000     1.0000     0.0000
----------------------------------------------------------------------------
                                                              Legend: b/se

. restore


The areg and xtreg, fe commands yield exactly the same coefficient
estimates. The LSDV estimates fit using the regress command provide the
same fit and the same coefficient for illness. The apparent differences in
the other coefficients arise for the following reasons. The LSDV intercept
differs because the areg and xtreg, fe commands add in the grand means
to estimate (6.7) rather than (6.6). The LSDV coefficient for lnhhexp is
nonzero because, while lnhhexp is not identified, the regress command
does not know the source of nonidentification and instead dropped one of the
household dummy variables.


The default standard errors for illness are the same across all three 
methods, and the default standard errors for the intercept are the same using 
the areg and xtreg, fe commands. 


The important difference is in the cluster-robust standard errors. The
areg command and LSDV estimation lead to much larger standard errors than
the xtreg, fe command. In fact, in this example, the difference for illness
is the multiple $\sqrt{2} = 1.414$.

The reason for this stark difference is the following. The cluster-robust
estimate of the VCE uses the finite-sample correction of
$\{G/(G-1)\}\{(N-1)/(N-K)\}$, where $K$ is the number of identified
regressors and $N = 882$ in this example. There are two identified regressors,
the intercept and illness, so the xtreg, fe command sets $K = 2$ and
$N - K = 882 - 2 = 880$. The areg command, and regress used for the
LSDV model, additionally counts the 440 identified dummy variables, leading
to $K^* = 442$ and $N - K^* = 882 - 442 = 440$. This is exactly one-half of
$N - K$ computed using the xtreg, fe command, leading to a VCE for LSDV
and areg that is twice as large and standard errors that are $\sqrt{2}$ times
as large. For cluster-robust standard error correction, the dummy variables
should not be counted in computing the degrees of freedom correction; one
should use xtreg, fe.
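The arithmetic behind the factor of $\sqrt{2}$ can be checked with a few lines of Python. This is only a sketch of the finite-sample correction described above; the function name cluster_correction is a hypothetical helper, and the inputs are the values quoted in the text for the two-person-household subsample.

```python
import math


def cluster_correction(n_obs, n_clusters, k):
    """Finite-sample factor {G/(G-1)} * {(N-1)/(N-K)} applied to the VCE."""
    return (n_clusters / (n_clusters - 1)) * ((n_obs - 1) / (n_obs - k))


N, G = 882, 441            # observations and households in the subsample
K_xtreg = 2                # intercept and illness
K_areg = K_xtreg + 440     # additionally counts the 440 identified dummies

c_xtreg = cluster_correction(N, G, K_xtreg)
c_areg = cluster_correction(N, G, K_areg)

vce_ratio = c_areg / c_xtreg       # ratio of the two VCE estimates
se_ratio = math.sqrt(vce_ratio)    # ratio of the standard errors

print(vce_ratio, se_ratio)

# Scaling the reported xtreg, fe standard error of illness by this ratio
# reproduces the areg and LSDV value up to rounding.
print(round(0.1282 * se_ratio, 4))
```

The G/(G - 1) term is common to both commands and cancels, so the ratio depends only on the two values of N - K.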


More generally, the areg and LSDV standard errors are
$\sqrt{(N-K)/\{N-(G-1)-K\}}$ times the xtreg, fe standard errors, where
$K$ is the number of identified regressors, including the intercept, in the
within model (6.7), so $K = 2$ in the example above.


6.6.5 Correlated RE model


The correlated RE model, explained more fully for the linear panel model in
section 8.7.4, models the cluster-specific effect as
$\alpha_g = \bar{\mathbf{x}}_{1,g}'\boldsymbol{\gamma} + \eta_g$, where
$\eta_g$ is an i.i.d. error and $\bar{\mathbf{x}}_{1,g}$ are the within-cluster
averages of those regressors that vary within cluster. Then the model (6.5)
becomes

$$y_{gi} = \mathbf{x}_{gi}'\boldsymbol{\beta} + \bar{\mathbf{x}}_{1,g}'\boldsymbol{\gamma} + (\eta_g + \varepsilon_{gi})$$

This model is interpreted as an RE model in which the RE assumptions hold
conditionally on both $\mathbf{x}_{gi}$ and $\bar{\mathbf{x}}_{1,g}$. For the
current example, we obtain


. * RE estimation with hh fixed
. sort hh
. by hh: egen avelnhhexp = mean(lnhhexp)
. by hh: egen aveillness = mean(illness)
. xtreg pharvis avelnhhexp aveillness lnhhexp illness, re vce(cluster hh)
note: lnhhexp omitted because of collinearity.

Random-effects GLS regression                   Number of obs     =     27,753
Group variable: hh                              Number of groups  =      5,739

R-squared:                                      Obs per group:
     Within  = 0.1541                                         min =          1
     Between = 0.2071                                         avg =        4.8
     Overall = 0.1818                                         max =         19

                                                Wald chi2(3)      =    1370.89
corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000

                                 (Std. err. adjusted for 5,739 clusters in hh)
------------------------------------------------------------------------------
             |               Robust
     pharvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  avelnhhexp |    .022084   .0144866     1.52   0.127    -.0063093    .0504773
  aveillness |   .0311645   .0284538     1.10   0.273    -.0246039    .0869329
     lnhhexp |          0  (omitted)
     illness |   .6089755   .0179153    33.99   0.000     .5738621    .6440889
       _cons |   .0609077   .0393996     1.55   0.122     -.016314    .1381295
-------------+----------------------------------------------------------------
     sigma_u |  .62622406
     sigma_e |  1.0606587
         rho |  .25848181   (fraction of variance due to u_i)
------------------------------------------------------------------------------


The coefficient of illness is the same as that for the FE estimator. The
variable lnhhexp does not vary within cluster so is omitted; with a different
ordering of regressors, avelnhhexp may be omitted instead.
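The equality between the FE slope and the coefficient on the original regressor once cluster averages are added can be checked on a toy example. The sketch below is an illustration under stated assumptions, not a reproduction of the xtreg, re fit above: it uses pooled OLS rather than RE GLS (the equality holds exactly for pooled OLS by the partialling-out argument), a single regressor, made-up data, and hypothetical helper names solve and ols.

```python
# Toy clustered data: (cluster id, x, y). All values are hypothetical.
data = [
    ("a", 1.0, 2.0), ("a", 2.0, 4.5), ("a", 3.0, 6.0),
    ("b", 1.0, 5.0), ("b", 3.0, 9.5),
    ("c", 2.0, 1.0), ("c", 4.0, 4.0), ("c", 6.0, 8.0),
]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Within-cluster averages of x.
totals = {}
for g, x, _ in data:
    n, s = totals.get(g, (0, 0.0))
    totals[g] = (n + 1, s + x)
xbar = {g: s / n for g, (n, s) in totals.items()}

# Regress y on an intercept, x, and the cluster average of x.
X = [[1.0, x, xbar[g]] for g, x, _ in data]
y = [yi for _, _, yi in data]
b_cre = ols(X, y)

# Within (FE) slope computed directly from cluster-demeaned data.
ytot = {}
for g, _, yi in data:
    n, s = ytot.get(g, (0, 0.0))
    ytot[g] = (n + 1, s + yi)
ybar = {g: s / n for g, (n, s) in ytot.items()}
sxy = sum((x - xbar[g]) * (yi - ybar[g]) for g, x, yi in data)
sxx = sum((x - xbar[g]) ** 2 for g, x, _ in data)
beta_fe = sxy / sxx

print(b_cre[1], beta_fe)
```

The coefficient on the cluster average, b_cre[2], then measures how far the between-cluster relationship departs from the within-cluster one.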


This method can be useful in those nonlinear models for which there is 
no standard FE estimator in short panels because of the incidental parameters 
problem; see section 22.2.1. 


A variant of this specification used in the peer group literature includes 
as regressor the average of x taken over a peer group or over a reference 
group analogous to a peer group. The intention is to control for the impact of 
the average characteristics of the peer or reference group, the latter being a 
subset of the sample under study. Manski’s (1993) “linear-in-the-means”
model includes as regressors averages of regressors of members of the peer
group; this, too, can be interpreted as a set of controls for capturing
cross-sectional dependence or social interdependence.


6.6.6 Methods to compute two-way FE models 


Two-way FE models arise when there are fixed effects in two distinct
dimensions that are nonnested, such as for matched employer-employee data
with potentially multiple observations for each employer and each employee.


If there are few fixed effects across both dimensions, one can use the
regress command with dummy variable regressors. Suppose the identifiers
for the two dimensions are id1 and id2. Then we can give the command
regress y x i.id1 i.id2.


If there are few fixed effects across one dimension, say, id1, and many
across the other dimension, say, id2, the preceding method may not be
practical because OLS regression then entails inverting a large matrix. This
can be avoided by using the areg or xtreg, fe command, both of which
implement the mean-differencing transformation (6.6). Then we can give the
command areg y x i.id1, absorb(id2) or the commands xtset id2 and
xtreg y x i.id1, fe. For cluster-robust inference, it is better to use
xtreg, fe as explained in section 6.6.4.


If there are many fixed effects, then again estimation may no longer be 
practical because of OLS regression of the mean-differenced model requiring 
inversion of a large matrix. For example, if the smaller of the two 
dimensions has J distinct units, then we have at least a J x J matrix to 
invert. Several methods have been proposed to avoid this matrix inversion, 
and there are several community-contributed Stata programs. These include 
reg2hdfe (Guimaraes and Portugal 2010) and felsdvreg 
(Cornelissen 2008). McCaffrey et al. (2012) provide a review. 


6.7 Linear mixed models for clustered data 


In OLS regression, there is no role for heterogeneity across individuals, aside
from the error term $u_i$, which is a noise term assumed to be uncorrelated
with regressors. The RE model (and the FE model) introduces some additional
heterogeneity by allowing the intercept to vary across clusters.


The richer linear mixed model introduces still more heterogeneity by 
permitting slope parameters to vary across clusters. This can lead to more 
efficient estimation, and additionally, this approach is especially 
advantageous in settings where heterogeneity is of intrinsic interest. For 
example, we might be interested in the variability across villages in the 
response of pharmaceutical drugs to increased illness and not just the 
average response. Similarly, one may be interested in the variability across 
classrooms and schools of individual student performance to an educational 
intervention. And it is a natural way to more accurately reflect the richer data 
structure that arises in complex surveys. 


Like the simpler RE model, the linear mixed model assumes that the 
random parameters are uncorrelated with the error term, so parameter 
estimates are inconsistent if the FE estimator is needed. If FE estimation is 
unnecessary, then linear mixed-model estimates of slope parameter means 
are consistent, and by allowing a richer specification of the random 
components of the model, the mixed linear model can lead to more efficient 
estimation. If the model for parameter heterogeneity is misspecified,
however, then the linear mixed-model estimates of the variances and
covariances of slope parameter estimates are inconsistent, and cluster-robust
standard errors should be used rather than default standard errors.


The linear mixed model is widely used in other disciplines but is not 
often used by microeconometricians despite growing interest in applied 
economics research in heterogeneous responses. Reasons for this lack of use 
may include the microeconometrics focus on heterogeneity at the individual 
level, rather than at a cluster level, the desire to minimize distributional 
assumptions, and a focus on controlling for endogeneity with data that are 
usually observational. 


6.7.1 Linear mixed model 


The mixed linear model specifies a model for the conditional variance and
covariances of $y_{gi}$ that can depend on observable variables.

For individual $i$ in cluster $g$, the conditional mean of $y_{gi}$ is
specified to be $\mathbf{x}_{gi}'\boldsymbol{\beta}$, where the regressors
$\mathbf{x}_{gi}$ include an intercept.

The simplest version of the mixed linear model is a two-level model,
where the two levels are the individual and the cluster. Then the observed
value $y_{gi}$ equals this conditional mean plus an error term
$\mathbf{z}_{gi}'\mathbf{u}_g + \varepsilon_{gi}$, where $\mathbf{z}_{gi}$ are
observable variables and $\mathbf{u}_g$ and $\varepsilon_{gi}$ are i.i.d.
normally distributed random variables. We have

$$y_{gi} = \mathbf{x}_{gi}'\boldsymbol{\beta} + \mathbf{z}_{gi}'\mathbf{u}_g + \varepsilon_{gi} \qquad (6.9)$$

where $\mathbf{u}_g \sim N(\mathbf{0}, \boldsymbol{\Sigma}_u)$ and
$\varepsilon_{gi} \sim N(0, \sigma_\varepsilon^2)$. The variances and
covariances in $\boldsymbol{\Sigma}_u$ are called RE parameters.

The mixed models literature refers to the conditional mean parameters
$\boldsymbol{\beta}$ as fixed effects, which are contrasted with the error
terms $\mathbf{u}_g$, which are called random effects. We minimize use of this
terminology because this is a very different use of the term “fixed effects”
from that typically used by economists. Indeed, if the fixed effects defined
in section 6.6 are present, then the estimators of this section are
inconsistent.

Specific choices of $\mathbf{z}_{gi}$ lead to some standard models. OLS
corresponds to $\mathbf{z}_{gi} = 0$. The RE model of section 6.5 corresponds
to $z_{gi} = 1$ because only the intercept is random. A model often called the
random-coefficients model sets $\mathbf{z}_{gi} = \mathbf{x}_{gi}$ so that, for
the regressor $\mathbf{x}_{gi}$, both the intercept and slope coefficients are
random. The hierarchical linear models framework (see section 6.7.7) leads to
further choices of $\mathbf{z}_{gi}$.


6.7.2 Linear mixed-model estimator 


Mixed linear models can have several levels. Regardless of the number of
levels, the general form of the model upon stacking all $N$ observations
across all levels is

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}$$

where $\mathbf{u}$ and $\boldsymbol{\varepsilon}$ are zero-mean multivariate
normal errors that are independent of each other.

The combined error term $\mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}$ is
therefore multivariate normal, with variance matrix that, for the two-level
model (6.9), is parameterized by $\boldsymbol{\Sigma}_u$ and
$\sigma_\varepsilon^2$. This is a natural setting for FGLS estimation, or one
can use the MLE given the assumption of normality.
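To make the variance matrix of the combined error concrete, the following sketch builds, for a single cluster, the matrix $\mathbf{Z}_g \boldsymbol{\Sigma}_u \mathbf{Z}_g' + \sigma_\varepsilon^2 \mathbf{I}$ implied by the two-level model (6.9) with independent $\mathbf{u}_g$ and $\varepsilon_{gi}$. The function name and all numerical values are hypothetical and chosen only for illustration.

```python
def cluster_error_variance(Z, Sigma_u, sigma2_e):
    """Return V = Z Sigma_u Z' + sigma2_e * I for the observations in one cluster.

    Z is a list of rows z_gi, Sigma_u is a q x q list of lists, and
    sigma2_e is the variance of the individual-level error.
    """
    n, q = len(Z), len(Sigma_u)
    V = []
    for i in range(n):
        row = []
        for j in range(n):
            v = sum(Z[i][a] * Sigma_u[a][b] * Z[j][b]
                    for a in range(q) for b in range(q))
            if i == j:
                v += sigma2_e
            row.append(v)
        V.append(row)
    return V


# Random-intercept model: z_gi = 1 for a cluster with three members.
V_ri = cluster_error_variance([[1.0], [1.0], [1.0]], [[0.25]], 1.0)

# Random intercept and slope: z_gi = (1, w_gi) with hypothetical w = 0, 1, 2.
Sigma_u = [[0.2, 0.05],
           [0.05, 0.8]]
V_rs = cluster_error_variance([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]],
                              Sigma_u, 1.0)

for row in V_ri:
    print(row)
for row in V_rs:
    print(row)
```

In the random-intercept case every pair of observations in the cluster has the same covariance, whereas with a random slope both the variances and the covariances change with the value of the variable in $\mathbf{z}_{gi}$.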


The mixed command estimates by ML, the default, or by an 
asymptotically equivalent variation of ML, called restricted maximum 
likelihood, that produces variance estimates that are unbiased in balanced 
samples. The dfmethod() option specifies different methods for computing 
the degrees of freedom. 


While homoskedasticity and normality of the errors $\mathbf{u}_g$ and
$\varepsilon_{gi}$ are assumed, they are not necessary for consistency of the
estimator of $\boldsymbol{\beta}$, because consistency requires essentially
that $E(\mathbf{y} \mid \mathbf{X}, \mathbf{Z}) = \mathbf{X}\boldsymbol{\beta}$,
which in turn requires that $\mathbf{u}_g$ and $\varepsilon_{gi}$ have mean
zero.

Despite this robustness of the mixed estimator of $\boldsymbol{\beta}$,
homoskedasticity of the errors $\mathbf{u}_g$ and $\varepsilon_{gi}$ is needed
to obtain consistent estimates of the variances and covariances measuring
heterogeneity. It is also needed for default standard errors to be correct.
Otherwise, one should obtain cluster-robust standard errors that are valid
even if the errors are not i.i.d. These robust standard errors are valid
provided the errors are independent across clusters and the number of
clusters is large.


6.7.3 The mixed command 


The mixed command fits a multilevel mixed-effects model. The dependent
variable and regressors in (6.9) are defined, followed by two vertical bars
(||) and a definition of the portion of the model for the random effects
$\mathbf{u}_g$.

For example, in a single-level model, the command mixed y x || id: z,
mle is used if we want to regress $y_{gi}$ on an intercept and $x_{gi}$; the
variable id identifies cluster $g$; the random effects vary with an intercept
and variable $z_{gi}$; and estimation is by ML.

The general syntax of the command is

    mixed depvar fe_equation [|| re_equation] [|| re_equation ...]
        [, options]

This general form allows for additional levels of clustering; see
section 6.7.7.

The dependent variable $y_{gi}$ is given in depvar. The regressors are defined
in fe_equation, which has the syntax

    [indepvars] [if] [in] [, fe_options]

where indepvars defines the regressors $\mathbf{x}_{gi}$, and the fe_option
noconstant is added if no intercept is to be included.

The RE model is given in re_equation, which for random coefficients and
intercepts has the syntax

    levelvar: [varlist] [, re_options]

where levelvar identifies the cluster, varlist gives the variables
$\mathbf{z}_{gi}$, and the noconstant re_option is added if there is to be no
random intercept. For random effects among the values of a factor variable,
re_equation has the alternative syntax

    levelvar: R.varname [, re_options]

The re_option covariance(vartype) places structure on $\boldsymbol{\Sigma}_u$,
where vartype includes independent (the default) with $\boldsymbol{\Sigma}_u$
diagonal and unstructured with no structure placed on $\boldsymbol{\Sigma}_u$.


6.7.4 Random intercept model 


The random-intercept model restricts $\mathbf{z}_{gi}'\mathbf{u}_g = u_g$, so
$u_g$ is a scalar and $z_{gi}$ is a scalar equal to one, and is identical to
the RE model of section 6.5.3.

The model can be fit by using the mixed command. For a random
intercept varying by hh, the RE portion of the model is defined simply as hh:.
Here hh identifies the cluster, and no variables are listed after the colon
because the default is to include a random intercept. We use the mle and
vce(robust) options to obtain

. * Random hh intercept model estimated using mixed
. mixed pharvis lnhhexp illness || hh: , mle vce(robust)

Performing EM optimization ...

Performing gradient-based optimization:
Iteration 0:  log pseudolikelihood = -43424.009
Iteration 1:  log pseudolikelihood = -43423.989
Iteration 2:  log pseudolikelihood = -43423.989

Computing standard errors ...

Mixed-effects regression                        Number of obs     =     27,753
Group variable: hh                              Number of groups  =      5,739
                                                Obs per group:
                                                              min =          1
                                                              avg =        4.8
                                                              max =         19
                                                Wald chi2(2)      =    1358.18
Log pseudolikelihood = -43423.989               Prob > chi2       =     0.0000
                                 (Std. err. adjusted for 5,739 clusters in hh)
------------------------------------------------------------------------------
             |               Robust
     pharvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     lnhhexp |   .0192194   .0147418     1.30   0.192     -.009674    .0481128
     illness |   .6178673   .0167863    36.81   0.000     .5849668    .6507679
       _cons |   .0815646   .0393568     2.07   0.038     .0044266    .1587026
------------------------------------------------------------------------------

------------------------------------------------------------------------------
                             |               Robust
  Random-effects parameters  |   Estimate   std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
hh: Identity                 |
                  var(_cons) |   .2743023   .0337261      .2155621    .3490491
-----------------------------+------------------------------------------------
               var(Residual) |   1.146892   .0849008      .9919978    1.325972
------------------------------------------------------------------------------


The coefficient estimates are similar (within 5%) to those from the xtreg,
re command in section 6.5. And, as shown below, they are identical to those
obtained from the xtreg, mle command. The cluster-specific error $u_g$ has
estimated variance 0.274, and the individual-specific error
$\varepsilon_{gi}$ has variance 1.147.
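These two variance estimates imply the share of the error variance attributable to the cluster-specific component, the quantity that xtreg output labels rho (fraction of variance due to u_i). The sketch below computes it as $\sigma_u^2/(\sigma_u^2 + \sigma_\varepsilon^2)$; the function name variance_share is a hypothetical helper. As a check on the formula, the same calculation is applied to the sigma_u and sigma_e values reported by xtreg, fe in section 6.6.2.

```python
def variance_share(var_u, var_e):
    """Fraction of the total error variance due to the cluster-specific error."""
    return var_u / (var_u + var_e)


# Variance estimates from the random-intercept fit using mixed.
rho_mixed = variance_share(0.2743023, 1.146892)

# sigma_u and sigma_e reported by xtreg, fe are standard deviations,
# so they are squared first.
rho_fe = variance_share(0.82219146 ** 2, 1.0606587 ** 2)

print(round(rho_mixed, 4), round(rho_fe, 6))
```

The second value agrees with the rho reported in the xtreg, fe output, which suggests the first can be read in the same way for the mixed estimates.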


For this example, the reml option of mixed yields coefficient estimates
that are the same to the fourth significant digit. The reml option does not
support robust standard errors.


The default is for statistical inference to be based on standard normal p- 
values and critical values. In the special case where default standard errors 
are used, rather than robust standard errors, the dfmethod() option uses the t 
distribution with several different methods available to compute the degrees 
of freedom. 


6.7.5 Cluster-robust standard errors for mixed estimator


We compare cluster-robust standard errors following the mixed command at
broadening levels of clustering. Additionally, we demonstrate equivalence
with the xtreg, mle command. We have


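. * Robust standard errors after mixed (and equivalence with xtreg, mle)
. qui xtreg pharvis lnhhexp illness, mle
. estimates store remle_def
. qui mixed pharvis lnhhexp illness || hh: , mle
. estimates store mixed_def
. qui mixed pharvis lnhhexp illness || hh: , mle vce(robust)
. estimates store mixed_rob
. qui mixed pharvis lnhhexp illness || hh: , mle vce(cluster hh)
. estimates store mixed_hh
. qui mixed pharvis lnhhexp illness || hh: , mle vce(cluster commune)
. estimates store mixed_comm
. estimates table remle_def mixed_def mixed_rob mixed_hh mixed_comm,
> b(%10.4f) se stats(N)

    Variable | remle_def   mixed_def   mixed_rob    mixed_hh  mixed_comm
-------------+------------------------------------------------------------
pharvis      |
     lnhhexp |    0.0192      0.0192      0.0192      0.0192      0.0192
             |    0.0154      0.0154      0.0147      0.0147      0.0215
     illness |    0.6179      0.6179      0.6179      0.6179      0.6179
             |    0.0082      0.0082      0.0168      0.0168      0.0303
       _cons |    0.0816      0.0816      0.0816      0.0816      0.0816
             |    0.0413      0.0413      0.0394      0.0394      0.0575
-------------+------------------------------------------------------------
sigma_u      |
       _cons |    0.5237
             |    0.0101
-------------+------------------------------------------------------------
sigma_e      |
       _cons |    1.0709
             |    0.0051
-------------+------------------------------------------------------------
lns1_1_1     |
       _cons |               -0.6468     -0.6468     -0.6468     -0.6468
             |                0.0193      0.0615      0.0615      0.0776
-------------+------------------------------------------------------------
lnsig_e      |
       _cons |                0.0685      0.0685      0.0685      0.0685
             |                0.0048      0.0370      0.0370      0.0486
-------------+------------------------------------------------------------
Statistics   |
           N |     27753       27753       27753       27753       27753
--------------------------------------------------------------------------
                                                            Legend: b/se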


The first two columns demonstrate that the xtreg, mle command yields the
same estimates and standard errors for the estimate of $\boldsymbol{\beta}$ as
the mixed command. The error variances are also identical because, for
example, $\ln(0.5237) = -0.6468$.

The standard errors increase as the level of clustering broadens, and
because the clustering was on household, the vce(robust) option yields the
same standard errors as the vce(cluster hh) option.
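The correspondence between the two parameterizations can be verified directly. The following lines are a small check, using only values printed in the table above and in the mixed output of section 6.7.4, that the log of each standard deviation reported by xtreg, mle matches the lns1_1_1 and lnsig_e entries reported after mixed, and that the squared standard deviations match the variance estimates.

```python
import math

sigma_u = 0.5237   # sigma_u reported by xtreg, mle
sigma_e = 1.0709   # sigma_e reported by xtreg, mle

# mixed stores the logarithms of the standard deviations.
ln_sigma_u = math.log(sigma_u)
ln_sigma_e = math.log(sigma_e)
print(round(ln_sigma_u, 4), round(ln_sigma_e, 4))

# Squaring gives the variances shown as var(_cons) and var(Residual),
# up to the rounding of the four-digit inputs.
print(round(sigma_u ** 2, 3), round(sigma_e ** 2, 3))
```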


6.7.6 Random slopes model 


Although the regression parameters $\boldsymbol{\beta}$ in (6.9) are
consistently estimated if $\mathbf{u}_g$ and $\varepsilon_{gi}$ are not i.i.d.,
the variance parameters $\boldsymbol{\Sigma}_u$ and $\sigma_\varepsilon^2$
(reported here as var(_cons) and var(Residual)) are inconsistently
estimated. And there is efficiency loss in estimation of $\boldsymbol{\beta}$.

This provides motivation for using a richer model for the RE portion of
the model.

For our application, we let the random effects depend on both an
intercept and on illness, and we let $\boldsymbol{\Sigma}_u$ be unstructured.
Using default standard errors, we obtain


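. * Random-slopes model estimated using mixed
. mixed pharvis lnhhexp illness || hh: illness, mle covar(unstructured)
> difficult noheader nolog
------------------------------------------------------------------------------
     pharvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     lnhhexp |   .0035616   .0098921     0.36   0.719    -.0158265    .0229497
     illness |   .7687124    .015792    48.68   0.000     .7377607    .7996641
       _cons |   .0649558   .0269171     2.41   0.016     .0121993    .1177123
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects parameters  |   Estimate   Std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
hh: Unstructured             |
                var(illness) |   .8119174   .0236232      .7669121    .8595637
                  var(_cons) |   .0001208     .00005      .0000536    .0002721
          cov(illness,_cons) |    .009903   .0020559      .0058735    .0139324
-----------------------------+------------------------------------------------
               var(Residual) |   .7092376   .0066412      .6963399    .7223742
------------------------------------------------------------------------------
LR test vs. linear model: chi2(3) = 10653.94              Prob > chi2 = 0.0000

Note: LR test is conservative and provided only for reference.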


The first set of output gives the estimated coefficients of the intercept and
slope parameters. From the second set of output, all the variance components
parameters are statistically significantly different from 0 at 5%. Furthermore,
the heterogeneity is very large. For example, the slope coefficient for
illness has standard deviation across households of $\sqrt{0.812} = 0.901$,
which is large compared with the mean of 0.769.
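To give a sense of what this degree of heterogeneity implies, the sketch below converts the estimated mean and variance of the illness slope into a central 95% range and an implied share of households with a negative slope. It relies on the normality assumption of the mixed model for the random effects and uses the rounded estimates quoted above; the variable names are illustrative.

```python
import math

mean_slope = 0.769   # estimated mean coefficient of illness
var_slope = 0.812    # estimated var(illness) across households

sd_slope = math.sqrt(var_slope)

# Central 95% range of household-specific slopes under normality.
lower = mean_slope - 1.96 * sd_slope
upper = mean_slope + 1.96 * sd_slope

# Implied share of households with a negative slope, from the normal cdf.
z = (0.0 - mean_slope) / sd_slope
share_negative = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(round(sd_slope, 3))
print(lower, upper, share_negative)
```

A nontrivial implied share of negative slopes is a reminder that the normal distribution for the random slope is a modeling assumption and need not describe the household responses well.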


Because default standard errors were used, the output includes a joint test
that the three RE parameters are zero. The joint-test statistic has a
nonstandard and complicated distribution because a variance must be
positive, so the $\chi^2(3)$ distribution is an approximation that overstates
the p-values. Because this conservative value of p is less than 0.05 in this
example, we can definitely reject the null hypothesis at level 0.05. The
likelihood-ratio joint test statistic is not computed if the vce(robust)
option is used. Instead, a Wald test needs to be implemented; see section 23.4
for an example.

Numerical problems can be encountered when richer models for the
random effects are specified. For example, dropping the
covar(unstructured) option leads to var(_cons) going to its boundary
value of zero.

Interest can lie in predicting the random effects for each group and in
predicting the conditional mean when conditioning on both the regressors and
the random effects. For an individual observation,
$y_{gi} = \mathbf{x}_{gi}'\boldsymbol{\beta} + \mathbf{z}_{gi}'\mathbf{u}_g + \varepsilon_{gi}$.
The reffects option of the predict postestimation command computes best
linear unbiased predictions $\widehat{\mathbf{u}}_g$, and the fitted option
computes
$\widehat{y}_{gi} = \mathbf{x}_{gi}'\widehat{\boldsymbol{\beta}} + \mathbf{z}_{gi}'\widehat{\mathbf{u}}_g$.
The default xb option of predict instead computes
$\widehat{y}_{gi} = \mathbf{x}_{gi}'\widehat{\boldsymbol{\beta}}$.

In the linear mixed model,
$E(y_{gi} \mid \mathbf{x}_{gi}) = \mathbf{x}_{gi}'\boldsymbol{\beta}$, so the
margins command computes marginal means
$\mathbf{x}_{gi}'\widehat{\boldsymbol{\beta}}$ and marginal effects
$\widehat{\boldsymbol{\beta}}$.


6.7.7 Hierarchical linear models 


The mixed model presented so far has been a two-level model. Mixed 
models or hierarchical models allow additional levels. 


A three-level example is to suppose that person $i$ is in household $g$,
which is in village $k$, and that the model is a variance-components model
with

$$y_{gki} = \mathbf{x}_{gki}'\boldsymbol{\beta} + u_g + v_k + \varepsilon_{gki}$$

where $u_g$, $v_k$, and $\varepsilon_{gki}$ are i.i.d. errors.


The model can be fit using the mixed command, with the variance 
components ordered beginning with the broadest level, here commune 
because households are nested in villages. The difficult option was added 
to ensure convergence of the iterative process. We obtain 


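. * Hierarchical linear model with household and commune variance components
. mixed pharvis lnhhexp illness || commune: || hh:, mle difficult
> nolog vce(cluster commune)

Mixed-effects regression                        Number of obs     =     27,753

        Grouping information
-------------------------------------------------------------
                |     No. of       Observations per group
 Group variable |     groups    Minimum    Average    Maximum
----------------+--------------------------------------------
        commune |        194         51      143.1        206
             hh |      5,739          1        4.8         19
-------------------------------------------------------------

                                                Wald chi2(2)      =     406.45
Log pseudolikelihood = -43200.141               Prob > chi2       =     0.0000

                            (Std. err. adjusted for 194 clusters in commune)
------------------------------------------------------------------------------
             |               Robust
     pharvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     lnhhexp |  -.0409411   .0206685    -1.98   0.048    -.0814506   -.0004317
     illness |   .6140699   .0305092    20.13   0.000     .5542729    .6738668
       _cons |   .2353656   .0568166     4.14   0.000     .1240072     .346724
------------------------------------------------------------------------------

------------------------------------------------------------------------------
                             |               Robust
  Random-effects parameters  |   Estimate   std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
commune: Identity            |
                  var(_cons) |    .065903   .0107784      .0478289     .090807
-----------------------------+------------------------------------------------
hh: Identity                 |
                  var(_cons) |   .2042623   .0335477      .1480431    .2818307
-----------------------------+------------------------------------------------
               var(Residual) |   1.148946   .1117647      .9495068    1.390275
------------------------------------------------------------------------------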


The coefficient of lnhhexp has changed sign compared with the RE
estimates. Both variance components are statistically significant at level
0.05, with the household-specific error having much larger variance than the
commune-specific error.


The mixed command allows the variance components to additionally 
depend on regressors, as already demonstrated in the simpler case of a two- 
level model. 


6.7.8 Two-way RE model 


The mixed command is intended for multilevel models where the levels are 
nested. We now consider a two-way RE model that has the error 
$\alpha_g + \gamma_k + \varepsilon_{gki}$, where all three errors are i.i.d. 
and now $g$ is not nested in $k$ and $k$ is not nested in $g$. 


Rabe-Hesketh and Skrondal (2022, 488) explain how to nonetheless use 
mixed to estimate the parameters of the two-way RE model, using a result of 
Goldstein (1987) that shows how to rewrite covariance models with 
nonnested structure as nested models. 


Let the variables defining $g$ and $k$ be lev1 and lev2. Then, we specify 
the two levels as || _all: R.lev2 || lev1:. The entry _all: treats the 
entire sample as a single group, so lev1 is trivially nested within it. We 
then add R.lev2 because the R. prefix ensures the desired correlation 
pattern due to $\gamma_k$ by defining a factor structure in $g$ with 
independent factors with identical variance (see [ME] mixed). At the second 
level, the RE equation defines the covariance structure for $\alpha_g$ by 
simply using lev1:. This ordering is optimal when there are more distinct 
values of lev1 than of lev2. If this is not the case, then computation is 
faster if the roles of $g$ and $k$ are reversed, and the random effects are 
specified as || _all: R.lev1 || lev2:. 


For purely illustrative purposes, we cluster in two nonnested 
dimensions: commune and illness. There are 194 communes and only 10 
distinct values of illness, so it is fastest to use the ordering || _all: 
R.illness || commune:. We obtain 


. * Two-way nonnested errors - illness and commune
. mixed pharvis lnhhexp illness || _all: R.illness || commune: ,
>      mle difficult nolog

Mixed-effects ML regression                     Number of obs     =     27,753

Grouping information
                |     No. of       Observations per group
 Group variable |     groups    Minimum    Average    Maximum
           _all |          1     27,753   27,753.0     27,753
        commune |        194         51      143.1        206

                                                Wald chi2(2)      =      14.91
Log likelihood = -43194.504                     Prob > chi2       =     0.0006

     pharvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
     lnhhexp |  -.0383132   .0146749    -2.61   0.009    -.0670754   -.0095509
     illness |   .3714657   .1306934     2.84   0.004     .1153114      .62762
       _cons |   .6637303   .5692554     1.17   0.244    -.4519898     1.77945

   Random-effects parameters |   Estimate   Std. err.     [95% conf. interval]
_all: Identity               |
              var(R.illness) |   .8275655   .5503089      .2247895    3.046693
commune: Identity            |
                  var(_cons) |   .0656064    .007628      .0522369    .0823976
               var(Residual) |   1.295374   .0110375      1.273921    1.317189

LR test vs. linear model: chi2(2) = 1902.37               Prob > chi2 = 0.0000

Note: LR test is conservative and provided only for reference.


The coefficient of lnhhexp has changed sign compared with the RE 
estimates. Both variance components are statistically significant at level 
0.05, with the illness-specific error having much larger variance than the 
commune-specific error. 


The same results were obtained with the ordering || _all: R.commune 
|| illness:, but computation took much longer. 


In the two-way nonnested case, the variance matrix of the combined 
errors $\alpha_g + \gamma_k + \varepsilon_{gki}$ is not block diagonal, so 
it is not possible to compute robust standard errors. 


6.8 Systems of linear regressions 


In this section, we extend GLS estimation to a system of linear equations with 
errors that are correlated across equations for a given individual but are 
uncorrelated across individuals. Then cross-equation correlation of the errors 
can be exploited to improve estimator efficiency. This multivariate linear 
regression model is usually referred to in econometrics as a set of SUR 
equations. It arises naturally in many contexts in economics—a system of 
demand equations is a leading example. The GLS methods presented here can 
be extended to systems of simultaneous equations (three-stage least-squares 
estimation presented in section 7.10), panel data (chapter 8), and systems of 
nonlinear equations (section 18.11.2). 


We also illustrate how to test or impose restrictions on parameters across 
equations. This additional complication can arise with systems of equations. 
For example, consumer demand theory may impose symmetry restrictions. 


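6.8.1 Seemingly unrelated regressions model 


The model consists of $m$ linear regression equations for $N$ individuals. 
The $j$th equation for individual $i$ is 
$y_{ji} = \mathbf{x}_{ji}'\boldsymbol{\beta}_j + u_{ji}$. With all 
observations stacked, the model for the $j$th equation can be written as 
$\mathbf{y}_j = \mathbf{X}_j\boldsymbol{\beta}_j + \mathbf{u}_j$. 
We then stack the $m$ equations to give the SUR model 


$$\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_m \end{bmatrix}
= \begin{bmatrix} \mathbf{X}_1 & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{0} & \mathbf{X}_2 & & \vdots \\
\vdots & & \ddots & \mathbf{0} \\
\mathbf{0} & \cdots & \mathbf{0} & \mathbf{X}_m \end{bmatrix}
\begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \\ \vdots \\ \boldsymbol{\beta}_m \end{bmatrix}
+ \begin{bmatrix} \mathbf{u}_1 \\ \mathbf{u}_2 \\ \vdots \\ \mathbf{u}_m \end{bmatrix}$$


This has a compact representation: 


$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$$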


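The error terms are assumed to have zero mean and to be independent 
across individuals and homoskedastic. The complication is that for a given 
individual, the errors are correlated across equations, with 
$E(u_{ji}u_{j'i}|\mathbf{X}) = \sigma_{jj'}$ and $\sigma_{jj'} \neq 0$ when 
$j \neq j'$. It follows that the $N \times 1$ error vectors $\mathbf{u}_j$, 
$j = 1,\dots,m$, satisfy the assumptions 1) $E(\mathbf{u}_j|\mathbf{X}) = \mathbf{0}$; 
2) $E(\mathbf{u}_j\mathbf{u}_j'|\mathbf{X}) = \sigma_{jj}\mathbf{I}_N$; and 
3) $E(\mathbf{u}_j\mathbf{u}_{j'}'|\mathbf{X}) = \sigma_{jj'}\mathbf{I}_N$, 
$j \neq j'$. Then for the entire system, 
$\boldsymbol{\Omega} = E(\mathbf{u}\mathbf{u}') = \boldsymbol{\Sigma} \otimes \mathbf{I}_N$, 
where $\boldsymbol{\Sigma}$ is an $m \times m$ positive-definite matrix with 
$jj'$th entry $\sigma_{jj'}$ and $\otimes$ denotes the Kronecker product of 
two matrices. 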


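OLS applied to each equation yields a consistent estimator of 
$\boldsymbol{\beta}$, but the optimal estimator for this model is the GLS 
estimator. Using 
$\boldsymbol{\Omega}^{-1} = \boldsymbol{\Sigma}^{-1} \otimes \mathbf{I}_N$, 
because $\boldsymbol{\Omega} = \boldsymbol{\Sigma} \otimes \mathbf{I}_N$, we 
see the GLS estimator is 


$$\widehat{\boldsymbol{\beta}}_{\rm GLS} = \left\{\mathbf{X}'\left(\boldsymbol{\Sigma}^{-1} \otimes \mathbf{I}_N\right)\mathbf{X}\right\}^{-1}\left\{\mathbf{X}'\left(\boldsymbol{\Sigma}^{-1} \otimes \mathbf{I}_N\right)\mathbf{y}\right\} \tag{6.10}$$


with a VCE assuming homoskedastic errors independent over $i$ that is given 
by 


$$\widehat{\rm Var}(\widehat{\boldsymbol{\beta}}) = \left\{\mathbf{X}'\left(\widehat{\boldsymbol{\Sigma}}^{-1} \otimes \mathbf{I}_N\right)\mathbf{X}\right\}^{-1}$$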


FGLS estimation is straightforward, and the estimator is called the SUR 
estimator. We require only estimation and inversion of the $m \times m$ 
matrix $\boldsymbol{\Sigma}$. Computation is in two steps. First, each 
equation is estimated by OLS, and the residuals from the $m$ equations are 
used to estimate $\boldsymbol{\Sigma}$, using 
$\widehat{\mathbf{u}}_j = \mathbf{y}_j - \mathbf{X}_j\widehat{\boldsymbol{\beta}}_j$ 
and $\widehat{\sigma}_{jj'} = \widehat{\mathbf{u}}_j'\widehat{\mathbf{u}}_{j'}/N$. 
Second, $\widehat{\boldsymbol{\Sigma}}$ is substituted for 
$\boldsymbol{\Sigma}$ in (6.10) to obtain the FGLS estimator 
$\widehat{\boldsymbol{\beta}}_{\rm FGLS}$. An alternative is to further 
iterate these two steps until the estimates converge, which gives the 
iterated FGLS estimator. Although asymptotically there is no advantage from 
iterating, in finite samples there may be. Asymptotic theory assumes that 
$m$ is fixed while $N \rightarrow \infty$. 
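

The two-step computation can be sketched in plain Python for the simplest case of $m = 2$ equations, each with a single regressor and no intercept, so that every matrix in (6.10) is at most 2 x 2. This is an illustrative sketch on simulated data, not a description of how sureg is implemented; the function name sur_fgls_two_eq, the data-generating process, and the seed are assumptions made for the example.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def sur_fgls_two_eq(y1, x1, y2, x2):
    """Two-step FGLS for y_j = x_j * beta_j + u_j, j = 1, 2 (scalar regressors)."""
    n = len(y1)
    # Step 1: equation-by-equation OLS and residuals
    b1_ols = dot(x1, y1) / dot(x1, x1)
    b2_ols = dot(x2, y2) / dot(x2, x2)
    u1 = [y - b1_ols * x for y, x in zip(y1, x1)]
    u2 = [y - b2_ols * x for y, x in zip(y2, x2)]
    # Estimate Sigma using sigma_jj' = u_j'u_j' / N
    s11, s12, s22 = dot(u1, u1) / n, dot(u1, u2) / n, dot(u2, u2) / n
    # Invert the 2 x 2 matrix Sigma
    det = s11 * s22 - s12 * s12
    i11, i12, i22 = s22 / det, -s12 / det, s11 / det
    # Step 2: form A = X'(Sigma^-1 kron I_N)X and c = X'(Sigma^-1 kron I_N)y
    a11, a12, a22 = i11 * dot(x1, x1), i12 * dot(x1, x2), i22 * dot(x2, x2)
    c1 = i11 * dot(x1, y1) + i12 * dot(x1, y2)
    c2 = i12 * dot(x2, y1) + i22 * dot(x2, y2)
    # Solve the 2 x 2 system A * beta = c
    det_a = a11 * a22 - a12 * a12
    b1 = (a22 * c1 - a12 * c2) / det_a
    b2 = (a11 * c2 - a12 * c1) / det_a
    return (b1_ols, b2_ols), (b1, b2)

# Simulated data: true slopes 1 and 2, errors correlated across equations
random.seed(10101)
n = 2000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
common = [random.gauss(0, 1) for _ in range(n)]  # shared error component
y1 = [1.0 * x + c + random.gauss(0, 1) for x, c in zip(x1, common)]
y2 = [2.0 * x + c + random.gauss(0, 1) for x, c in zip(x2, common)]

ols, fgls = sur_fgls_two_eq(y1, x1, y2, x2)
print("OLS :", ols)
print("FGLS:", fgls)

# With the same regressor in both equations, FGLS coincides with OLS
ols_same, fgls_same = sur_fgls_two_eq(y1, x1, y2, x1)
print("same regressor, OLS :", ols_same)
print("same regressor, FGLS:", fgls_same)
```

The last call uses x1 in both equations; the two sets of estimates then agree up to floating-point error, which previews the special case discussed next.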


There are two cases where FGLS reduces to equation-by-equation OLS. 
The first is the obvious case of errors uncorrelated across equations, so 
$\boldsymbol{\Sigma}$ is diagonal. The second case is less obvious but can 
often arise in practice. Even if $\boldsymbol{\Sigma}$ is nondiagonal, if 
each equation contains exactly the same set of regressors, so 
$\mathbf{X}_j = \mathbf{X}_{j'}$ for all $j$ and $j'$, then it can be shown 
that the FGLS systems estimator reduces to equation-by-equation OLS. 
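

The second case can be checked directly from (6.10). The following is a brief sketch, writing $\mathbf{X}_0$ for the common $N \times K$ regressor matrix (notation introduced here only for illustration), so that $\mathbf{X} = \mathbf{I}_m \otimes \mathbf{X}_0$, and using the standard Kronecker product rules $(\mathbf{A} \otimes \mathbf{B})(\mathbf{C} \otimes \mathbf{D}) = \mathbf{AC} \otimes \mathbf{BD}$ and $(\mathbf{A} \otimes \mathbf{B})^{-1} = \mathbf{A}^{-1} \otimes \mathbf{B}^{-1}$:

```latex
\begin{aligned}
\widehat{\boldsymbol{\beta}}_{\rm GLS}
&= \left\{(\mathbf{I}_m \otimes \mathbf{X}_0)'(\boldsymbol{\Sigma}^{-1} \otimes \mathbf{I}_N)(\mathbf{I}_m \otimes \mathbf{X}_0)\right\}^{-1}
   (\mathbf{I}_m \otimes \mathbf{X}_0)'(\boldsymbol{\Sigma}^{-1} \otimes \mathbf{I}_N)\mathbf{y} \\
&= \left(\boldsymbol{\Sigma}^{-1} \otimes \mathbf{X}_0'\mathbf{X}_0\right)^{-1}
   \left(\boldsymbol{\Sigma}^{-1} \otimes \mathbf{X}_0'\right)\mathbf{y} \\
&= \left\{\mathbf{I}_m \otimes (\mathbf{X}_0'\mathbf{X}_0)^{-1}\mathbf{X}_0'\right\}\mathbf{y}
\end{aligned}
```

The final line applies $(\mathbf{X}_0'\mathbf{X}_0)^{-1}\mathbf{X}_0'$ to each $\mathbf{y}_j$ separately, which is OLS equation by equation. Because $\boldsymbol{\Sigma}$ cancels, the same argument holds when $\widehat{\boldsymbol{\Sigma}}$ replaces $\boldsymbol{\Sigma}$.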


6.8.2 The sureg command 


SUR estimation is performed in Stata by using the sureg command. This 
command requires specification of dependent and regressor variables for 
each of the m equations. The basic syntax for sureg is 


sureg (depvar1 varlist1) (depvar2 varlist2) ... (depvarN varlistN) [if] [in] [weight] 


where each pair of parentheses contains the model specification for one of 
the m linear regressions. The default is two-step SUR estimation. Specifying 
the isure option causes sureg to produce the iterated estimator. For robust 
VCE, see section 6.8.4. 


6.8.3 Application to two categories of expenditures 


The application of SUR considered here involves two dependent variables 
that are the logarithm of expenditure on prescribed drugs (ldrugexp) and 
the logarithm of expenditure on all categories of medical services other 
than drugs (ltotothr). 


This data extract from the Medical Expenditure Panel Survey is similar 
to that studied in chapter 3 and covers the Medicare-eligible population of 
those aged 65 years and more. The regressors are socioeconomic variables 
(educyr and a quadratic in age), health-status variables (actlim and 
totchr), and supplemental insurance indicators (private and medicaid). We 
have 


. * Summary statistics for seemingly unrelated regressions example
. clear all
. qui use mus206mepssur
. summarize ldrugexp ltotothr age age2 educyr actlim totchr medicaid private

    Variable |        Obs        Mean    Std. dev.       Min        Max
    ldrugexp |      3,285    6.936533    1.300312   1.386294   10.33773
    ltotothr |      3,350    7.537196     1.61298   1.098612   11.71892
         age |      3,384    74.38475    6.388984         65         90
        age2 |      3,384    5573.898     961.357       4225       8100
      educyr |      3,384    11.29108      3.7758          0         17
      actlim |      3,384    .3454492    .4755848          0          1
      totchr |      3,384    1.954492    1.326529          0          8
    medicaid |      3,384     .161643    .3681774          0          1
     private |      3,384    .5156619    .4998285          0          1


The parameters of the SUR model are estimated by using the sureg 
command. Because SUR estimation reduces to OLS if exactly the same set of 
regressors appears in each equation, we omit educyr from the model for 
ldrugexp, and we omit medicaid from the model for ltotothr. We use the 
corr option because this yields the correlation matrix for the fitted residuals 
that is used to form a test of the independence of the errors in the two 
equations. We have 


. * SUR estimation of a seemingly unrelated regressions model
. sureg (ldrugexp age age2 actlim totchr medicaid private)
>      (ltotothr age age2 educyr actlim totchr private), corr

Seemingly unrelated regression

Equation             Obs   Params         RMSE  "R-squared"      chi2   P>chi2
ldrugexp           3,251        6     1.133657      0.2284     962.07   0.0000
ltotothr           3,251        6     1.491159      0.1491     567.91   0.0000

             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
ldrugexp     |
         age |   .2630418   .0795316     3.31   0.001     .1071627    .4189209
        age2 |  -.0017428   .0005287    -3.30   0.001     -.002779   -.0007066
      actlim |   .3546589    .046617     7.61   0.000     .2632912    .4460266
      totchr |   .4005159   .0161432    24.81   0.000     .3688757     .432156
    medicaid |   .1067772   .0592275     1.80   0.071    -.0093065    .2228608
     private |   .0810116   .0435596     1.86   0.063    -.0043636    .1663867
       _cons |  -3.891259   2.975898    -1.31   0.191    -9.723911    1.941394
ltotothr     |
         age |   .2927827   .1046145     2.80   0.005      .087742    .4978234
        age2 |  -.0019247   .0006955    -2.77   0.006    -.0032878   -.0005617
      educyr |   .0652702     .00732     8.92   0.000     .0509233    .0796172
      actlim |   .7386912   .0608764    12.13   0.000     .6193756    .8580068
      totchr |   .2873668   .0211713    13.57   0.000     .2458719    .3288618
     private |   .2689068    .055683     4.83   0.000     .1597701    .3780434
       _cons |  -5.198327   3.914053    -1.33   0.184    -12.86973    2.473077

Correlation matrix of residuals:
          ldrugexp  ltotothr
ldrugexp    1.0000
ltotothr    0.1741    1.0000

Breusch-Pagan test of independence: chi2(1) =    98.590, Pr = 0.0000


There are only 3,251 observations in this regression because of missing 
values for ldrugexp and ltotothr. The lengthy output from sureg has three 
components. 


The first set of results summarizes the goodness of fit for each equation. 
For the dependent variable ldrugexp, we have $R^2 = 0.23$. A test for joint 
significance of all regressors in the equation (aside from the intercept) has a 
value of 962.07 with a p-value of p = 0.000 obtained from the $\chi^2(6)$ 
distribution because there are 6 regressors. The regressors are jointly 
significant in each equation. 


The middle set of results presents the estimated coefficients. Most 
regressors are statistically significant at the 5% level, and the regressors 
generally have a bigger impact on other expenditures than they do on drug 
expenditures. As you will see in exercise 7 at the end of this chapter, the 
coefficient estimates are similar to those from OLS, and the efficiency gains 
of SUR compared with OLS are relatively modest, with standard errors reduced 
by roughly 1%. 


The final set of results is generated by the corr option. The errors in the 
two equations are positively correlated, with 
$r_{12} = \widehat{\sigma}_{12}/\sqrt{\widehat{\sigma}_{11}\widehat{\sigma}_{22}} = 0.1741$. 
The Breusch-Pagan Lagrange multiplier test for error independence, 
computed as $N r_{12}^2 = 3251 \times 0.1741^2 = 98.54$, has p = 0.000 using 
the $\chi^2(1)$ distribution. (Because $r_{12}$ is not exactly equal to 
0.1741, the hand calculation yields 98.54, which does not exactly equal 
98.590 in the output.) 
There is statistically significant correlation between the errors in the two 
equations, as should be expected because the two categories of expenditures 
may have similar underlying determinants. At the same time, the correlation 
is not particularly strong, so the efficiency gains to SUR estimation are not 
great in this example. 
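

The hand calculation of the Breusch-Pagan statistic can be reproduced in a few lines of Python. The sketch below uses only the two numbers displayed in the sureg output, N = 3,251 and the rounded correlation 0.1741; the variable names are illustrative. The p-value uses the fact that a chi-squared(1) variate is the square of a standard normal, so its upper-tail probability can be written with the complementary error function.

```python
import math

n_obs = 3251      # observations used by sureg
r12 = 0.1741      # residual correlation as displayed (rounded to four digits)

# Breusch-Pagan LM statistic for two equations: N * r12^2
lm_stat = n_obs * r12 ** 2

# Upper tail of chi-squared(1): if Z ~ N(0,1), P(Z^2 > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lm_stat / 2.0))

print(f"LM statistic = {lm_stat:.2f}")
print(f"p-value      = {p_value:.3g}")
```

The small gap between this value and the 98.590 reported by sureg arises only from rounding of the displayed correlation.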


6.8.4 Robust standard errors for the SUR estimator 


The standard errors reported from sureg impose homoskedasticity. This is a 
reasonable assumption in this example because taking the natural logarithm 
of expenditures greatly reduces heteroskedasticity. But in other applications, 
such as using the levels of expenditures, this would not be reasonable. 


There is no option available with sureg to allow the errors to be 
heteroskedastic. However, the bootstrap and jackknife prefixes, explained 
in chapter 12, can be used. The bootstrap prefix resamples over individuals 
and provides standard errors that are valid under the weaker assumption that 
$E(u_{ji}u_{j'i}|\mathbf{X}) = \sigma_{jj',i}$ while maintaining the 
assumption of independence over individuals. As explained in section 12.3.4, 
it is good practice to use more bootstrap replications than the Stata default 
and to set a seed. We have 


. * Bootstrap to get heteroskedasticity-robust standard errors for SUR estimator
. bootstrap, reps(400) seed(10101) nodots: sureg
>      (ldrugexp age age2 actlim totchr medicaid private)
>      (ltotothr age age2 educyr actlim totchr private)

Seemingly unrelated regression

Equation             Obs   Params         RMSE  "R-squared"      chi2   P>chi2
ldrugexp           3,251        6     1.133657      0.2284     962.07   0.0000
ltotothr           3,251        6     1.491159      0.1491     567.91   0.0000

             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
ldrugexp     |
         age |   .2630418   .0743481     3.54   0.000     .1173222    .4087614
        age2 |  -.0017428   .0004929    -3.54   0.000    -.0027089   -.0007766
      actlim |   .3546589   .0462869     7.66   0.000     .2639382    .4453795
      totchr |   .4005159   .0169809    23.59   0.000     .3672339    .4337979
    medicaid |   .1067772   .0642814     1.66   0.097     -.019212    .2327664
     private |   .0810116    .044791     1.81   0.071    -.0067771    .1688002
       _cons |  -3.891259   2.794579    -1.39   0.164    -9.368532    1.586015
ltotothr     |
         age |   .2927827   .1062298     2.76   0.006     .0845762    .5009892
        age2 |  -.0019247   .0007048    -2.73   0.006    -.0033061   -.0005434
      educyr |   .0652702   .0075052     8.70   0.000     .0505602    .0799802
      actlim |   .7386912   .0619353    11.93   0.000     .6173003    .8600821
      totchr |   .2873668   .0202824    14.17   0.000      .247614    .3271196
     private |   .2689068   .0548669     4.90   0.000     .1613696     .376444
       _cons |  -5.198327   3.991338    -1.30   0.193    -13.02121    2.624553


The output shows that the bootstrap standard errors differ little from the 
default standard errors. So, as expected for this example for expenditures in 
logs, heteroskedasticity makes little difference to the standard errors. 


If data are clustered, with many clusters, then cluster-robust standard 
errors can be obtained by using a pairs cluster bootstrap. 


6.8.5 Testing cross-equation constraints 


Testing and imposing cross-equation constraints is possible using SUR 
estimation. We begin with testing. 


To test the joint significance of the age regressors, we type 


. * Test of variables in both equations
. qui sureg (ldrugexp age age2 actlim totchr medicaid private)
>      (ltotothr age age2 educyr actlim totchr private)
. test age age2

 ( 1)  [ldrugexp]age = 0
 ( 2)  [ltotothr]age = 0
 ( 3)  [ldrugexp]age2 = 0
 ( 4)  [ltotothr]age2 = 0

           chi2(  4) =   16.55
         Prob > chi2 =    0.0024


This command automatically conducted the test for both equations. 


The format used to refer to coefficient estimates when more than one 
equation is estimated is [depname]varname, where depname is the name of 
the dependent variable in the equation of interest and varname is the name of 
the regressor of interest. For example, the preceding sureg output provides 
the name ldrugexp in the first equation output. Alternatively, the command 
sureg, coeflegend provides complete names for each coefficient, such as 
[ldrugexp]age. 


A test for significance of regressors in just the first equation is therefore 


. * Test of variables in just the first equation
. test [ldrugexp]age [ldrugexp]age2

 ( 1)  [ldrugexp]age = 0
 ( 2)  [ldrugexp]age2 = 0

           chi2(  2) =   10.98
         Prob > chi2 =    0.0041


The quadratic in age in the first equation is jointly statistically significant at 
the 5% level. 


Now consider a test of a cross-equation restriction. Suppose we want to 
test the null hypothesis that having private insurance has the same impact on 
both dependent variables. We can set up the test as follows: 


. * Test of a restriction across the two equations
. test [ldrugexp]private = [ltotothr]private

 ( 1)  [ldrugexp]private - [ltotothr]private = 0

           chi2(  1) =    8.35
         Prob > chi2 =    0.0038


The null hypothesis is rejected at the 5% significance level. The coefficients 
in the two equations differ. 


6.8.6 Imposing cross-equation constraints 


We now obtain estimates that impose restrictions on parameters across 
equations. Usually, such constraints are based on economic theory. As an 
illustrative example, we impose the constraint that having private insurance 
has the same impact on both dependent variables. 


We first use the constraint command to define the constraint. 


. * Specify a restriction across the two equations 
. constraint 1 [ldrugexp]private = [ltotothr]private 


Subsequent commands imposing the constraint will refer to it by the number 
1 (any integer between 1 and 1,999 can be used). 


We then impose the constraint using the constraints() option. We have 


. * Estimate subject to the cross-equation constraint
. sureg (ldrugexp age age2 actlim totchr medicaid private)
>      (ltotothr age age2 educyr actlim totchr private), constraints(1)

Seemingly unrelated regression

Equation             Obs   Params         RMSE  "R-squared"      chi2   P>chi2
ldrugexp           3,251        6     1.134035      0.2279     974.09   0.0000
ltotothr           3,251        6     1.492163      0.1479     559.71   0.0000

 ( 1)  [ldrugexp]private - [ltotothr]private = 0

             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
ldrugexp     |
         age |   .2707053   .0795434     3.40   0.001     .1148031    .4266076
        age2 |  -.0017907   .0005288    -3.39   0.001    -.0028271   -.0007543
      actlim |   .3575386   .0466396     7.67   0.000     .2661268    .4489505
      totchr |   .3997819   .0161527    24.75   0.000     .3681233    .4314405
    medicaid |   .1473961   .0575962     2.56   0.010     .0345096    .2602827
     private |   .1482936   .0368364     4.03   0.000     .0760955    .2204917
       _cons |  -4.235088   2.975613    -1.42   0.155    -10.06718    1.597006
ltotothr     |
         age |   .2780287   .1045298     2.66   0.008      .073154    .4829034
        age2 |  -.0018298   .0006949    -2.63   0.008    -.0031919   -.0004677
      educyr |   .0703523   .0071112     9.89   0.000     .0564147    .0842899
      actlim |   .7276336   .0607791    11.97   0.000     .6085088    .8467584
      totchr |   .2874639   .0211794    13.57   0.000      .245953    .3289747
     private |   .1482936   .0368364     4.03   0.000     .0760955    .2204917
       _cons |   -4.62162   3.910453    -1.18   0.237    -12.28597    3.042727


As desired, the private variable has the same coefficient (0.148) in the two 
equations. 


More generally, separate constraint commands can be typed to specify 
many constraints, and the constraints() option will then have as an 
argument a list of the constraint numbers. 


6.8.7 The suest command for seemingly unrelated equations 


The SUR estimator jointly estimates equations by FGLS. Joint estimation has 
the benefit of potential efficiency gains and of allowing tests of cross- 
equation restrictions. 


Often, however, we wish to test cross-equation restrictions in equations 
that are separately estimated. This can be done using the postestimation 
command suest, an acronym for seemingly unrelated estimations. The basic 
syntax for suest is 


suest namelist [, options] 


where namelist is a list of names of model results previously stored using 
estimates store and options include vce(robust) and vce(cluster 
clustvar). The model results need not necessarily be from the same 
estimation command. 


As an example, consider separate OLS estimation of the models for 
ldrugexp and ltotothr, and then test that the coefficient for private is the 
same in the two equations. We have 


. * suest used for cross-equations test of separately estimated models
. qui regress ldrugexp age age2 actlim totchr medicaid private
. estimates store DRUG
. qui regress ltotothr age age2 educyr actlim totchr private
. estimates store OTHER
. suest DRUG OTHER

Simultaneous results for DRUG, OTHER            Number of obs     =      3,384

             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
DRUG_mean    |
         age |   .2764148   .0792753     3.49   0.000      .121038    .4317915
        age2 |  -.0018315   .0005265    -3.48   0.001    -.0028633   -.0007996
      actlim |    .357446   .0454683     7.86   0.000     .2683298    .4465622
      totchr |   .4035182   .0162768    24.79   0.000     .3716163    .4354201
    medicaid |   .0893386   .0622201     1.44   0.151    -.0326105    .2112878
     private |   .0775393   .0442757     1.75   0.080    -.0092395    .1643181
       _cons |  -4.402228   2.969646    -1.48   0.138    -10.22263    1.418171
DRUG_lnvar   |
       _cons |   .2695824   .0309126     8.72   0.000     .2089948    .3301699
OTHER_mean   |
         age |   .3173817   .1029119     3.08   0.002      .115678    .5190854
        age2 |  -.0020875   .0006844    -3.05   0.002    -.0034289   -.0007461
      educyr |   .0650207   .0076146     8.54   0.000     .0500964     .079945
      actlim |   .7421208   .0635696    11.67   0.000     .6175266     .866715
      totchr |    .295988   .0204506    14.47   0.000     .2559057    .3360704
     private |    .258998   .0542141     4.78   0.000     .1527402    .3652557
       _cons |    -6.1414   3.849682    -1.60   0.111    -13.68664    1.403837
OTHER_lnvar  |
       _cons |   .7906974   .0241107    32.79   0.000     .7434413    .8379534

. test _b[DRUG_mean:private] = _b[OTHER_mean:private]

 ( 1)  [DRUG_mean]private - [OTHER_mean]private = 0

           chi2(  1) =    7.93
         Prob > chi2 =    0.0049


The default for the suest command is to report heteroskedasticity-robust 
standard errors. The command suest DRUG OTHER, vce(cluster age) 
would instead produce cluster-robust standard errors and subsequent tests, 
with clustering on age. 


6.9 Survey data: Weighting, clustering, and stratification 


We now turn to a quite different topic: adjustments to standard estimation 
methods when the data are not from a simple random sample, as we have 
implicitly assumed, but instead come from complex survey data. The issues 
raised apply to all estimation methods, including single-equation least- 
squares estimation of the linear model, on which we focus here. 


Complex survey data lead to a sample that can be weighted, clustered, 
and stratified. From section 3.8, weighted estimation, if desired, can be 
performed by using the estimation command modifier [pweight=weight]. 
(This is a quite different reason for weighting than that leading to the use 
of aweights in section 6.3.4.) Valid standard errors that control for clustering 
can be obtained by using the vce(cluster clustvar) option. This is the 
usual approach in microeconometric analysis: standard errors should 
always control for any clustering of errors, and weighted analysis may or 
may not be appropriate depending on whether a census coefficient approach 
or a model approach is taken; see section 3.8.3. 


The drawback to this approach is that while it yields valid estimates, it 
ignores the improvement in precision of these estimates that arises because 
of stratification. This leads to conservative inference that uses overestimates 
of the standard errors, though for regression analysis, this overestimation 
need not be too large. The attraction of survey commands, performed in 
Stata by using the svy prefix, is that they simultaneously do all three 
adjustments, including that for stratification. 


6.9.1 Survey design 


As an example of using complex survey data, we begin with a modified 
version of nhanes2.dta, which is provided at the Stata website. These 
data come from the second National Health and Nutrition Examination 
Survey, a U.S. survey conducted in 1976-1980. 


We consider models for the hemoglobin count, a measure of the amount 
of the oxygen-transporting protein hemoglobin present in one's blood. Low 
values are associated with anemia. We estimate both the mean and the 
relationship with age and gender, restricting analysis to nonelderly adults. 
The question being asked is a purely descriptive one: how does 
hemoglobin vary with age and gender in the population? To answer the 
question, we should use sampling weights because the sample design is such 
that different types of individuals appear in the survey with different 
probabilities. 


Here is a brief explanation of the survey design for the data analyzed: 
The country is split into 32 geographical strata. Each stratum contains a 
number of primary sampling units (PSUs), where a PSU represents a county or 
several contiguous counties with an average population of several hundred 
thousand people. Exactly 2 PSUs were chosen from each of the 32 strata, and 
then several hundred individuals were sampled from each PSU. The sampling 
of PSUs and individuals within the PSU was not purely random, so sampling 
weights are provided to enable correct estimation of population means at the 
national level. Observations on individuals may be correlated within a given 
PSU but are uncorrelated across PSUs, so there is clustering on the PSU. And 
the strata are defined so that PSUs are more similar within strata than they are 
across strata. This stratification improves estimator efficiency. 


We can see descriptions and summary statistics for the key survey design 
variables and key analysis variables by typing 


. * Survey data example: NHANES II data
. clear all
. qui use mus206nhanes
. qui keep if age >= 21 & age <= 65
. describe sampl finalwgt strata psu

Variable      Storage   Display    Value
    name         type    format    label      Variable label
sampl           long    %9.0g                 Unique case identifier
finalwgt        long    %9.0g                 Sampling weight (except lead)
strata          byte    %9.0g                 Stratum identifier, 1-32
psu             byte    %9.0g                 Primary sampling unit, 1 or 2

. summarize sampl finalwgt strata psu

    Variable |        Obs        Mean    Std. dev.       Min        Max
       sampl |      8,136    33518.94    18447.04       1400      64702
    finalwgt |      8,136    12654.81    7400.205       2079      79634
      strata |      8,136    16.67146    9.431087          1         32
         psu |      8,136    1.487955    .4998856          1          2


There are three key survey design variables. The sample weights are 
given in finalwgt and take on a wide range of values, so weighting may be 
important. The strata are given in strata and are numbered 1 to 32. The psu 
variable defines each PSU within strata and takes on only values of 1 and 2 
because there are two PSUs per stratum. 


Before survey commands can be used, the survey design must be 
declared by using the svyset command. For a single-stage survey, the 
command syntax is 


svyset [psu] [weight] [, design_options options] 


For our data, we can provide all three of these quantities, as follows: 


. * Declare survey design
. svyset psu [pweight=finalwgt], strata(strata)

Sampling weights: finalwgt
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>


For our dataset, the PSU variable was named psu, and the strata variable was 
named strata, but other names could have been used. The output 
VCE: linearized means the VCE will be estimated using Taylor linearization, 
which is analogous to cluster-robust methods in the nonsurvey case. An 
alternative that we do not consider is balanced repeated replication, which 
can be an improvement on linearization and requires provision of replicate- 
weight variables that ensure respondent confidentiality, whereas provision of 
variables for the strata and PSU may not. The output FPC 1: <zero> means 
that no finite-population correction (FPC) is provided. The FPC corrects for the 
complication that sampling is without replacement rather than with 
replacement, but this correction is necessary only if a considerable portion of 
the PSUs in a stratum are actually sampled. The FPC is generally unnecessary 
for a national survey of individuals, unless the number of PSUs in a stratum is 
very small. 


The design information is given for a single-stage survey. In fact, the 
second National Health and Nutrition Examination Survey is a multistage 
survey with sample segments (usually city blocks) chosen from within each 
PSU, households chosen from within each segment, and individuals chosen 
from within each household. This additional information can also be 
provided in svyset but is often not available for confidentiality reasons, and 
by far the most important information is declaring the first-stage sampling 
units. 


The svydescribe command gives details on the survey design: 


. * Describe the survey design
. svydescribe

Survey: Describing stage 1 sampling units

Sampling weights: finalwgt
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>

                                Number of obs per unit
 Stratum   # units     # obs       Min      Mean       Max
       1         2       286       132     143.0       154
       2         2       138        57      69.0        81
       3         2       255       103     127.5       152
       4         2       369       179     184.5       190
       5         2       215        93     107.5       122
       6         2       245       112     122.5       133
       7         2       349       145     174.5       204
       8         2       250       114     125.0       136
       9         2       203        88     101.5       115
      10         2       205        97     102.5       108
      11         2       226       105     113.0       121
      12         2       253       123     126.5       130
      13         2       276       121     138.0       155
      14         2       327       163     163.5       164
      15         2       295       145     147.5       150
      16         2       268       128     134.0       140
      17         2       321       142     160.5       179
      18         2       287       117     143.5       170
      20         2       221        95     110.5       126
      21         2       170        84      85.0        86
      22         2       242        98     121.0       144
      23         2       277       136     138.5       141
      24         2       339       162     169.5       177
      25         2       210        94     105.0       116
      26         2       210       103     105.0       107
      27         2       230       110     115.0       120
      28         2       229       106     114.5       123
      29         2       351       165     175.5       186
      30         2       291       134     145.5       157
      31         2       251       115     125.5       136
      32         2       347       166     173.5       181

      31        62     8,136        57     131.2       204


For this data extract, only 31 of the 32 strata are included (stratum 19 is 
excluded), and each stratum has exactly 2 PSUs, so there are 62 distinct PSUs 
in all. 


6.9.2 Survey mean estimation 


We consider estimation of the population mean of hgb, the hemoglobin count 
with a normal range of approximately 12-15 for women and 13.5-16.5 for 
men. To estimate the population mean, we should definitely use the 
sampling weights. 
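

To see why the weights matter for a population mean, consider the following minimal Python sketch. The four observations and their weights are invented purely for illustration and are not taken from the NHANES II data; each weight is interpreted as the number of population members that the sampled person represents.

```python
# Hypothetical toy sample: (hemoglobin value, sampling weight)
sample = [(13.0, 1000), (14.0, 3000), (15.0, 1000), (16.0, 5000)]

def unweighted_mean(data):
    """Simple sample mean, ignoring the sampling weights."""
    return sum(y for y, _ in data) / len(data)

def weighted_mean(data):
    """Weighted mean: sum of w*y divided by sum of w."""
    total_weight = sum(w for _, w in data)
    return sum(w * y for y, w in data) / total_weight

population_size = sum(w for _, w in sample)  # estimated population size

print("unweighted mean:", unweighted_mean(sample))
print("weighted mean:  ", weighted_mean(sample))
print("estimated population size:", population_size)
```

In this toy sample, the observation with the largest weight pulls the weighted mean toward its value. The sum of the weights plays the same role as the population size that survey commands report.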


To additionally control for clustering and stratification, we give the svy 
prefix before mean. We have 


. * Estimate the population mean using svy:
. svy: mean hgb
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =       8,136
Number of PSUs   = 62            Population size = 102,959,526
                                 Design df       =          31

             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
         hgb |   14.29713   .0345366      14.22669    14.36757


The population mean is quite precisely estimated with a 95% confidence 
interval [14.23, 14.37]. 


What if we completely ignored the survey design? We have 


. * Estimate the population mean using no weights and no cluster
. mean hgb

Mean estimation                          Number of obs = 8,136

             |       Mean   Std. err.     [95% conf. interval]
         hgb |   14.28575   .0153361      14.25569    14.31582


In this example, the estimate of the population mean is essentially 
unchanged, but there is a big difference in the standard errors. The default 
standard-error estimate of 0.015 is wrong for two reasons: it is 
underestimated because of failure to control for clustering, and it is 
overestimated because of failure to control for stratification. Here 
0.015 < 0.035, so, as in many cases, the failure to control for clustering 
dominates and leads to great overstatement of the precision of the estimator. 


6.9.3 Survey linear regression 


The svy prefix before regress simultaneously controls for weighting, 
clustering, and stratification declared in the preceding svyset command. We 
type 


. * Regression using svy:
. svy: regress hgb age female
(running regress on estimation sample)

Survey: Linear regression

Number of strata = 31                            Number of obs   =       8,136
Number of PSUs   = 62                            Population size = 102,959,526
                                                 Design df       =          31
                                                 F(2, 30)        =     2071.57
                                                 Prob > F        =      0.0000
                                                 R-squared       =      0.3739

             |             Linearized
         hgb | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
         age |   .0021623   .0010488     2.06   0.048     .0000232    .0043014
      female |  -1.696847   .0261232   -64.96   0.000    -1.750125   -1.643568
       _cons |    15.0851   .0651976   231.38   0.000     14.95213    15.21807


The hemoglobin count increases slightly with age and is considerably lower 
for women: the estimated difference of about 1.7 is sizable when compared 
with the sample mean of 14.3. 


The same weighted estimates, with standard errors that control for 
clustering but not stratification, can be obtained without using survey 
commands. To do so, we first need to define a single variable that uniquely 
identifies each PSU, whereas survey commands can use two separate 
variables, here strata and psu, to uniquely identify each PSU. Specifically, 
strata took 31 different integer values, while psu took only the values 1 and 
2. To make 62 unique PSU identifiers, we multiply strata by 2 and add psu. 
Then we have 


. * Regression using weights and cluster on PSU 
. generate uniqpsu = 2*strata + psu // Make unique identifier for each PSU 


. regress hgb age female [pweight=finalwgt], vce(cluster uniqpsu) 
(sum of wgt is 102,959,526) 


Linear regression Number of obs = 8,136 
F(2, 61) = 1450.50 
Prob > F = 0.0000 
R-squared = 0.3739 
Root MSE = 1.0977 


(Std. err. adjusted for 62 clusters in uniqpsu) 


Robust 
hgb | Coefficient std. err. t P>|t| [95% conf. interval] 
age .0021623 .0011106 1.95 0.056 -.0000585 .0043831 
female -1.696847 .0317958 -53.37 0.000 -1.760426 -1.633267 
_cons 15.0851 .0654031 230.65 0.000 14.95432 15.21588 


The regression coefficients are the same as before. The standard errors for 
the slope coefficients are roughly 5% and 20% larger than those obtained 
when the svy prefix is used, so using survey methods to additionally control 
for stratification improves estimator efficiency. 


Finally, consider a naive OLS regression without weighting and without 
obtaining a cluster-robust VCE: 


. * Regression using no weights and no cluster 
. regress hgb age female 


Source SS df MS Number of obs = 8,136 
F(2, 8133) = 2135.79 
Model 5360.48245 2 2680.24123 Prob > F = 0.0000 
Residual 10206.2566 8,133 1.25491905 R-squared = 0.3444 
Adj R-squared = 0.3442 
Total 15566.7391 8,135 1.91355121 Root MSE = 1.1202 

hgb | Coefficient Std. err. t P>|t| [95% conf. interval] 
age .0013372 .0008469 1.58 0.114 -.0003231 .0029974 
female -1.624161 .024857 -65.34 0.000 -1.672887 -1.575435 
_cons 15.07118 .0406259 370.97 0.000 14.99154 15.15081 


Now the coefficient of age has changed considerably and standard errors are, 
erroneously, considerably smaller because of failure to control for clustering 
on the PSU. 
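

The three sets of results are easier to compare side by side. The following is a sketch that assumes the data of this section are in memory, that svyset has been run, and that uniqpsu has been generated as above; the stored-estimate names SVY, CLUSTER, and OLS are arbitrary labels chosen for illustration.

```stata
* Sketch: collect the three regressions in one table
quietly svy: regress hgb age female
estimates store SVY

quietly regress hgb age female [pweight=finalwgt], vce(cluster uniqpsu)
estimates store CLUSTER

quietly regress hgb age female
estimates store OLS

* Coefficients with standard errors beneath them, one column per method
estimates table SVY CLUSTER OLS, b(%9.5f) se(%9.5f) stats(N r2)

* An alternative way to build a unique PSU identifier, should the
* 2*strata + psu shortcut not apply to other datasets:
* egen uniqpsu2 = group(strata psu)
```

The first two columns should share coefficients and differ only in standard errors, while the third differs in both.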


For most microeconometric analyses, one should always obtain standard 
errors that control for clustering, if clustering is present. Many data extracts 
from complex survey datasets do not include data on the PSU, for 
confidentiality reasons or because the researcher did not extract the variable. 
Then a conservative approach is to use nonsurvey methods and obtain 
standard errors that cluster on a variable that subsumes the PSUs, for 
example, a geographic region such as a state. 


As emphasized in section 3.8, the issue of whether to weight in 
regression analysis (rather than mean estimation) with complex survey data 
is a subtle one. For the many microeconometrics applications that are 
assumed to include appropriate control variables, it is unnecessary. 


6.10 Additional resources 


Microeconometrics studies increasingly base inference on cluster-robust 
standard errors, both in cross-sectional settings similar to those studied in 
this chapter and in the panel setting of chapter 8. Cameron and 
MacKinnon and Webb (2020) provide surveys of the subject. While this 
chapter focused on the OLS estimator, the methods generalize to nonlinear 
estimators such as the logit estimator and to generalized methods of 
moments estimators. 


The sureg command introduces multiequation regression. Related 
multiequation commands are [MV] mvreg, [R] nlsur, and [R] reg3. The 
multivariate regression command mvreg is essentially the same as sureg. 
The nlsur command generalizes sureg to nonlinear equations; see 
section 18.11. The reg3 command generalizes the SUR model to handle 
endogenous regressors; see section 7.10. Linear mixed models are detailed 
in [ME] mixed. 


Econometrics texts give little coverage of survey methods, and the 
survey literature is a stand-alone literature that is relatively inaccessible to 
econometricians. The Stata [SVY] Stata Survey Data Reference Manual is 
quite helpful. Econometrics references include Bhattacharya (2005), 
Cameron and Trivedi (2005), and Kreuter and Valliant (2007). Abadie 
et al. (2022) propose inference that is both sampling based and design 
based; see section 24.4.7. 


6.11 Exercises 


1. Generate data by using the same DGP as that in section 6.3, and 
implement the first step of FGLS estimation to get the predicted 
variance varu. Now, compare several different methods to implement 
the second step of weighted estimation. First, use regress with the 
modifier [aweight=1/varu], as in the text. Second, manually 
implement this regression by generating the transformed variable 
try=y/sqrt(varu) and regressing try on the similarly constructed 
variables trx2, trx3, and trone, using regress with the noconstant 
option. Third, use regress with [pweight=1/varu], and show that the 
default standard errors using pweights differ from those using 
aweights because the pweights default is to compute robust standard 
errors. 

2. Consider the same DGP as that in section 6.3. Given this specification 
of the model, the rescaled equation 
\( y/w = \beta_1(1/w) + \beta_2(x_2/w) + \beta_3(x_3/w) + e \), where 
\( w = \sqrt{\exp(-1 + 0.2x_2)} \), will have the error e, which is normally 
distributed and homoskedastic. Treat w as known, and estimate this 
rescaled regression in Stata by using regress with the noconstant 
option. Compare the results with those given in section 6.3, where the 
weight w was estimated. Is there a big difference here between the GLS 
and FGLS estimators? 

3. Consider the same DGP as that in section 6.3. Suppose that we 
incorrectly assume that \( u \sim N(0, \sigma^2 x_2^2) \). Then FGLS estimates can be 
obtained by using regress with [pweight=1/x2sq], where x2sq=x2^2. 
How different are the estimates of \( (\beta_1, \beta_2, \beta_3) \) from the OLS results? 
Can you explain what has happened in terms of the consequences of 
using the wrong skedasticity function? Do the standard errors change 
much if robust standard errors are computed? Use the estat hettest 
command to check whether the regression errors in the transformed 
model are homoskedastic. 

4. For the model and data of section 6.7, verify that mixed with the mle 
option gives the same results as xtreg, mle. Also, compare the results 
with those from using mixed with the reml option. Fit the two-way RE 
model assuming random individual and time effects, and compare 
results with those from when the time effects are allowed to be fixed 
(in which case time dummies are included as regressors). 

5. Consider the same dataset as in section 6.8. Repeat the analysis of 
section 6.8 using the dependent variables drugexp and totothr, which 
are in levels rather than logs (so heteroskedasticity is more likely to be 
a problem). First, estimate the two equations using OLS with default 
standard errors and robust standard errors, and compare the standard 
errors. Second, estimate the two equations jointly using sureg, and 
compare the estimates with those from OLS. Third, use the bootstrap 
prefix to obtain robust standard errors from sureg, and compare the 
efficiency of joint estimation with that of OLS. Hint: It is much easier to 
compare estimates across methods if the estimates command is used; 
see section 3.5.6. 

6. Consider the same dataset as in section 6.9. Repeat the analysis of 
section 6.9 using all observations rather than restricting the sample to 
ages between 21 and 65 years. First, declare the survey design. 
Second, compare the unweighted mean of hgb and its standard error, 
ignoring survey design, with the weighted mean and standard error, 
allowing for all features of the survey design. Third, do a similar 
comparison for least-squares regression of hgb on age and female. 
Fourth, estimate this same regression using regress with pweights 
and cluster-robust standard errors, and compare with the survey 
results. 

7. Reconsider the dataset from section 6.8.3. Estimate the parameters of 
each equation by OLS. Compare these OLS results with the SUR results 
reported in section 6.8.3. 


Chapter 7 
Linear instrumental-variables regression 


7.1 Introduction 


The fundamental assumption for consistency of least-squares estimators is 
that the model error term is not correlated with the regressors; that is, 
Cov(u,x) = 0. 


If this assumption fails, the ordinary least-squares (OLS) estimator is 
inconsistent, and the OLS estimator can no longer be given a causal 
interpretation. Specifically, the OLS estimate \( \widehat{\beta}_j \) can no longer be interpreted 
as estimating the marginal effect on the dependent variable y of an 
exogenous change in the jth regressor variable \( x_j \). This is a fundamental 
problem because such marginal effects are a key input to economic policy. 


The instrumental-variables (IV) estimator provides a consistent estimator 
under the very strong assumption that valid instruments exist, where the 
instruments z are variables that are correlated with the regressors x and that 
satisfy Cov(u,z) = 0. 


The IV approach is the original and leading approach for estimating the 
parameters of models with endogenous regressors and errors-in-variables 
models. Mechanically, the IV method is no more difficult to implement than 
OLS regression. 


Conceptually, the IV method is more difficult than other regression 
methods because it can be very difficult to obtain instruments z that are 
valid: they must be variables that do not directly determine the dependent 
variable y to ensure that E(u|z) = 0. 


Even where such instruments exist, they may be weakly correlated with 
the endogenous regressors, leading to a great loss of estimator precision. 
And in extreme cases of weak correlation that are often encountered in 
practice, standard asymptotic theory provides a poor guide in finite 
samples. The Anderson-Rubin test provides a way to obtain valid inference 
on the coefficient of an endogenous regressor in the presence of weak 
instruments while using regular asymptotics. But this method entails 
substantial loss of power in overidentified models with many instruments. 


Then statistical inference is obtained under weak-instrument asymptotics 
that are well established when model errors are independent and identically 
distributed (i.i.d.) but are still being developed when model errors are not 
i.i.d. 


For a long time, the IV method has been the leading method used in 
microeconometrics to establish estimates that can be given a causal 
interpretation. More recently, experimental and quasiexperimental methods 
have been increasingly used; these methods are detailed in chapters 24 
and 25. 


7.2 Simultaneous equations model 


It is helpful to begin the topic of regression with endogenous regressors with 
a brief coverage of the classic simultaneous equations model because the IV 
method arises naturally in models consisting of two or more endogenous 
variables with two or more corresponding equations, one equation for each 
endogenous variable. For example, in the classic market-clearing demand 
and supply model, the two equations would correspond to the two 
endogenous interdependent variables, equilibrium price and quantity traded, 
whose values would be determined jointly by supply and demand. 


7.2.1 Structural model 


We consider a system of G equations in G variables y. The complication is 
that these variables are interdependent. Thus, they are called endogenous, 
meaning that they come from within the system. Additionally, the system 
includes K variables z, called exogenous variables, whose values are 
determined outside the system. 


For the ith observation in the system, the G stochastic equations are 
written in matrix notation as 


\[ y_i' B + z_i' \Gamma = u_i' \qquad (7.1) \]


where \( y_i \), \( B \), \( z_i \), \( \Gamma \), and \( u_i \) have dimensions (G x 1), (G x G), 
(K x 1), (K x G), and (G x 1), respectively, and \( \beta_{jk} \), the jkth entry in B, is the 
coefficient of \( y_{ki} \) in the equation for \( y_{ji} \). Then, for example, \( y_{1i} \) depends on a 
linear combination of \( y_{2i}, \ldots, y_{Gi} \), a linear combination of \( z_{1i}, \ldots, z_{Ki} \), and 
the error \( u_{1i} \). 


The matrix B is normalized to have 1 along the diagonal and should 
have rank G if the equation system is complete. The off-diagonal elements 
of B reflect the interdependence between the endogenous variables. Special 
notable cases are B diagonal, which implies no direct interdependence, and 
B lower triangular, in which case direct dependence is recursive or 
unidirectional. 


In the language of simultaneous equations, the linear model (7.1) is 
referred to as the structural model. Its central importance arises from its 
property that it purports to model the direct interdependence between 
endogenous variables. 


Interdependence between elements of y can be indirect through nonzero 
covariance between the elements of u. A standard assumption is \( E(u_i) = 0 \) 
and \( E(u_i u_i') = \Sigma \), where \( \Sigma \) is a symmetric positive definite matrix. One 
special notable case is a diagonal \( \Sigma \), which implies that random shocks that 
impact individual elements of y are mutually uncorrelated. 


The elements of matrices \( B \) and \( \Gamma \) are subject to a priori assumptions 
that play an important role in identification of the model. For example, if a 
certain off-diagonal element of B is set to 0, say \( \beta_{jk} = 0 \), then this means 
that a priori one rules out that \( y_k \) directly affects \( y_j \). Similarly, \( \gamma_{jk} = 0 \) 
means that a priori \( z_k \) does not directly impact \( y_j \). Such a priori restrictions 
on dependence are necessary for unique identification of the parameters of 
the model. 


The structural representation has a special appeal for several reasons. 


1. Equations themselves have interpretations as economic relationships 
such as, for example, demand or supply relations. 

2. Equations are grounded in economic theory in the sense that \( (B, \Gamma, \Sigma) \) 
are subject to restrictions of economic theory. 

3. B embodies “causal” or direct connections between endogenous 
variables that are often the key target of empirical inquiry. 


7.2.2 Reduced-form model 


The equation system (7.1) is driven by changes in \( (z_i, u_i) \). This can be seen 
directly by reexpressing the structural model in a solved form for \( y_i \). For 
specified values of \( (B, \Gamma) \) and \( (z_i, u_i) \), the G linear simultaneous equations 
can in principle be solved for \( y_i \) by postmultiplying the structural model 
(7.1) by \( B^{-1} \). 


Then \( y_i' + z_i' \Gamma B^{-1} = u_i' B^{-1} \), leading to the reduced-form model 


\[ y_i' = z_i' \Pi + v_i' \qquad (7.2) \]


where the reduced-form error \( v_i' = u_i' B^{-1} \) and the reduced-form coefficients 
are given by 


\[ \Pi = -\Gamma B^{-1} \qquad (7.3) \]


In the simultaneous equations framework, the reduced form (7.2) is of 
interest because of the following: 


1. It captures both the direct and indirect effects of change in an 
exogenous variable. 

2. It is always identified given sufficient sample variation. 

3. It permits conditional prediction of endogenous outcomes defined as 
\( E(y'|z') = z'\Pi \). 

4. It also embodies the a priori restrictions on the structural specification. 

5. It offers the potential of identifying the structural parameters \( B \) and \( \Gamma \) 
when there is a unique solution of (7.3). 


When the matrix equations (7.3) can be solved uniquely for the unknown 
elements of \( (B, \Gamma) \) in terms of the elements of \( \Pi \), the structural model is 
said to be identified. Note that consistent estimation of \( \Pi \) is possible under 
quite weak conditions. But, given \( \Pi \), additional conditions are required to 
identify \( (B, \Gamma) \). A necessary condition is that the matrix \( \Pi \) should have a 
sufficient number of entries to yield the required number of equations. This 
in turn means that the model should contain a sufficient number of linearly 
independent exogenous variables. This condition will need to be refined 
further. As we do so, it becomes possible to give a more precise definition of 
an instrumental variable. 
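

To make the solution of (7.3) concrete, consider as an illustration a two-equation system \( y_1 = \beta_{12} y_2 + \gamma_{12} x_1 + u_1 \) and \( y_2 = \beta_{21} y_1 + \gamma_{21} x_2 + u_2 \), with \( z = (x_1, x_2)' \). This particular system, and the choice to absorb the minus signs into the matrices so that it fits the form (7.1), are assumptions made for this sketch. Each equation excludes one exogenous variable, and that is what allows the four structural parameters to be recovered from the four reduced-form coefficients \( \pi_{rc} \) (row r for the exogenous variable, column c for the endogenous variable), provided that \( \gamma_{12} \neq 0 \), \( \gamma_{21} \neq 0 \), and \( \beta_{12}\beta_{21} \neq 1 \).

```latex
\[
B = \begin{pmatrix} 1 & -\beta_{21} \\ -\beta_{12} & 1 \end{pmatrix},
\qquad
\Gamma = \begin{pmatrix} -\gamma_{12} & 0 \\ 0 & -\gamma_{21} \end{pmatrix}
\]
\[
\Pi = -\Gamma B^{-1}
    = \frac{1}{1 - \beta_{12}\beta_{21}}
      \begin{pmatrix}
        \gamma_{12}            & \beta_{21}\gamma_{12} \\
        \beta_{12}\gamma_{21}  & \gamma_{21}
      \end{pmatrix}
\]
\[
\beta_{21} = \frac{\pi_{12}}{\pi_{11}}, \qquad
\beta_{12} = \frac{\pi_{21}}{\pi_{22}}, \qquad
\gamma_{12} = \pi_{11}(1 - \beta_{12}\beta_{21}), \qquad
\gamma_{21} = \pi_{22}(1 - \beta_{12}\beta_{21})
\]
```

If instead \( x_2 \) also appeared in the first equation, \( \Gamma \) would have no zero in that column, and the four elements of \( \Pi \) would no longer determine five structural coefficients.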


7.2.3 Recursive systems 


A special case that deserves specific mention is one in which the structural 
equations can be ordered such that the matrix B is either lower or upper 
triangular; this is also known as the recursive case because it can be 
interpreted as a case in which the endogenous variables are determined 
recursively, with the higher-order endogenous variables being determined by 
the endogenous variables with lower ordering. Recursiveness implies that 
there is no direct feedback from the higher-order variables to the lower-order 
variables. However, even in this case joint dependence between the lower- 
and higher-order variables can arise through correlated error terms. 


Methods to control for endogeneity in nonlinear models, such as those 
for binary outcomes and counts that are detailed in later chapters, are almost 
exclusively restricted to recursive systems. In the usual case of a single 
endogenous regressor, a “structural” equation specifies the dependent 
variable y1 to depend on an endogenous variable y2 and exogenous 
variables, while a “first-stage” equation specifies y2 to depend on exogenous 
variables but, importantly, not on y1. 


7.2.4 Generating a sample with simultaneous dependence 


The background of the preceding section is helpful for generating samples 
with simultaneous dependence. Monte Carlo studies of estimators for models 
with endogenous regressors typically use such samples. 


It is not possible to directly generate a sample from the structural model 
(7.1) because of the interdependence of the endogenous variables. Instead, 
we use the reduced form of the model. 


Consider the following two-equation simultaneous equations model in 
which (y1, y2) are endogenous variables and (x1, x2) are exogenous 
variables, 


\[ y_{1i} = \beta_{12} y_{2i} + \gamma_{12} x_{1i} + u_{1i} \]
\[ y_{2i} = \beta_{21} y_{1i} + \gamma_{21} x_{2i} + u_{2i} \]


where the error terms have zero mean conditional on exogenous 
variables, 


\[ E(u_{1i} | x_{1i}, x_{2i}) = 0 \]
\[ E(u_{2i} | x_{1i}, x_{2i}) = 0 \]


where \( i = 1, \ldots, N \); \( E(u_{1i}^2) = \sigma_1^2 \); \( E(u_{2i}^2) = \sigma_2^2 \); and \( E(u_{1i} u_{2i}) = \sigma_{12} \). 


Note that this model excludes some variables from each equation. Here 
x2 is excluded from the structural equation for y1, and x1 is excluded from 
the structural equation for y2. Exclusion restrictions such as these ensure 
identification of the structural parameters of this model, here \( \beta_{12} \), \( \beta_{21} \), \( \gamma_{12} \), 
and \( \gamma_{21} \). 


We wish to generate a sample with \( N = 1000 \), \( \beta_{12} = \gamma_{12} = \gamma_{21} = 1 \), 
\( \beta_{21} = 0.25 \), and with \( (x_{1i}, x_{2i}) \) and \( (u_{1i}, u_{2i}) \) being pairs of correlated 
random variables. To generate data on \( y_{1i} \) and \( y_{2i} \) given the preceding setup, 
we need to use the reduced form of the model. After a little algebra, the 
reduced form for \( y_{1i} \) is 


\[ y_{1i} = (1 - \beta_{12}\beta_{21})^{-1} \{ \gamma_{12} x_{1i} + \beta_{12} (\gamma_{21} x_{2i} + u_{2i}) + u_{1i} \} \]


Substituting this reduced-form expression for \( y_{1i} \) in the \( y_{2i} \) structural 
equation yields the second reduced-form equation. Then, one makes 
independent bivariate draws of \( (x_{1i}, x_{2i}) \) and \( (u_{1i}, u_{2i}) \) and generates 
\( (y_{1i}, y_{2i}) \) after inserting the numerical values of the structural parameters. In 
the example, \( (x_{1i}, x_{2i}) \) are bivariate normal with correlation of 0.3, and 
\( (u_{1i}, u_{2i}) \) are bivariate normal with correlation of 0.7. 


. * Generate data for simultaneous equations model with b12=c12=c21=1; b21=0.25 
. qui set obs 1000 


. set seed 10101 

. matrix C = (1, .7, 1) // Variances 1, covariance 0.7 
. drawnorm u1 u2, n(1000) corr(C) cstorage(lower) // Bivariate normal (u1, u2) 

. matrix C = (1, .3, 1) 

. drawnorm x1 x2, n(1000) corr(C) cstorage(lower) // Bivariate normal (x1, x2) 

. generate y1 =(1/(1-.25))*(x1 + 1*(x2+u2) + u1) // Reduced form for y1 


. generate y2 = 0.25*y1 + 1*x2 + u2 // Generating y2 given y1 
. summarize y1 y2 x1 x2 u1 u2 
Variable Obs Mean Std. dev. Min Max 
y1 1,000 .0721602 3.30318 -12.3202 11.20666 
y2 1,000 .0407909 2.169822 -8.320311 7.68481 
x1 1,000 .0271982 1.006009 -3.282559 3.596714 
x2 1,000 -.010926 1.003211 -2.847878 3.216181 
u1 1,000 .0041711 1.013016 -4.119125 3.093716 
u2 1,000 .0336769 1.005745 -3.293353 3.371781 
. correlate y1 y2 x1 x2 u1 u2 
(obs=1,000) 
y1 y2 x1 x2 u1 u2 
y1 1.0000 
y2 0.9465 1.0000 
x1 0.5609 0.3873 1.0000 
x2 0.5151 0.6538 0.3299 1.0000 
u1 0.6763 0.5597 0.0065 -0.0485 1.0000 
u2 0.7071 0.7281 0.0459 -0.0098 0.7006 1.0000 


In this example, the endogenous variables y1 and y2 are very highly 
correlated. 
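

Because the sample was generated from the reduced form, the structural parameters can also be backed out from reduced-form regressions (indirect least squares). The following is a minimal sketch, assuming the variables y1, y2, x1, and x2 generated above are in memory; the scalar names p11, p12, p21, and p22 are hypothetical labels introduced only for this illustration.

```stata
* Sketch: indirect least squares from the two reduced-form regressions
quietly regress y1 x1 x2
scalar p11 = _b[x1]       // effect of x1 on y1 in the reduced form
scalar p21 = _b[x2]       // effect of x2 on y1 in the reduced form

quietly regress y2 x1 x2
scalar p12 = _b[x1]       // effect of x1 on y2 in the reduced form
scalar p22 = _b[x2]       // effect of x2 on y2 in the reduced form

* x2 is excluded from the y1 equation, so it moves y1 only through y2
display "Implied b12 = " p21/p22
* x1 is excluded from the y2 equation, so it moves y2 only through y1
display "Implied b21 = " p12/p11
```

With one excluded exogenous variable per equation, these ratios should be close to the DGP values of 1 and 0.25 in a sample of this size.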


7.2.5 Estimation in the simultaneous equations example 


We consider estimation of the first equation in the system, 
\( y_{1i} = \beta_{12} y_{2i} + \gamma_{12} x_{1i} + u_{1i} \). From the previous output, 
\( \mathrm{Cor}(y_{2i}, u_{1i}) = 0.56 \), so the regressor \( y_{2i} \) is highly correlated with the error 
term. Thus, OLS yields inconsistent parameter estimates. 


We obtain 


. * OLS is inconsistent 
. regress y1 y2 x1, vce(robust) noheader 


Robust 
y1 | Coefficient std. err. t P>|t| [95% conf. interval] 
y2 1.306056 .0132079 98.88 0.000 1.280137 1.331974 
x1 .7508814 .0273001 27.50 0.000 .6973092 .8044537 
_cons -.0015376 .0255905 -0.06 0.952 -.051755 .0486797 


The quite precise slope coefficient estimates are 1.306 and 0.751, 
substantially and statistically different from the data-generating process 
(DGP) values of 1.0 and 1.0. 


The IV method explained below yields consistent estimates, given 
existence of a variable that does not belong in the equation for \( y_{1i} \) yet is 
correlated with the endogenous regressor \( y_{2i} \). The variable \( x_{2i} \) was excluded 
from the \( y_{1i} \) equation, and from the preceding output, \( \mathrm{Cor}(x_{2i}, u_{1i}) = -0.05 \) 
and \( \mathrm{Cor}(y_{2i}, x_{2i}) = 0.65 \). 


Using the ivregress command, presented in section 7.4.1, we obtain 


. * IV with valid instrument (here x2 for y2) is consistent 
. ivregress 2sls y1 (y2=x2) x1, vce(robust) noheader 


Robust 
y1 | Coefficient std. err. z P>|z| [95% conf. interval] 
y2 .9550988 .0289822 32.95 0.000 .8982947 1.011903 
x1 1.04403 .0399915 26.11 0.000 .9656476 1.122412 
_cons .0048051 .0337553 0.14 0.887 -.061354 .0709642 


Instrumented: y2 
Instruments: x1 x2 


The slope coefficient estimates are 0.955 and 1.044 and at level 0.05 are not 
statistically different from the DGP values of 1.0 and 1.0. 


Similarly, OLS of y2 on y1 and x2 yields inconsistent estimates, and 
instead we should perform IV regression with x1 an instrument for y1. 


Consistent estimates by joint estimation of the entire system can be 
obtained using the reg3 command presented in section 7.10. In this special 
example, the results are identical to those obtained by separate IV estimation 
of each equation. 
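

The two steps just described can be sketched as follows, assuming the generated variables y1, y2, x1, and x2 are still in memory. The reg3 call relies on that command's default treatment of the left-hand-side variables as endogenous and all other variables as exogenous instruments; output is not reproduced here.

```stata
* Sketch: IV estimation of the second structural equation,
* using x1 (excluded from the y2 equation) as the instrument for y1
ivregress 2sls y2 (y1 = x1) x2, vce(robust)

* Sketch: joint estimation of both structural equations
* Each parenthesized list is one equation: dependent variable first
reg3 (y1 y2 x1) (y2 y1 x2)
```

In the first command, the coefficients on y1 and x2 should be close to the DGP values of 0.25 and 1.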


7.3 Instrumental-variables regression 


IV estimation enables estimation of a single equation in a system. 
Furthermore, it does not require specification of the remaining equations in a 
structural system. It only requires specification of one or more instrumental 
variables. 


7.3.1 Basic IV theory 


We introduce IV methods in the simplest regression model, one where the 
dependent variable y is regressed on a single regressor x: 


\[ y = \beta x + u \qquad (7.4) \]


The model is written without the intercept. This leads to no loss of generality 
if both y and x are measured as deviations from their respective means. 


For concreteness, suppose y measures earnings, x measures years of 
schooling, and u is the error term. The simple regression model assumes that 
x is uncorrelated with the errors in (7.4). Then the only effect of x on y is a 
direct effect via the term \( \beta x \). Schematically, we have the following path 
diagram: 


[Path diagram: arrows from x to y and from u to y; no arrow between x and u] 


The absence of a directional arrow from u to x means that there is no 
association between x and u. Then the OLS estimator \( \widehat{\beta} = \sum_i x_i y_i / \sum_i x_i^2 \) is 
consistent for \( \beta \). 


The error u embodies all factors other than schooling that determine 
earnings. One such factor in u is ability. However, high ability will induce 
correlation between x and u because high (low) ability will on average be 
associated with high (low) years of schooling. Then a more appropriate 
schematic diagram is 


[Path diagram: arrows from x to y and from u to y, with u now also linked to x] 


where now there is an association between x and u. 


The OLS estimator \( \widehat{\beta} \) is then inconsistent for \( \beta \) because \( \widehat{\beta} \) combines the 
desired direct effect of schooling on earnings (\( \beta \)) with the indirect effect that 
people with high schooling are likely to have high ability, high u, and hence 
high y. For example, if one more year of schooling is found to be associated 
on average with a $1,000 increase in annual earnings, we are not sure how 
much of this increase is due to schooling per se (\( \beta \)) and how much is due to 
people with higher schooling having on average higher ability (so higher u). 


The regressor x is said to be endogenous, meaning it arises within a 
system that influences u. By contrast, an exogenous regressor arises outside 
the system and is unrelated to u. The inconsistency of \( \widehat{\beta} \) is referred to as 
endogeneity bias because the bias does not disappear asymptotically. 


The obvious solution to the endogeneity problem is to include as 
regressors controls for ability. But such regressors may not be available. Few 
earnings-schooling datasets additionally have measures of ability such as IQ 
tests; even if they do, there are questions about the extent to which they 
measure inherent ability. 


The IV approach provides an alternative solution. We introduce a (new) 
instrumental variable, z, which has the property that changes in z are 
associated with changes in x but do not lead to changes in y (except 
indirectly via x). This leads to the following path diagram: 


[Path diagram: arrow from z to x; arrows from x and u to y; u linked to x; no arrow from z to y] 


For example, proximity to college (z) may determine college attendance (x) 
but not directly determine earnings (y). 


The IV estimator in this simple example is \( \widehat{\beta}_{IV} = \sum_i z_i y_i / \sum_i z_i x_i \). This 
can be interpreted as the ratio of the covariance of y with z to the covariance 
of x with z or, after some algebra, as the ratio of dy/dz to dx/dz. For 
example, if a one-unit increase in z is associated with 0.2 more years of 
education and with $500 higher earnings, then \( \widehat{\beta}_{IV} \) = $500/0.2 = $2,500, so 
one more year of schooling increases earnings by $2,500. 
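

The ratio interpretation can be checked numerically. The following is a minimal sketch on simulated data; the DGP, the seed, and the variable names (z, ability, x, y) are all assumptions made for this illustration and are unrelated to the datasets used elsewhere in the chapter. Note that clear drops any data in memory, so run it in a separate session.

```stata
* Sketch: IV slope as a ratio of sample covariances (simulated data)
clear
set obs 1000
set seed 10101
generate z = rnormal()                      // instrument
generate ability = rnormal()                // unobserved; part of the error
generate x = 0.5*z + ability + rnormal()    // regressor correlated with ability
generate y = 1*x + ability + rnormal()      // true slope on x is 1

* Ratio of Cov(y,z) to Cov(x,z)
quietly correlate y z, covariance
scalar cov_yz = r(cov_12)
quietly correlate x z, covariance
scalar cov_xz = r(cov_12)
display "Covariance ratio = " cov_yz/cov_xz

* The slope from ivregress (which includes an intercept) is the same ratio
ivregress 2sls y (x = z)

* OLS for comparison: biased upward here because ability raises both x and y
regress y x
```

Because the model fit by ivregress includes a constant, the relevant moments are deviations from means, which is why sample covariances rather than raw cross products are used in the ratio.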


The IV estimator \( \widehat{\beta}_{IV} \) is consistent for \( \beta \) provided that the instrument z is 
uncorrelated with the error u and correlated with the regressor x. 


7.3.2 Model setup 


We now consider the more general regression model with the scalar 
dependent variable \( y_1 \), which depends on m endogenous regressors, denoted 
by \( y_2 \), and \( K_1 \) exogenous regressors (including an intercept), denoted by \( x_1 \). 
This model is called a structural equation, with 


\[ y_{1i} = y_{2i}' \beta_1 + x_{1i}' \beta_2 + u_i, \quad i = 1, \ldots, N \qquad (7.5) \]


The regression errors \( u_i \) are assumed to be uncorrelated with \( x_{1i} \) but are 
correlated with \( y_{2i} \). This correlation leads to the OLS estimator being 
inconsistent for \( \beta \). 


To obtain a consistent estimator, we assume the existence of at least m 
instrumental variables \( x_2 \) for \( y_2 \) that satisfy the assumption that 
\( E(u_i | x_{2i}) = 0 \). The instruments \( x_2 \) need to be correlated with \( y_2 \) so that they 
provide some information on the variables being instrumented. One way to 
motivate this is to assume that each component \( y_{2j} \) of \( y_2 \) satisfies the first- 
stage equation 


\[ y_{2ji} = x_{1i}' \pi_{1j} + x_{2i}' \pi_{2j} + v_{ji}, \quad j = 1, \ldots, m \qquad (7.6) \]


The first-stage equations have only exogenous variables on the right-hand 
side. The first-stage equation could be obtained by first specifying a 
simultaneous equations system for all endogenous variables, as in 
section 7.2.1, and thus, it is sometimes also called a reduced-form equation. 
But often first-stage equations are specified without appeal to an underlying 
simultaneous equations system. 


The exogenous regressors \( x_1 \) in (7.5) can be used as instruments for 
themselves. The challenge then is to come up with additional instruments \( x_2 \). 
Often, \( y_2 \) is scalar, m = 1, and it is enough to find one additional instrument 
\( x_2 \). More generally, with m endogenous regressors, we need at least m 
additional instruments \( x_2 \). This can be difficult because \( x_2 \) needs to be a 
variable that can be legitimately excluded from the structural model (7.5) for 
\( y_1 \). 


The model (7.5) can be more simply written as 


\[ y_i = x_i' \beta + u_i \]


where the regressor vector \( x_i' = [y_{2i}' \; x_{1i}'] \) combines endogenous and 
exogenous variables and the dependent variable is denoted by \( y \) rather than 
\( y_1 \). We similarly combine the instruments for these variables. Then the 
vector of instrumental variables (or, more simply, instruments) is 
\( z_i' = [x_{1i}' \; x_{2i}'] \), where \( x_1 \) serves as the (ideal) instrument for itself and \( x_2 \) is 
the instrument for \( y_2 \) and the instruments z satisfy the conditional moment 
restriction 


\[ E(u_i | z_i) = 0 \qquad (7.7) \]


In summary, we regress y on x using instruments z. 


7.3.3 IV estimators: IV, two-stage least-squares, and generalized method 
of moments 


The key (and in many cases, heroic) assumption is (7.7). This implies that 
\( E(z_i u_i) = 0 \), and hence the moment condition, or population zero- 
correlation condition, 


\[ E\{z_i (y_i - x_i' \beta)\} = 0 \qquad (7.8) \]


IV estimators are solutions to the sample analog of (7.8). 


We begin with the case where dim(z) = dim(x), called the just- 
identified case, where the number of instruments exactly equals the number 
of regressors. Then the sample analog of (7.8) is \( \sum_{i=1}^{N} z_i (y_i - x_i' \beta) = 0 \). 
As usual, stack the vectors \( x_i' \) into the matrix X, the scalars \( y_i \) into the vector 
y, and the vectors \( z_i' \) into the matrix Z. Then we have \( Z'(y - X\beta) = 0 \). 


Solving for \( \beta \) leads to the IV estimator 


\[ \widehat{\beta}_{IV} = (Z'X)^{-1} Z'y \]


A second case is where dim(z) < dim(x), called the not-identified or 
underidentified case, where there are fewer instruments than regressors. 
Then no consistent IV estimator exists. This situation often arises in practice. 
Obtaining enough instruments, even just one in applications with a single 
endogenous regressor, can require considerable ingenuity or access to 
unusually rich data. 


A third case is where dim(z) > dim(x), called the overidentified case, 
where there are more instruments than regressors. This can happen 
especially when economic theory leads to clear exclusion of variables from 
the equation of interest, freeing up these variables to be used as instruments 
if they are correlated with the included endogenous regressors. Then 
\( Z'(y - X\beta) = 0 \) has no solution for \( \beta \) because it is a system of dim(z) 
equations in only dim(x) unknowns. One possibility is to arbitrarily drop 
instruments to get to the just-identified case. But there are more efficient 
estimators. One estimator is the two-stage least-squares (2SLS) estimator, 


\[ \widehat{\beta}_{2SLS} = \{X'Z(Z'Z)^{-1}Z'X\}^{-1} X'Z(Z'Z)^{-1}Z'y \]


This estimator is the most efficient estimator if the errors \( u_i \) are independent 
and homoskedastic. And it equals \( \widehat{\beta}_{IV} \) in the just-identified case. 
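

The IV and 2SLS formulas can be evaluated directly in Mata. The following is a sketch that assumes the simulated variables y1, y2, x1, and x2 from section 7.2.4 are in memory, with x2 as the single excluded instrument for y2; the Mata names y, X, Z, ZZi, b_iv, and b_2sls are chosen only for this illustration.

```stata
mata:
    // Sketch: IV and 2SLS coefficient vectors from the matrix formulas
    y = st_data(., "y1")
    X = (st_data(., ("y2", "x1")), J(rows(y), 1, 1))   // regressors: y2, x1, constant
    Z = (st_data(., ("x2", "x1")), J(rows(y), 1, 1))   // instruments: x2, x1, constant

    // Just-identified IV: (Z'X)^{-1} Z'y
    b_iv = luinv(cross(Z, X)) * cross(Z, y)

    // 2SLS: {X'Z (Z'Z)^{-1} Z'X}^{-1} X'Z (Z'Z)^{-1} Z'y
    ZZi    = invsym(cross(Z, Z))
    b_2sls = invsym(cross(X, Z)*ZZi*cross(Z, X)) * cross(X, Z)*ZZi*cross(Z, y)

    // Display side by side; the columns agree when dim(z) = dim(x)
    (b_iv, b_2sls)
end
```

Here cross(A, B) computes A'B, so each line maps term by term onto the formulas above. The two columns should match up to rounding because this model is just-identified.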


The term “2SLS” arises because the estimator can be computed in two 
steps. First, estimate by OLS the first-stage regressions given in (7.6), and 
second, estimate by OLS the structural regression (7.5) with endogenous 
regressors replaced by predictions from the first step. This mechanical 
interpretation is peculiar to linear models and does not generalize to 
nonlinear regression models; see, for example, section 20.7. 
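

The two-step computation can be sketched as follows, again assuming the simulated variables y1, y2, x1, and x2 from section 7.2.4 are in memory; y2hat is a hypothetical variable name introduced for this illustration.

```stata
* Sketch: 2SLS computed "by hand" in two OLS steps
* Step 1: first-stage regression of the endogenous regressor on all instruments
quietly regress y2 x1 x2
predict y2hat, xb

* Step 2: structural regression with y2 replaced by its first-stage prediction
regress y1 y2hat x1

* Benchmark: same coefficients, but with correctly computed standard errors
ivregress 2sls y1 (y2 = x2) x1
```

The coefficients from step 2 should equal those from ivregress, but the standard errors reported in step 2 are not valid because they are based on residuals computed with y2hat rather than y2. This is one reason to use ivregress in practice.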


A quite general estimator is the linear generalized method of moments 
(GMM) estimator 


\[ \widehat{\beta}_{GMM} = (X'ZWZ'X)^{-1} X'ZWZ'y \qquad (7.9) \]


where W is any full-rank symmetric-weighting matrix. In general, the 
weights in W may depend both on data and on unknown parameters. For 
just-identified models, all choices of W lead to the same estimator. This 
estimator minimizes the objective function 


\[ Q(\beta) = \left\{ \frac{1}{N} (y - X\beta)'Z \right\} W \left\{ \frac{1}{N} Z'(y - X\beta) \right\} \qquad (7.10) \]


which is a matrix-weighted quadratic form in \( Z'(y - X\beta) \). 


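For GMM, some choices of W are better than others. The 2SLS estimator is 
obtained with weighting matrix \( W = (Z'Z)^{-1} \). The optimal GMM estimator 
uses \( W = \widehat{S}^{-1} \), so 


\[ \widehat{\beta}_{OGMM} = (X'Z\widehat{S}^{-1}Z'X)^{-1} X'Z\widehat{S}^{-1}Z'y \]


where \( \widehat{S} \) is an estimate of \( \mathrm{Var}(N^{-1/2} Z'u) \). If the errors \( u_i \) are independent 
and heteroskedastic, then \( \widehat{S} = (1/N) \sum_{i=1}^{N} \widehat{u}_i^2 z_i z_i' \), where \( \widehat{u}_i = y_i - x_i' \widehat{\beta} \) and 
\( \widehat{\beta} \) is a consistent estimator, usually \( \widehat{\beta}_{2SLS} \). The estimator reduces to \( \widehat{\beta}_{IV} \) in 
the just-identified case. 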


In later sections, we consider additional estimators. In particular, the 
limited-information maximum-likelihood (LIML) estimator, while 
asymptotically equivalent to 2SLS, has been found in research to generally 
outperform both 2SLS and GMM in finite samples, though this better 
performance is not guaranteed. 


7.3.4 Instrument validity and relevance 


All the preceding estimators have the same starting point. The instruments 
must satisfy condition (7.7). This condition is impossible to test in the just- 
identified case. And even in the overidentified case, where a test is possible 
(see section 7.4.8), instrument validity relies more on persuasive argument, 
economic theory, and norms established in prior related empirical studies. 


Additionally, the instruments must be relevant. For the model of 
section 7.3.2, this means that after controlling for the remaining exogenous 
regressors $x_1$, the instruments $x_2$ must account for significant variation in $y_2$. 
Intuitively, the stronger the association between the instruments $z$ and $x$, 
the stronger will be the identification of the model. Conversely, instruments 
that are only marginally relevant are referred to as weak instruments. 


7.3.5 Weak instruments 


The first consequence of an instrument being weak is that estimation 
becomes much less precise, so standard errors can become many times 
larger, and t statistics many times smaller, compared with those from 
(inconsistent) OLS. Then a promising t statistic of 5 from OLS estimation may 
become a t statistic of 1 from IV estimation. If this loss of precision is 
critical, then one needs to obtain better instruments or more data. 


The second consequence is that even if an IV estimator is consistent, the 
standard asymptotic theory may provide a poor approximation to the actual 
sampling distribution of the IV estimator in typical finite-sample sizes. For 
example, the asymptotic critical values of standard Wald tests can lead to 
tests whose actual size differs considerably from the nominal size, and hence 
the tests are potentially misleading. 


This problem arises in part because in finite samples the IV estimator is 
not centered on $\beta$, even though in infinite samples it is consistent for $\beta$. And 
it arises in part because the distribution is not normal in finite samples. The 
problem is referred to as the finite-sample bias of IV, even in situations 
where formally the mean of the estimator does not exist. The question “How 
large a sample size does one need before these biases become 
unimportant?” does not have a simple answer. This issue is considered 
further in sections 7.5–7.8. 


7.3.6 Robust standard-error estimates 


Table 7.1 provides a summary of three leading variants of the IV family of 
estimators. For just-identified models, we use the IV estimator because the 
other models collapse to the IV estimator in that case. For overidentified 
models, the standard estimators are 2SLS and optimal GMM. 


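Table 7.1. IV estimators and their asymptotic variances 


Estimator              Estimator definition and estimate of the VCE 

IV (just-identified)   $\hat\beta_{IV} = (Z'X)^{-1}Z'y$ 
                       $\hat V(\hat\beta) = (Z'X)^{-1}\hat S(X'Z)^{-1}$ 

2SLS                   $\hat\beta_{2SLS} = \{X'Z(Z'Z)^{-1}Z'X\}^{-1}X'Z(Z'Z)^{-1}Z'y$ 
                       $\hat V(\hat\beta) = \{X'Z(Z'Z)^{-1}Z'X\}^{-1}X'Z(Z'Z)^{-1}\hat S(Z'Z)^{-1}Z'X$ 
                                        $\times \{X'Z(Z'Z)^{-1}Z'X\}^{-1}$ 

Optimal GMM            $\hat\beta_{OGMM} = (X'Z\hat S^{-1}Z'X)^{-1}X'Z\hat S^{-1}Z'y$ 
                       $\hat V(\hat\beta_{OGMM}) = (X'Z\hat S^{-1}Z'X)^{-1}$ 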

The formulas given for estimates of the VCEs are robust estimates, where 
in this table $\hat S$ is an estimate of the asymptotic variance of $Z'u$. 


For heteroskedastic errors, $\hat S = \sum_{i=1}^{N}\hat u_i^2 z_i z_i'$, and we use the 
vce(robust) option. 


For clustered errors, we use the vce(cluster clustvar) option. In that 
case, with $G$ clusters, $\hat S = \{G/(G-1)\} \times \sum_{g=1}^{G} Z_g'\hat u_g\hat u_g'Z_g$, where $g$ denotes 
the $g$th cluster. 


Other estimates, notably heteroskedasticity- and autocorrelation-consistent 
(HAC) robust variance estimates, are also possible. 


Sections 7.5–7.7 consider alternative inference methods when 
instruments are weak. Current research is extending methods initially 
developed for i.i.d. errors to non-i.i.d. errors such as heteroskedastic and 
clustered errors. 


7.4 Instrumental-variables example 


Most estimators in this chapter are obtained using the ivregress command. 
The rest of this section applies the command to an example with a single 
endogenous regressor. 


7.4.1 The ivregress command 


The ivregress command performs IV regression and yields goodness-of-fit 
statistics, coefficient estimates, standard errors, t statistics, p-values, and 
confidence intervals. The syntax of the command is 


ivregress estimator depvar [varlist1] (varlist2 = varlist_iv) [if] [in] [weight] [, options] 


Here estimator is one of 2sls (2SLS), gmm (optimal GMM), or liml (LIML); 
depvar is the scalar dependent variable; varlist1 is the list of exogenous 
regressors; varlist2 is the list of endogenous regressors; and varlist_iv is the 
list of instruments for the endogenous regressors. Note that endogenous 
regressors and their instruments appear inside parentheses. If the model has 
several endogenous variables, they are all listed on the left-hand side of the 
equality. Because there is no iv option for estimator, in the just-identified 
case, we use the 2sls option because 2SLS is equivalent to IV in that case. 


An example of the command is ivregress 2sls y x1 x2 (y2 y3 = x3 x4 
x5). This performs 2SLS estimation of a structural-equation model with the 
dependent variable, y; two exogenous regressors, x1 and x2; two endogenous 
regressors, y2 and y3; and three instruments, x3, x4, and x5. The model is 
overidentified because there is one more instrument than there are 
endogenous regressors. 


In terms of the model of section 7.3.2, $y_1$ is depvar, $x_1$ is varlist1, $y_2$ is 
varlist2, and $x_2$ is varlist_iv. In the just-identified case, varlist2 and 
varlist_iv have the same number of variables, and we use the 2sls option to 
obtain the IV estimator. In the overidentified case, varlist_iv has more 
variables than does varlist2. 


The first option yields considerable output from the first-stage 
regression. Several useful tests regarding the instruments and the goodness 
of fit of the first-stage regression are displayed; therefore, this option is more 
convenient than the user running the first-stage regression and conducting 
tests. 


The vce(vcetype) option specifies the type of standard errors reported 
by Stata. The options for vcetype are robust, which yields 
heteroskedasticity-robust standard errors; unadjusted, which yields 
nonrobust standard errors; cluster clustvar; bootstrap; jackknife; and 
hac kernel. Various specification test statistics that are automatically 
produced by Stata are made more robust if the vce(robust) option is used. 


For the overidentified models fit by GMM, the wmatrix(wmtype) option 
specifies the type of weighting matrix used in the objective function [see $W$ 
in (7.9)] to obtain optimal GMM. Different choices of wmtype lead to 
different estimators. For heteroskedastic errors, set wmtype to robust. For 
correlation between elements of a cluster, set wmtype to cluster clustvar, 
where clustvar specifies the variable that identifies the cluster. For 
time-series data with HAC errors, set wmtype to hac kernel or hac kernel # or hac 
kernel opt; see [R] ivregress for additional details. If vce() is not specified 
when wmatrix() is specified, then vcetype is set to wmtype. The igmm option 
yields an iterated version of the GMM estimator. 


7.4.2 Data and data summary 


We consider a model with one endogenous regressor, several exogenous 
regressors, and one or more excluded exogenous variables that serve as the 
identifying instruments. 


The dataset is an extract from the Medical Expenditure Panel Survey of 
individuals over the age of 65 years, similar to the dataset described in 
section 3.2.1. The equation to be estimated has the dependent variable 
ldrugexp, the log of total out-of-pocket expenditures on prescribed 
medications. The regressors are an indicator for whether the individual holds 
either employer- or union-sponsored health insurance (hi_empunion), 
number of chronic conditions (totchr), and four sociodemographic 
variables: age in years (age), indicators for whether female (female) and 
whether black or Hispanic (blhisp), and the natural logarithm of annual 
household income in thousands of dollars (linc). 


We treat the health insurance variable hi_empunion as endogenous. The 
intuitive justification is that having such supplementary insurance on top of 
the near universal Medicare insurance for the elderly may be a choice 
variable. Even though most individuals in the sample are no longer working, 
those who expected high future medical expenses might have been more 
likely to choose a job when they were working that would provide 
supplementary health insurance upon retirement. Note that Medicare did not 
cover drug expenses for the time period we study. 


We use the global macro x2list to store the names of the variables that 
are treated as exogenous regressors. We have 


. * Read data, define global x2list, and summarize data 
. clear all 

. qui use mus207mepspresdrugs 

. global x2list totchr age female blhisp linc 

. summarize ldrugexp hi_empunion $x2list 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
    ldrugexp |     10,367     6.47934    1.363097          0   10.18017 
 hi_empunion |     10,367    .3797627    .4853511          0          1 
      totchr |     10,367    1.860712    1.290255          0          9 
         age |     10,367    75.04128    6.688851         65         91 
      female |     10,367    .5794347    .4936736          0          1 
-------------+--------------------------------------------------------- 
      blhisp |     10,367    .1702518     .375872          0          1 
        linc |     10,068    2.745753    .9118674  -6.907755   5.744476 


Slightly less than 38% of the sample has either employer or union-sponsored 
health insurance in addition to Medicare insurance. Subsequent analysis 
drops the 299 observations with missing data on linc. 


7.4.3 Available instruments 


We consider four potential instruments for hi_empunion. Two reflect the 
income status of the individual, and two are based on employer 
characteristics. 


The ssiratio instrument is the ratio of an individual’s social security 
income to the individual’s income from all sources, with high values 
indicating a significant income constraint. The lowincome instrument is a 
qualitative indicator of low-income status. Both these instruments are likely 
to be relevant because they are expected to be negatively correlated with 
having supplementary insurance. To be valid instruments, we need to assume 
they can be omitted from the equation for ldrugexp, arguing that the direct 
role of income is adequately captured by the regressor linc. 


The firmsz instrument measures the size of the firm’s employed labor 
force, and the multlc instrument indicates whether the firm is a large 
operator with multiple locations. These variables are intended to capture 
whether the individual has access to supplementary insurance through the 
employer. These two variables are irrelevant for those who are retired or 
self-employed or who purchase insurance privately. In that sense, these two 
instruments could potentially be weak. 


. * Summarize available instruments 
. summarize ssiratio lowincome multlc firmsz if linc!=. 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
    ssiratio |     10,068    .5330892    .3492841          0          1 
   lowincome |     10,068    .1873262    .3901925          0          1 
      multlc |     10,068    .0621772    .2414891          0          1 
      firmsz |     10,068    .1408224    2.172642          0         50 


We have four available instruments for one endogenous regressor. The 
obvious approach is to use all available instruments because in theory this 
leads to the most efficient estimator. In practice, it may lead to larger 
small-sample bias because the small-sample biases of IV estimators increase with 
the number of instruments (Hahn and Hausman 2002). 


At a minimum, it is informative to use correlate to view the gross 
correlation between endogenous variables and instruments and between 
instruments. When multiple instruments are available, as in the case of 
overidentified models, then it is actually the partial correlation after 
controlling for other available instruments that matters. This important step 
is deferred to sections 7.5 and 7.6.5. 
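

As a first look, the gross and partial correlations can be obtained as follows. This is a minimal sketch that assumes the data and the global macro x2list of section 7.4.2 are in memory. Using pcorr with the exogenous regressors added to the variable list is one possible way to view partial correlations, not necessarily the approach taken in the later sections.

```stata
* Gross correlations between the endogenous regressor and the candidate instruments
correlate hi_empunion ssiratio lowincome multlc firmsz if linc != .

* Partial correlation of hi_empunion with each listed variable,
* holding the other listed variables (including the exogenous regressors) constant
pcorr hi_empunion ssiratio lowincome multlc firmsz $x2list if linc != .
```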


7.4.4 IV estimation of an exactly identified model 


We begin with IV regression of ldrugexp on the endogenous regressor 
hi_empunion, instrumented by the single instrument ssiratio, and several 
exogenous regressors. 


We use ivregress with the 2sls estimator and the options vce(robust) 
to control for heteroskedastic errors and first to provide output that 
additionally reports results from the first-stage regression. The output is in 
two parts: 


. * IV estimation of a just-identified model with single endog regressor 
. ivregress 2sls ldrugexp (hi_empunion = ssiratio) $x2list, vce(robust) first 

First-stage regressions 
----------------------- 

                                                Number of obs     =     10,068 
                                                F(6, 10061)       =     165.27 
                                                Prob > F          =     0.0000 
                                                R-squared         =     0.0793 
                                                Adj R-squared     =     0.0787 
                                                Root MSE          =     0.4665 

------------------------------------------------------------------------------ 
             |               Robust 
 hi_empunion | Coefficient  std. err.      t    P>|t|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
      totchr |   .0133494   .0036385     3.67   0.000     .0062172    .0204817 
         age |  -.0084926   .0007041   -12.06   0.000    -.0098728   -.0071125 
      female |  -.0721048   .0096379    -7.48   0.000    -.0909969   -.0532126 
      blhisp |  -.0609939   .0122393    -4.98   0.000    -.0849853   -.0370025 
        linc |   .0454804   .0062021     7.33   0.000     .0333231    .0576377 
    ssiratio |  -.2205333   .0151176   -14.59   0.000    -.2501667   -.1908998 
       _cons |   1.039144   .0575002    18.07   0.000     .9264318    1.151856 
------------------------------------------------------------------------------ 

Instrumental variables 2SLS regression          Number of obs     =     10,068 
                                                Wald chi2(6)      =    2039.84 
                                                Prob > chi2       =     0.0000 
                                                R-squared         =     0.0831 
                                                Root MSE          =     1.3038 

------------------------------------------------------------------------------ 
             |               Robust 
    ldrugexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
 hi_empunion |  -.8115015   .1884024    -4.31   0.000    -1.180763   -.4422396 
      totchr |   .4492135   .0100423    44.73   0.000      .429531     .468896 
         age |  -.0122415   .0027711    -4.42   0.000    -.0176727   -.0068102 
      female |  -.0140507   .0310339    -0.45   0.651    -.0748761    .0467747 
      blhisp |  -.2089573   .0383529    -5.45   0.000    -.2841276   -.1337869 
        linc |   .0815472   .0207576     3.93   0.000     .0408631    .1222314 
       _cons |   6.692548   .2449266    27.32   0.000     6.212501    7.172595 
------------------------------------------------------------------------------ 
Instrumented: hi_empunion 
 Instruments: totchr age female blhisp linc ssiratio 


The first part, added because of the first option, reports results from the 
first-stage regression of the endogenous variable hi_empunion on all the 
exogenous variables, here ssiratio and all the exogenous regressors in the 
structural equation. The first-stage regression has reasonable explanatory 
power, and the coefficient of ssiratio is negative, as expected, and highly 
statistically significant. In models with more than one endogenous regressor, 
more than one first-stage regression is reported if the first option is used. 


The second part reports the results of intrinsic interest, those from the IV 
regression of ldrugexp on hi_empunion and several exogenous regressors. 
Supplementary insurance has a big effect. The estimated coefficient of 
hi_empunion is -0.812, indicating that supplementary-insured individuals 
have out-of-pocket drug expenses that are 56% lower ($e^{-0.812} - 1 = -0.56$) 
than those for people without employment- or union-related supplementary 
insurance. 
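

The conversion from the coefficient in the log-expenditure equation to a proportionate effect can be checked directly. The sketch below assumes the data and the global macro x2list of section 7.4.2 are in memory. The nlcom step is an optional extra, not part of the original analysis, that attaches a delta-method standard error to the transformed coefficient.

```stata
* Proportionate change implied by the reported coefficient of -0.8115015
display exp(-0.8115015) - 1

* Same quantity with a delta-method standard error after refitting the model
quietly ivregress 2sls ldrugexp (hi_empunion = ssiratio) $x2list, vce(robust)
nlcom exp(_b[hi_empunion]) - 1
```

The displayed value should be approximately -0.56, in line with the 56% figure quoted above.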


7.4.5 IV estimation of an overidentified model 


We next consider estimation of an overidentified model. Then, different 
estimates are obtained by 2SLS estimation and by different variants of GMM. 


We use two instruments, ssiratio and multlc, for hi_empunion, the 
endogenous regressor. The first estimator is 2SLS, obtained by using 2sls 
with standard errors that correct for heteroskedasticity with the vce(robust) 
option. The second estimator is optimal GMM given heteroskedastic errors, 
obtained by using gmm with the wmatrix(robust) option. These are the two 
leading estimators for overidentified IV with cross-sectional data and no 
clustering of the errors. The third estimator adds igmm to iterate to 
convergence. The fourth estimator is one that illustrates optimal GMM with 
clustered errors by clustering on age. The final estimator is the same as the 
first but reports default standard errors that do not adjust for 
heteroskedasticity. 


. * Compare five estimators and variance estimates for overidentified models 
. global ivmodel "ldrugexp (hi_empunion = ssiratio multlc) $x2list" 

. qui ivregress 2sls $ivmodel, vce(robust) 

. estimates store TwoSLS 

. qui ivregress gmm $ivmodel, wmatrix(robust) 

. estimates store GMM_het 

. qui ivregress gmm $ivmodel, wmatrix(robust) igmm 

. estimates store GMM_igmm 

. qui ivregress gmm $ivmodel, wmatrix(cluster age) 

. estimates store GMM_clu 

. qui ivregress 2sls $ivmodel 

. estimates store TwoSLS_def 

. estimates table TwoSLS GMM_het GMM_igmm GMM_clu TwoSLS_def, b(%9.5f) se 

--------------------------------------------------------------------------- 
    Variable |   TwoSLS      GMM_het     GMM_igmm     GMM_clu    TwoSLS_def 
-------------+------------------------------------------------------------- 
 hi_empunion |  -0.90466    -0.89426     -0.89428    -0.87513     -0.90466 
             |   0.17962     0.17918      0.17918     0.16313      0.17821 
      totchr |   0.45016     0.44971      0.44971     0.44834      0.45016 
             |   0.01016     0.01014      0.01014     0.01359      0.01038 
         age |  -0.01318    -0.01305     -0.01305    -0.00987     -0.01318 
             |   0.00273     0.00273      0.00273     0.00612      0.00268 
      female |  -0.02153    -0.02084     -0.02084    -0.01282     -0.02153 
             |   0.03095     0.03091      0.03091     0.02696      0.03035 
      blhisp |  -0.21530    -0.21337     -0.21337    -0.18363     -0.21530 
             |   0.03863     0.03856      0.03856     0.04577      0.03805 
        linc |   0.08891     0.08811      0.08811     0.08544      0.08891 
             |   0.02048     0.02043      0.02043     0.01264      0.02039 
       _cons |   6.78164     6.77124      6.77129     6.52887      6.78164 
             |   0.23997     0.23954      0.23954     0.48108      0.23478 
--------------------------------------------------------------------------- 
                                                               Legend: b/se 


Compared with the just-identified IV estimates of section 7.4.4, the parameter 
estimates have changed by close to 10% (aside from those for the 
statistically insignificant regressor female). The standard errors are little 
changed, except for that for hi_empunion, which has fallen by about 5–13%, 
reflecting an efficiency gain due to additional instruments. 


The differences between 2SLS, optimal GMM given heteroskedasticity, and 
iterated optimal GMM are negligible. Optimal GMM with clustering differs 
more. And the final column shows that the default standard errors for 2SLS 
differ little from the robust standard errors in the first column, reflecting the 
success of the log transformation in eliminating heteroskedasticity. 


7.4.6 Testing for regressor endogeneity 


The preceding analysis treats the insurance variable, hi_empunion, as 
endogenous. If instead the variable is exogenous, then the IV estimators (IV, 
2SLS, or GMM) are still consistent, but they can be much less efficient than the 
OLS estimator. 


The Hausman test principle provides a way to test whether a regressor is 
endogenous. If there is little difference between OLS and IV estimators, then 
there is no need to instrument, and we conclude that the regressor was 
exogenous. If instead there is considerable difference, then we need to 
instrument and the regressor was endogenous. The test usually compares just 
the coefficients of the endogenous variables. In the case of just one 
potentially endogenous regressor with a coefficient denoted by $\beta$, the 
Hausman test statistic 


$$T_H = \frac{(\hat\beta_{IV} - \hat\beta_{OLS})^2}{\hat V(\hat\beta_{IV} - \hat\beta_{OLS})}$$


is $\chi^2(1)$ distributed under the null hypothesis that the regressor is 
exogenous. 


Before considering implementation of the test, we first obtain the OLS 
estimates to compare them with the earlier IV estimates. We have 


. * Obtain OLS estimates to compare with preceding IV estimates 
. regress ldrugexp hi_empunion $x2list, vce(robust) 

Linear regression                               Number of obs     =     10,068 
                                                F(6, 10061)       =     375.46 
                                                Prob > F          =     0.0000 
                                                R-squared         =     0.1769 
                                                Root MSE          =     1.2357 

------------------------------------------------------------------------------ 
             |               Robust 
    ldrugexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
 hi_empunion |   .0732427    .026008     2.82   0.005     .0222618    .1242235 
      totchr |   .4402073    .009376    46.95   0.000     .4218285    .4585861 
         age |  -.0033708   .0019418    -1.74   0.083    -.0071772    .0004355 
      female |   .0569505   .0253798     2.24   0.025     .0072011    .1066999 
      blhisp |  -.1487419   .0341819    -4.35   0.000    -.2157453   -.0817386 
        linc |   .0116467   .0137461     0.85   0.397    -.0152985    .0385918 
       _cons |   5.846407   .1574085    37.14   0.000     5.537855    6.154959 
------------------------------------------------------------------------------ 


The OLS estimates differ substantially from the just-identified IV estimates 
given in section 7.4.4. The coefficient of hi_empunion has an OLS estimate of 
0.073, very different from the IV estimate of -0.812. This is strong evidence 
that hi_empunion is endogenous. Some coefficients of exogenous variables 
also change, notably, those for age and female. Note also the loss in 
precision in using IV. Most notably, the standard error of the instrumented 
regressor increases from 0.026 for OLS to 0.188 for IV, a roughly sevenfold increase, 
indicating the potential loss in efficiency due to IV estimation. 


The hausman command can be used to compute $T_H$ under the assumption 
that $\hat V(\hat\beta_{IV} - \hat\beta_{OLS}) = \hat V(\hat\beta_{IV}) - \hat V(\hat\beta_{OLS})$; see section 11.9.5. This greatly 
simplifies analysis because then all that is needed are coefficient estimates 
and standard errors from separate IV estimation (IV, 2SLS, or GMM) and OLS 
estimation. But this assumption is too strong. It is correct only if $\hat\beta_{OLS}$ is the 
fully efficient estimator under the null hypothesis of exogeneity, an 
assumption that is valid only under the very strong assumption that model 
errors are independent and homoskedastic. One possible variation is to 
perform an appropriate bootstrap; see section 12.4.6. 
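

For reference, a nonrobust Hausman test of this kind might be run as in the following sketch, which assumes the data and the global macro x2list of section 7.4.2 are in memory. The stored-estimate names IV_def and OLS_def are hypothetical. Both models are fit with default (nonrobust) standard errors because the test relies on the independent homoskedastic errors assumption just described, and by default the comparison covers all common slope coefficients rather than only that of hi_empunion.

```stata
* Consistent estimator under both hypotheses: 2SLS with default standard errors
quietly ivregress 2sls ldrugexp (hi_empunion = ssiratio) $x2list
estimates store IV_def

* Efficient estimator under the null of exogeneity: OLS with default standard errors
quietly regress ldrugexp hi_empunion $x2list
estimates store OLS_def

* Hausman test; sigmamore bases both variance estimates on the
* error variance estimate from the efficient (OLS) fit
hausman IV_def OLS_def, sigmamore
```

Because the assumption behind this test is too strong in most applications, the result is best read alongside the robust test described next.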


The estat endogenous command implements the related Durbin–Wu–Hausman 
(DWH) test. Because the DWH test uses the device of augmented 
regressors, it produces a robust test statistic (Davidson 2000). The essential 
idea is the following. Consider the model as specified in section 7.3.1. 
Rewrite the structural equation (7.5) with an additional variable, $v_1$, that is 
the error from the first-stage (7.6) for $y_2$. Then, 


$$y_{1i} = \beta_1 y_{2i} + x_{1i}'\beta_2 + \rho v_{1i} + u_i \qquad (7.11)$$


Under the null hypothesis that $y_{2i}$ is exogenous, $E(v_{1i}u_i \mid y_{2i}, x_{1i}) = 0$. If $v_1$ 
could be observed, then the test of exogeneity would be the test of 
$H_0$: $\rho = 0$ in the OLS regression of $y_1$ on $y_2$, $x_1$, and $v_1$. Because $v_1$ is not 
directly observed, the fitted residual vector $\hat v_1$ from the first-stage OLS 
regression (7.6) is instead substituted. 


For independent homoskedastic errors, this test is asymptotically 
equivalent to the earlier Hausman test. In the more realistic case of errors 
that are heteroskedastic or clustered or have some other complication, the 
test of $H_0$: $\rho = 0$ can still be implemented provided that we use the 
appropriate robust variance estimates. This test can be extended to the 
multiple endogenous regressors case by including multiple residual vectors 
and testing separately for correlation of each with the error on the structural 
equation. 


We apply it to our example with one potentially endogenous regressor, 
hi_empunion, instrumented by ssiratio. Then, 


. * Robust DWH test of endogeneity implemented by estat endogenous 
. ivregress 2sls ldrugexp (hi_empunion = ssiratio) $x2list, vce(robust) 

Instrumental variables 2SLS regression          Number of obs     =     10,068 
                                                Wald chi2(6)      =    2039.84 
                                                Prob > chi2       =     0.0000 
                                                R-squared         =     0.0831 
                                                Root MSE          =     1.3038 

------------------------------------------------------------------------------ 
             |               Robust 
    ldrugexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
 hi_empunion |  -.8115015   .1884024    -4.31   0.000    -1.180763   -.4422396 
      totchr |   .4492135   .0100423    44.73   0.000      .429531     .468896 
         age |  -.0122415   .0027711    -4.42   0.000    -.0176727   -.0068102 
      female |  -.0140507   .0310339    -0.45   0.651    -.0748761    .0467747 
      blhisp |  -.2089573   .0383529    -5.45   0.000    -.2841276   -.1337869 
        linc |   .0815472   .0207576     3.93   0.000     .0408631    .1222314 
       _cons |   6.692548   .2449266    27.32   0.000     6.212501    7.172595 
------------------------------------------------------------------------------ 
Instrumented: hi_empunion 
 Instruments: totchr age female blhisp linc ssiratio 

. estat endogenous 

  Tests of endogeneity 
  H0: Variables are exogenous 

  Robust score chi2(1)            =  24.9812  (p = 0.0000) 
  Robust regression F(1,10060)    =  24.8525  (p = 0.0000) 


The last line of output is the robustified DWH test and leads to strong 
rejection of the null hypothesis that hi_empunion is exogenous. We conclude 
that it is endogenous. 


We obtain exactly the same test statistic when we manually perform the 
robustified DWH test. We have 


. * Robust DWH test of endogeneity implemented manually 
. qui regress hi_empunion ssiratio $x2list 

. qui predict vihat, resid 

. qui regress ldrugexp hi_empunion $x2list vihat, vce(robust) 

. test vihat 

 ( 1)  vihat = 0 

       F(  1, 10060) =   24.85 
            Prob > F =    0.0000 


The estat endogenous command uses the standard errors specified in 
the preceding ivregress command. Thus, for a cluster-robust robustified 
Hausman test, we use estat endogenous following an ivregress, 
vce(cluster clustvar) command. 
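

A minimal sketch of the cluster-robust variant follows, assuming the data and the global macro x2list of section 7.4.2 are in memory. Clustering on age is used purely for illustration, as in section 7.4.5, and is not a claim that errors are actually clustered by age.

```stata
* Cluster-robust test of endogeneity: refit with clustered standard errors,
* then run the postestimation test
quietly ivregress 2sls ldrugexp (hi_empunion = ssiratio) $x2list, vce(cluster age)
estat endogenous
```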


7.4.7 Control function estimator 


The immediately preceding OLS regression, one that adds the first-stage 
residuals as a regressor, remarkably leads to parameter estimates identical to 
the IV estimates. We have 


. * Control function estimator adds first-stage residual as regressors 
. regress ldrugexp hi_empunion $x2list vihat, vce(robust) noheader 

------------------------------------------------------------------------------ 
             |               Robust 
    ldrugexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
 hi_empunion |  -.8115016   .1795801    -4.52   0.000    -1.163514   -.4594888 
      totchr |   .4492135   .0094936    47.32   0.000     .4306041    .4678229 
         age |  -.0122415   .0026367    -4.64   0.000      -.01741   -.0070729 
      female |  -.0140507   .0293244    -0.48   0.632    -.0715323    .0434309 
      blhisp |  -.2089573   .0364257    -5.74   0.000     -.280359   -.1375555 
        linc |   .0815472   .0195157     4.18   0.000     .0432925     .119802 
       vihat |   .9039115   .1813179     4.99   0.000     .5484921    1.259331 
       _cons |   6.692548   .2334768    28.66   0.000     6.234887    7.150209 
------------------------------------------------------------------------------ 


The coefficient estimates, though not the standard errors, are identical to 
ivregress output given in section 7.4.4. Correct standard errors for this 
example that adjust for the first-stage estimation of vihat are obtained in 
section 13.3.11. Alternatively, one can jointly bootstrap both steps. 
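

One way to bootstrap both steps jointly is sketched below. The program name cf2step and the variable name v1hat_cf are hypothetical, the number of replications and the seed are arbitrary choices, and the sketch assumes the data and the global macro x2list of section 7.4.2 are in memory. It is an illustration of the idea rather than the method used in section 13.3.11.

```stata
* Wrap both steps in one program so that each bootstrap resample
* re-estimates the first stage as well as the second stage
capture program drop cf2step
program define cf2step, eclass
    version 17
    capture drop v1hat_cf
    * Step 1: first-stage OLS and its residual
    regress hi_empunion ssiratio $x2list
    predict double v1hat_cf, resid
    * Step 2: structural equation augmented with the first-stage residual
    regress ldrugexp hi_empunion $x2list v1hat_cf
end

* Bootstrap the second-step coefficients over resamples of observations
bootstrap _b, reps(400) seed(10101) nodots: cf2step
```

Because the first stage is refit in every replication, the bootstrap standard errors reflect the sampling variability of the generated regressor, which the second-step standard errors reported above ignore.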


This estimator, one based on (7.11), is called a control function estimator 
because the first-stage residual $\hat v_{1i}$ is added as a regressor to control for the 
endogenous variable $y_{2i}$, which is also included in the regression. 


This alternative way to obtain IV estimates provides one method to 
control for endogeneity in some nonlinear models such as the probit and 
Poisson model. 


7.4.8 Tests of overidentifying restrictions 


The validity of an instrument cannot be tested in a just-identified model. But 
it is possible to test the validity of overidentifying instruments in an 
overidentified model provided that the parameters of the model are estimated 
using optimal GMM. The same test has several names, including the 
overidentifying restrictions test, the overidentified test, Hansen’s test, 
Sargan’s test, and the Hansen–Sargan test. 


The starting point is the fitted value of the criterion function (7.10) after 
optimal GMM; that is, 
$Q(\hat\beta) = \{(1/N)(y - X\hat\beta)'Z\}\hat S^{-1}\{(1/N)Z'(y - X\hat\beta)\}$. If the population 
moment conditions $E\{Z'(y - X\beta)\} = 0$ are correct, then $Z'(y - X\hat\beta) \simeq 0$, 
so $Q(\hat\beta)$ should be close to 0. Under the null hypothesis that all instruments 
are valid, it can be shown that $Q(\hat\beta)$ has an asymptotic chi-squared 
distribution with degrees of freedom equal to the number of overidentifying 
restrictions. 


Large values of $Q(\hat\beta)$ lead to rejection of $H_0$: $E\{Z'(y - X\beta)\} = 0$. 
Rejection is interpreted as indicating that at least one of the instruments is 
not valid. Tests can have power in other directions, however, as emphasized 
in section 3.7.6. It is possible that rejection of $H_0$ indicates that the model 
$X\beta$ for the conditional mean is misspecified. 


Going the other way, the test is one of validity only of the 
overidentifying instruments, so failure to reject $H_0$ does not guarantee that 
all the instruments are valid. In particular, with one endogenous regressor, 
the test is one of whether the excess instruments are valid, conditional on the 
untestable assumption that at least one of the instruments is valid. 


The test is implemented with the estat overid postestimation command 
following the ivregress gmm command for an overidentified model. We do 
so for the optimal GMM estimator with heteroskedastic errors and 
instruments, ssiratio and multlc. The example below implements estat 
overid under the overidentifying restriction. 


. * Test of overidentifying restrictions following ivregress gmm 
. qui ivregress gmm ldrugexp (hi_empunion = ssiratio multlc) $x2list, 
> wmatrix(robust) 

. estat overid 

  Test of overidentifying restriction: 

  Hansen’s J chi2(1) = 1.65198 (p = 0.1987) 


The test statistic is $\chi^2(1)$ distributed because the number of overidentifying 
restrictions equals 2 - 1 = 1. Because p > 0.05, we do not reject the null 
hypothesis that the overidentifying restriction is valid. 
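

The reported p-value and the corresponding 5% critical value can be checked with Stata's chi-squared functions. This small sketch only reuses the statistic printed in the output above.

```stata
* Upper-tail probability of a chi-squared(1) variate at the reported J statistic
display chi2tail(1, 1.65198)

* 5% critical value of the chi-squared(1) distribution (about 3.84)
display invchi2tail(1, 0.05)
```

The first value should agree with the p-value reported by estat overid, and the J statistic of 1.65 lies well below the critical value.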


A similar test using all four available instruments yields 


. * Test of overidentifying restrictions following ivregress gmm 
. ivregress gmm ldrugexp (hi_empunion = ssiratio lowincome multlc firmsz) 
> $x2list, wmatrix(robust) 

Instrumental variables GMM regression           Number of obs     =     10,068 
                                                Wald chi2(6)      =    2053.18 
                                                Prob > chi2       =     0.0000 
                                                R-squared         =     0.0894 
GMM weight matrix: Robust                       Root MSE          =     1.2992 

------------------------------------------------------------------------------ 
             |               Robust 
    ldrugexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
 hi_empunion |  -.7809169   .1690563    -4.62   0.000    -1.112261   -.4495727 
      totchr |   .4488934   .0099905    44.93   0.000     .4293123    .4684745 
         age |  -.0120102   .0026498    -4.53   0.000    -.0172037   -.0068166 
      female |  -.0087044   .0300802    -0.29   0.772    -.0676604    .0502516 
      blhisp |  -.2014736    .037838    -5.32   0.000    -.2756347   -.1273125 
        linc |   .0783428   .0195822     4.00   0.000     .0399624    .1167232 
       _cons |   6.669519   .2320541    28.74   0.000     6.214701    7.124336 
------------------------------------------------------------------------------ 
Instrumented: hi_empunion 
 Instruments: totchr age female blhisp linc ssiratio lowincome multlc firmsz 

. estat overid 

  Test of overidentifying restriction: 

  Hansen’s J chi2(3) = 11.0269 (p = 0.0116) 


Now we reject the null hypothesis at level 0.05, but we do not reject at level 
0.01, though the decision is marginal. Despite this rejection, the coefficient 
of the endogenous regressor hi_empunion is -0.781, not all that different 
from the estimate when ssiratio is the only instrument. 


7.4.9 IV estimation using the eregress command 


The eregress command, detailed in section 23.7, is a command for 
extended regression that fits a linear regression while controlling for one or 
more of the following complications: endogenous regressor, endogenous 
sample selection, and nonrandom treatment assignment. 


The command provides maximum likelihood (ML) estimates under the 
assumption of normally distributed errors. With an endogenous regressor, the 
eregress command yields the LIML estimator presented in section 7.9.1. In 
the just-identified case, the LIML and 2SLS estimators are equivalent 
because they reduce to the IV estimator. 


The following command yields the same IV results for the example of 
section 7.4.4 as those obtained using the ivregress command because the 
model is just identified. 


. * IV estimation using the eregress command in a just-identified model 
. eregress ldrugexp $x2list, endogenous(hi_empunion = $x2list ssiratio) vce(robust) 

(output omitted) 


7.4.10 IV estimation with a binary endogenous regressor 


In our example, the endogenous regressor hi_empunion is a binary variable. 
The IV methods we have used are valid under the assumption that 
$E(u_i \mid z_i) = 0$, which in our example means that the error in the structural 
equation for ldrugexp has a mean of zero conditional on the exogenous 
regressors $x_1$ in (7.5) and on any instruments $x_2$ such as ssiratio. The 
reasonableness of this assumption does not change when the endogenous 
regressor hi_empunion is binary. 


An alternative approach to linear IV adds more structure to explicitly 
account for the binary nature of the endogenous regressor by changing the 
first-stage model to be a latent-variable model similar to the probit model 
presented in section 17.2. Let $y_1$ depend in part on $y_2$, a binary endogenous 
regressor. We introduce an unobserved latent variable, $y_2^*$, that determines 
whether $y_2 = 1$ or 0. The models (7.5) and (7.6) become 


$$\begin{aligned}
y_{1i} &= \beta_1 y_{2i} + x_{1i}'\beta_2 + u_i \\
y_{2i}^* &= x_{1i}'\pi_{1j} + x_{2i}'\pi_{2j} + v_i \qquad\qquad (7.12) \\
y_{2i} &= \begin{cases} 1 & \text{if } y_{2i}^* > 0 \\ 0 & \text{otherwise} \end{cases}
\end{aligned}$$


The errors $(u_i, v_i)$ are assumed to be correlated bivariate normal with 
$\mathrm{Var}(u_i) = \sigma^2$, $\mathrm{Var}(v_i) = 1$, the standard normalization for a probit model, 
and $\mathrm{Cov}(u_i, v_i) = \rho\sigma^2$. 


The binary endogenous regressor y2 can be viewed as a treatment 
indicator. If y2 = 1, we receive treatment (here access to employer- or 
union-provided insurance), and if y2 = 0, we do not receive treatment. The 
Stata documentation refers to (7.12) as the treatment-effects model, though 
the treatment-effects literature is vast and encompasses many models and 
methods. 


The etregress command fits (7.12) by ML, the default, or by two-step 
methods. The basic syntax is 


etregress depvar [indepvars], treat(depvar_t = indepvars_t) [twostep | cfunction]


where depvar is $y_1$, indepvars is $\mathbf{x}_1$, depvar_t is $y_2$, and indepvars_t is $\mathbf{x}_1$
and $\mathbf{x}_2$.


We apply this estimator to the exactly identified setup of section 7.4.4, 
with the single instrument ssiratio. We obtain 


. * Regression with a binary endogenous regressor
. etregress ldrugexp $x2list, treat(hi_empunion = ssiratio $x2list) vce(robust) 


Iteration 0: log pseudolikelihood = -22666.718 
Iteration 1: log pseudolikelihood = -22655.915 
Iteration 2: log pseudolikelihood = -22654.399 
Iteration 3: log pseudolikelihood = -22654.396 
Iteration 4: log pseudolikelihood = -22654.396 


Linear regression with endogenous treatment     Number of obs =  10,068
Estimator: Maximum likelihood                   Wald chi2(6)  = 1876.40
Log pseudolikelihood = -22654.396               Prob > chi2   =  0.0000

-------------------------------------------------------------------------------
              |               Robust
              | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
ldrugexp      |
       totchr |   .4550432   .0108265    42.03   0.000     .4338237    .4762627
          age |  -.0179835   .0024374    -7.38   0.000    -.0227606   -.0132064
       female |    -.06001   .0303896    -1.97   0.048    -.1195725   -.0004475
       blhisp |  -.2479348    .039701    -6.25   0.000    -.3257474   -.1701222
         linc |   .1267941   .0186249     6.81   0.000       .09029    .1632981
1.hi_empunion |  -1.384199   .1029986   -13.44   0.000    -1.586072   -1.182325
        _cons |   7.240257   .2033296    35.61   0.000     6.841738    7.638776
--------------+----------------------------------------------------------------
hi_empunion   |
     ssiratio |  -.5506714   .0400378   -13.75   0.000     -.629144   -.4721987
       totchr |   .0405104   .0100773     4.02   0.000     .0207593    .0602616
          age |  -.0242543   .0020402   -11.89   0.000    -.0282531   -.0202556
       female |  -.1924384   .0262529    -7.33   0.000    -.2438931   -.1409837
       blhisp |  -.1928805   .0358713    -5.38   0.000    -.2631871    -.122574
         linc |   .1278699   .0161179     7.93   0.000     .0962794    .1594604
        _cons |   1.508993   .1649894     9.15   0.000      1.18562    1.832367
--------------+----------------------------------------------------------------
      /athrho |   .7639572   .0612615    12.47   0.000     .6438869    .8840276
     /lnsigma |   .3460358   .0197529    17.52   0.000     .3073209    .3847507
--------------+----------------------------------------------------------------
          rho |   .6434019   .0359013                      .5675403    .7084313
        sigma |   1.413453   .0279197                      1.359777    1.469248
       lambda |   .9094185   .0672652                      .7775812    1.041256
-------------------------------------------------------------------------------
Wald test of indep. eqns. (rho = 0): chi2(1) = 155.51    Prob > chi2 = 0.0000


The key output is the first set of regression coefficients. Compared with IV
estimates in section 7.4.4, the coefficient of hi_empunion has increased in
absolute value from -0.812 to -1.384, and the standard error has fallen
greatly from 0.188 to 0.103. The coefficients and standard errors of the
exogenous regressors change much less.


The quantities rho, sigma, and lambda denote, respectively, $\rho$, $\sigma$, and $\rho\sigma$.
To ensure that $\sigma > 0$ and $|\rho| < 1$, etregress estimates the transformed
parameters $0.5 \times \ln\{(1 + \rho)/(1 - \rho)\}$, reported as /athrho, and $\ln\sigma$,
reported as /lnsigma. If the error correlation $\rho = 0$, then the errors $u$ and $v$
are independent, and there is no endogeneity problem. The last line of output
clearly rejects $H_0$: $\rho = 0$, so hi_empunion is indeed an endogenous
regressor.
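

The link between the transformed and untransformed parameters can be checked by hand. The following is a minimal sketch, not part of the original example: it copies the /athrho and /lnsigma estimates from the output above into scalars (the scalar names are our own) and inverts the two transformations, using the fact that the inverse of $0.5 \times \ln\{(1 + \rho)/(1 - \rho)\}$ is the hyperbolic tangent.

```stata
* Back-transform the ancillary parameters reported by etregress
* (values copied from the /athrho and /lnsigma rows of the output above)
scalar athrho  = .7639572
scalar lnsigma = .3460358
scalar rho     = tanh(athrho)     // inverts 0.5*ln{(1+rho)/(1-rho)}
scalar sigma   = exp(lnsigma)     // inverts ln(sigma)
scalar lambda  = rho*sigma        // lambda is defined as rho*sigma
display "rho = " rho "   sigma = " sigma "   lambda = " lambda
```

Up to rounding, the displayed values should agree with the rho, sigma, and lambda rows of the etregress output.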


Further analysis of this example, one in a treatment-effects framework, is 
provided in section 25.4.2. 


Which method is better: regular IV or the latent variable model (7.12)?
Intuitively, (7.12) imposes more structure. The benefit may be increased
precision of estimation, as in this example. The cost is a greater chance of
misspecification error. If the errors are heteroskedastic, as is likely, the IV
estimator remains consistent, but the treatment-effects estimator given here
becomes inconsistent.


More generally, when nonlinear models, such as binary-data models and
count-data models, include endogenous regressors, there is
more than one approach to model estimation; see also sections 17.9, 20.7,
and 23.7.


7.4.11 IV as local average treatment-effects estimator 


When both the endogenous regressor $y_2$ and its single instrument $x_2$ are
binary, the IV estimator can be reinterpreted as a local average
treatment-effects (LATE) estimator that allows the effect of the endogenous regressor to
vary across individuals. Then the IV estimator gives the average effect of the
binary endogenous regressor, where the average is across a subgroup of the
population called compliers and it is assumed that there are no defiers.
Compliers and defiers constitute notional, not observable, categories; they 
are required to identify the treatment effect, that is, to interpret which group 
the treatment effect refers to. 


For example, suppose we are interested in the effect of attending a
charter school on an outcome $y_1$ such as exam performance. Let $y_2 = 1$ if a
student attends a charter school and $y_2 = 0$ otherwise, and let the instrument
$x_2 = 1$ if the student wins a lottery to attend a charter school and $x_2 = 0$
otherwise. Then a complier is someone who attends the charter school only
if he or she wins the lottery; that is, $y_2 = 1$ if $x_2 = 1$ and $y_2 = 0$ if $x_2 = 0$.
And a defier is someone who attends the charter school only if he or she
loses the lottery; that is, $y_2 = 1$ if $x_2 = 0$ and $y_2 = 0$ if $x_2 = 1$.


The formal definition of compliers and defiers, and this very important
interpretation and use of IV in the treatment evaluation literature, is deferred
to section 25.5. It can explain why different binary instruments may lead to
different IV estimates, even asymptotically; hence the use of the adjective
“local”.


7.5 Weak instruments 


In this section and the subsequent two sections, we consider inference for IV
estimation when the correlation between instruments and endogenous
regressors is weak enough that standard asymptotic theory may provide a
poor guide to actual finite-sample distributions. This can happen even in
samples that have many observations and even when the chosen instruments
are valid, so IV estimators remain consistent. The discussion is lengthy. For
the common case of a just-identified model with a single endogenous
regressor, it is standard to use the Anderson-Rubin Wald test given in
section 7.7.


This section provides a brief overview of the essential issues, presents a
simulation exercise, and discusses finite-sample bias. Several
weak-instrument diagnostics and tests are presented in section 7.6, most notably
use of the first-stage $F$ statistic. Inferential methods when instruments are
weak are presented in section 7.7, most notably Anderson-Rubin Wald tests.


Any study using IV methods should recognize the potential for weak
instruments. The literature is vast; Andrews, Stock, and Sun (2019) provide 
a recent survey. A major complication is that the first-order asymptotic 
equivalence of estimators such as 2SLS and LIML and of Wald, Lagrange 
multiplier (LM), and likelihood-ratio (LR) tests disappears when instruments 
are weak. We focus on IV (or 2SLS) estimators and Wald tests because these 
are the most commonly used methods in practice. 


7.5.1 Weak instruments essentials 


When instruments are potentially weak, there are two general approaches. 
Empirical studies to date have used a two-step approach that first determines 
whether instruments are weak and then proceeds to use standard asymptotic 
theory if instruments are determined not to be weak. A theoretically
better grounded approach, one developed more recently, bases inference on an
alternative asymptotic theory that is valid regardless of whether instruments 
are weak. 


For simplicity, we focus on the most common case of a single
endogenous regressor. Then, using the model setup of section 7.3.2, we have
structural equation $y_1 = \beta_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_2 + u$ and first-stage equation
$y_2 = \mathbf{x}_1'\boldsymbol{\pi}_1 + \mathbf{x}_2'\boldsymbol{\pi}_2 + v$.


To see the weak instrument problem, note that if the only regressor is the
endogenous variable $y_2$ and there is only one instrument $x_2$, then $\hat{\beta}_{IV}$ is the
sample analog of $\mathrm{Cov}(x_2, y_1)/\mathrm{Cov}(x_2, y_2)$, which has a problematic distribution
when the denominator in the ratio, the covariance between the instrument
and the endogenous regressor, is too close to zero.
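

The ratio-of-covariances form of the simple IV estimator can be verified numerically. The following sketch uses simulated data of our own design (the variable names, seed, and parameter values are illustrative assumptions, not taken from the drug-expenditure example) and compares the ratio of sample covariances with the ivregress estimate from a model with an intercept.

```stata
* Simple IV as a ratio of sample covariances (simulated, illustrative data)
clear
set seed 10101
set obs 1000
generate x2 = rnormal()            // single instrument
generate v  = rnormal()
generate u  = rnormal() + 0.5*v    // u correlated with v, so y2 is endogenous
generate y2 = 0.5*x2 + v           // first-stage equation
generate y1 = 2*y2 + u             // structural equation with coefficient 2
quietly correlate y1 x2, covariance
scalar cov_y1x2 = r(cov_12)        // sample Cov(y1, x2)
quietly correlate y2 x2, covariance
scalar cov_y2x2 = r(cov_12)        // sample Cov(y2, x2)
display "Ratio of covariances = " cov_y1x2/cov_y2x2
quietly ivregress 2sls y1 (y2 = x2)
display "ivregress estimate   = " _b[y2]
```

The two displayed numbers should coincide. Reducing the first-stage coefficient 0.5 toward zero shrinks the denominator covariance and makes the ratio erratic across seeds, which is the weak instrument problem in its simplest form.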


The two-step method first tests the strength of the instruments $\mathbf{x}_2$ in the
first-stage model by computing the $F$ statistic for the hypothesis that $\boldsymbol{\pi}_2 = \mathbf{0}$.
If $F$ exceeds a threshold $F^*$, then the instrument is viewed as strong enough
that we can perform regular inference on the IV estimate of $\beta_1$ in the
structural equation. An initial popular choice of the threshold was $F^* = 10$,
which corresponds in the just-identified case to deciding that an instrument
is not weak if its $t$ statistic in the first-stage regression exceeds $\sqrt{10} = 3.16$.
This threshold of 10 is now felt to be too low to conclude that there is
no weak instrument problem. Section 7.6.4 presents more formal tests based
on the $F$ statistic.


The problem with this approach is that a nominal 5% test at the second 
stage has true size substantially larger than 5% because of the first-stage F 
test; see sections 7.6.4 and 7.7.1. 


The alternative, better approach uses asymptotic theory that is valid
regardless of whether instruments are weak. A standard method is the
Anderson-Rubin test. Substituting the first-stage equation into the structural
equation yields the reduced-form equation for $y_1$ as
$y_1 = \mathbf{x}_1'(\beta_1\boldsymbol{\pi}_1 + \boldsymbol{\beta}_2) + \mathbf{x}_2'\beta_1\boldsymbol{\pi}_2 + (u + \beta_1 v)$. Standard inference methods
can be applied to OLS estimates of this equation because the estimates do not
involve $y_2$. The Anderson-Rubin test of whether $\beta_1 = 0$ is a standard Wald
$F$ test of the joint statistical significance of the instruments $\mathbf{x}_2$; see
section 7.7.2.
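

The reduced-form logic of the Anderson-Rubin test can be sketched with two commands. The example below uses simulated data of our own design (all names and parameter values are illustrative assumptions); it is meant only to show the mechanics, with the dedicated commands deferred to section 7.7.

```stata
* Anderson-Rubin test of beta1 = 0 through the reduced form (simulated data)
clear
set seed 10101
set obs 1000
generate x1 = rnormal()                  // exogenous regressor
generate x2 = rnormal()                  // excluded instrument
generate v  = rnormal()
generate u  = rnormal() + 0.5*v          // endogeneity through corr(u,v)
generate y2 = 1 + 0.5*x1 + 0.1*x2 + v    // first stage with a weak instrument
generate y1 = 1 + 2*y2 + 0.5*x1 + u      // structural equation, beta1 = 2
regress y1 x1 x2, vce(robust)            // reduced form: y2 does not appear
test x2                                  // Wald F test of H0: beta1 = 0
```

Because the regression involves only exogenous variables, the usual robust inference applies no matter how small the first-stage coefficient on x2 is. The same idea extends to a null value b other than zero by using y1 - b*y2 as the dependent variable.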


The Anderson-Rubin approach can be applied using robust standard
errors and is optimal in the case of a just-identified model with a single
endogenous regressor. It has low power in overidentified models, however,
leading to alternative methods detailed in section 7.7 that are still being
developed.


7.5.2 Finite-sample properties of IV estimators 


Even when IV estimators are consistent, they are generally biased in finite
samples. This result has been formally established in overidentified models.
In just-identified models, the first moment of the IV estimator does not exist,
but for simplicity, we follow the literature and continue to use the term
“bias” in this case.


The finite-sample properties of IV estimators are complicated. However,
there are three cases in which it is possible to say something about the
finite-sample bias of IV; see Davidson and MacKinnon (2004, chap. 8.4).


First, when the number of instruments is very large relative to the sample
size and the first-stage regression fits very well, the IV estimator may
approach the OLS estimator and hence will be similarly biased. This case of
many instruments is not very relevant for cross-sectional microeconometrics
data, where usually few instruments are available, though it can be relevant
for panel-data IV estimators such as Arellano-Bond.


Second, when the correlation between the structural-equation error $u$ in
(7.5) and some components of the vector $\mathbf{v}$ in (7.6) of first-stage equation
errors is high, then asymptotic theory may be a poor guide to the
finite-sample distribution.


Third, if we have weak instruments in the sense that one or more of the
first-stage regressions have a poor fit, with $\boldsymbol{\pi}_{2j} \simeq \mathbf{0}$ in (7.6), then asymptotic
theory may also provide a poor guide to the finite-sample distribution of the
IV estimator, especially when the sample is small but possibly even if the
sample has thousands of observations.


In what follows, our main focus will be on the third source of
finite-sample bias of IV, that due to weak instruments. More precise definitions of
weak instruments are considered later in this section.


7.5.3 A Monte Carlo example 


To expand on the results stated in the preceding subsection, we now consider
a simple Monte Carlo example with equations for two endogenous variables
denoted $y_1$ and $y_2$, respectively, and an exogenous instrumental variable for
$y_2$, denoted $z$. Specifically, we have


$$
\begin{aligned}
y_1 &= \beta y_2 + u \\
y_2 &= \pi z + v
\end{aligned}
$$


where $(u, v)$ are random shocks. The parameter $\pi$ measures the strength
of the linear association between $y_2$ and $z$; small values of $\pi$ imply that $z$ is a
weak instrument for $y_2$. The (“causal”) parameter $\beta$ is the key object of
inference.


The reduced-form equation for $y_1$ is


$$
\begin{aligned}
y_1 &= \beta(\pi z + v) + u \\
    &= \beta\pi z + (\beta v + u)
\end{aligned}
$$


A key determinant of weak instrument bias is the concentration
parameter $\tau^2$, with general definition given below in (7.15). For the current
simulation setup,


$$
\tau^2 = N\pi \times E(z^2) \times \pi = N\pi^2 E(z^2)
$$


Weak instrument bias increases as $\tau^2$ decreases. The $F$ statistic from
first-stage regression of $y_2$ on $z$ is asymptotically equivalent to $\tau^2 + 1$.


We consider two cases. The first is a recursive model that results from
the noncorrelation assumption that $E(uv) = 0$. In this case, we may treat $y_2$
as exogenous when estimating the first equation, and $\beta$ can be consistently
estimated by OLS. Then the IV estimator is defined as
$\hat{\beta}_{IV} = \sum_i z_i y_{1i} / \sum_i z_i y_{2i}$, which is a consistent estimator but is not
efficient relative to $\hat{\beta}_{OLS}$.


In the second case, we assume that the random shocks $(u, v)$ are
contemporaneously correlated. In this case, $E(y_2 u) \neq 0$; hence, $\hat{\beta}_{OLS}$ is
inconsistent (asymptotically biased), and $\hat{\beta}_{IV}$ is consistent (asymptotically
unbiased). However, in small samples, $\hat{\beta}_{IV}$ may also be biased.


To throw additional light on the source and nature of the bias, we carry
out a simple simulation experiment. The data are generated under the
following assumptions: $\beta = 2$, the sample size $N$ equals 100 (small sample)
or 1,000 (large sample), and $\pi$ equals 0.1 (“weak” instrument) or equals 0.5
(“strong” instrument). In the first set of simulations, $u$ and $v$ are independent
standard normals, and in the second set, $(u, v)$ are drawn from a
bivariate normal distribution with correlation of 0.5. The number of
replications in all simulations is set at 1,000.

The simulation program, set up to provide summary statistics for the
parameter $\beta$, is given below. Its structure is similar to the program in
section 5.6.1. The program can be easily changed to simulate alternative
settings of $\beta$, $\pi$, $\mathrm{Cor}(u, v)$, and $N$.


. * Program for weak instruments simulation
. clear all

. global numsims 1000

. program weakivsim, rclass
    version 17
    drop _all
    set obs $numobs                          // Set sample size
    generate z = rnormal()
    generate v = rnormal()
    generate u = rnormal() + $ecor*v         // corr(u,v)=ecor/sqrt(1+ecor^2)
    generate y2 = 0 + $pi*z + v              // Set instrument strength
    generate y1 = 0 + 2*$pi*z + (u+2*v)      // beta = 2
    regress y1 y2
    return scalar bols = _b[y2]
    return scalar seols = _se[y2]
    return scalar tols = (_b[y2]-2)/_se[y2]
    ivregress 2sls y1 (y2=z)
    return scalar biv = _b[y2]
    return scalar seiv = _se[y2]
    return scalar tiv = (_b[y2]-2)/_se[y2]
    regress y2 z                             // First-stage regression
    return scalar Fiv = e(F)
. end



7.5.4 Simulation results with uncorrelated errors 


We begin with a large sample. Then both OLS and IV are consistent, and OLS is
fully efficient given this simulation design.


. * Simulation 1: beta=2, large sample, weak IV, independent errors 
. global numobs 1000 // Large sample 


. global pi 0.1 // Weak IV 
. global ecor 0.0 // correlation(u,v)=0 


. simulate bols=r(bols) seols=r(seols) tols=r(tols) biv=r(biv) seiv=r(seiv) 
> tiv=r(tiv) Fiv=r(Fiv), seed(10101) reps($numsims) nolegend nodots: 
> weakivsim 


. mean bols seols biv seiv Fiv 


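Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
        bols |   2.000625   .0010369      1.998591     2.00266
       seols |   .0314616   .0000309       .031401    .0315222
         biv |   1.989567   .0165767      1.957038    2.022096
        seiv |   .5107373    .061441      .3901691    .6313055
         Fiv |   11.06388   .2105104      10.65079    11.47698
--------------------------------------------------------------
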
. display "Concentration parameter = " $numobs*$pi*1*$pi // As E[z^2]=1 


Concentration parameter = 10 


Both estimators appear to be unbiased, with the average value close to 2.0
across the simulations. The IV estimator is much less efficient, with an
average standard error that is 16 times (= 0.511/0.031) that of OLS. So a
weak instrument has greatly reduced efficiency but has not induced
finite-sample bias problems with $N = 1000$.


Next, we move to a smaller sample, reducing N from 1,000 to 100. 


. * Simulation 2: beta=2, small sample, weak IV, independent errors 
. global numobs 100 // Small sample 


. global pi 0.1 // Weak IV 
. global ecor 0.0 // correlation(u,v)=0 


. simulate bols=r(bols) seols=r(seols) tols=r(tols) biv=r(biv) seiv=r(seiv)
> tiv=r(tiv) Fiv=r(Fiv), seed(10101) reps($numsims) nolegend nodots: 
> weakivsim 


. mean bols seols biv seiv Fiv 


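Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
        bols |   2.000745     .00311      1.994642    2.006847
       seols |   .1010183   .0003322      .1003665    .1016701
         biv |   2.374817   .3618055      1.664832    3.084803
        seiv |   137.1123   43.71515      51.32823    222.8963
         Fiv |   2.132318   .0862476      1.963071    2.301565
--------------------------------------------------------------
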
. display "Concentration parameter = " $numobs*$pi*1*$pi // As E[z^2]=1 


Concentration parameter = 1 


The OLS estimator remains unbiased though, as expected, the average
standard error is $\sqrt{10}$ times larger because the sample is one-tenth as large.
The IV estimator is now an extremely noisy estimator with average standard
error of 137 and very wide 95% simulation interval of [1.66, 3.08], though
this does include the DGP value of 2.


We repeat this small-sample simulation with $\pi$ increased from 0.1 to 0.5.
We obtain 


. * Simulation 3: beta=2, small sample, strong IV, independent errors 
. global numobs 100 // Small sample 


. global pi 0.5 // Strong IV 
. global ecor 0.0 // correlation(u,v)=0
. simulate bols=r(bols) seols=r(seols) tols=r(tols) biv=r(biv) seiv=r(seiv)
> tiv=r(tiv) Fiv=r(Fiv), seed(10101) reps($numsims) nolegend nodots:
> weakivsim


. mean bols seols biv seiv Fiv 


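Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
        bols |   2.000864   .0028334      1.995304    2.006424
       seols |   .0906651   .0002983      .0900798    .0912503
         biv |   2.000488   .0073413      1.986082    2.014894
        seiv |   .2151584   .0020243      .2111861    .2191308
         Fiv |   26.79966   .3818293      26.05038    27.54894
--------------------------------------------------------------
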
. display "Concentration parameter = " $numobs*$pi*1*$pi // As E[z^2]=1 


Concentration parameter = 25 


Now the IV estimator is well behaved with average closely centered on 2.0
and a small average standard error that is only 2.4 times (= 0.215/0.091)
that of OLS, so the efficiency loss is not as great.


7.5.5 Simulation results with correlated errors 


We next present simulations where the OLS estimator is inconsistent because
of correlation of the errors, leading to $y_2$ becoming an endogenous regressor
in the $y_1$ equation. The IV estimator remains consistent but is no longer
centered on 2 in small samples.


We first consider a large sample, with weak instrument and error 
correlation of 0.5. We obtain 


. * Simulation 4: beta=2, large sample, weak IV, correlated errors 
. global numobs 1000 // Large sample 


. global pi 0.1 // Weak IV 
. global ecor 0.5775 // Implies correlation(u,v)=.5775/sqrt(1+.5775^2)=0.5


. simulate bols=r(bols) seols=r(seols) tols=r(tols) biv=r(biv) seiv=r(seiv)
> tiv=r(tiv) Fiv=r(Fiv), seed(10101) reps($numsims) nolegend nodots: 
> weakivsim 


. mean bols seols biv seiv Fiv 


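Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
        bols |   2.572409   .0010378      2.570372    2.574445
       seols |   .0315148   .0000309      .0314541    .0315755
         biv |   1.869067   .0299957      1.810205    1.927929
        seiv |   .7937734   .2411204      .3206129    1.266934
         Fiv |   11.06388   .2105104      10.65079    11.47698
--------------------------------------------------------------
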
. display "Concentration parameter = " $numobs*$pi*1*$pi // As E[z^2]=1 


Concentration parameter = 10 


As expected, the OLS estimator is inconsistent. It is centered a long way from 
2.0, with average value of 2.572 and average standard error of 0.032. (This 
simulation shows only bias, but if we make the sample size very large, the 
OLS estimates are still a long way from 2.0.) 


The IV estimator is closer to being centered on 2.0, but the 95%
simulation interval of [1.810, 1.928] does not include 2.0. With both
appreciable endogeneity and a weak instrument, the IV estimator displays
bias even with $N = 1000$.


Finally, we verify that IV estimation works well with appreciable
endogeneity if the instrument is strong. We increase $\pi$ from 0.1 to 0.5 and,
as the settings below show, use the small sample with $N = 100$.


. * Simulation 5: beta=2, small sample, strong IV, correlated errors 
. global numobs 100 // Small sample 


. global pi 0.5 // Strong IV 
. global ecor 0.5775 // Implies correlation(u,v)=.5775/sqrt(1+.5775^2)=0.5


. simulate bols=r(bols) seols=r(seols) tols=r(tols) biv=r(biv) seiv=r(seiv)
> tiv=r(tiv) Fiv=r(Fiv), seed(10101) reps($numsims) nolegend nodots: 
> weakivsim 


. mean bols seols biv seiv Fiv 


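Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
        bols |   2.461669   .0029383      2.455903    2.467435
       seols |   .0936148   .0003095      .0930074    .0942222
         biv |   1.974495   .0088162      1.957195    1.991796
        seiv |   .2533962   .0033324      .2468569    .2599356
         Fiv |   26.79966   .3818293      26.05038    27.54894
--------------------------------------------------------------
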
. display "Concentration parameter = " $numobs*$pi*1*$pi // As E[z^2]=1 


Concentration parameter = 25 


The OLS estimator is again clearly inconsistent. The IV estimator is close to
being centered on 2.0 with 95% simulation interval of [1.957, 1.992].


These simulation results suggest that with weak instruments, the IV
estimator is biased, which will in turn lead to tests of incorrect size and
confidence intervals with incorrect coverage. The bias increases as the
concentration parameter decreases.


For a single endogenous regressor, the concentration parameter defined
below in (7.15) is asymptotically equal to $F - 1$, where $F$ is the first-stage
$F$ statistic. In simulations 1-5, the design sets the concentration parameter
to, respectively, 10, 1, 25, 10, and 25, and the simulation averages of the
corresponding $F$ statistics were, respectively, 11.06, 2.13, 26.80, 11.06, and
26.80.


7.5.6 The first-stage F statistic 


The most obvious diagnostic for whether an instrument is weak is to test 
whether it is weakly correlated with the endogenous regressor. 


For IV estimation that uses more than one instrument, we can consider
the joint correlation of the endogenous regressor with the several
instruments. Possible measures of this correlation are $R^2$ from regression of
the endogenous regressor $y_2$ on the several instruments $\mathbf{x}_2$, and the $F$
statistic for test of overall fit in this regression. Low values of $R^2$ or $F$ are
indicative of weak instruments.


This analysis neglects the presence of the structural-model exogenous
regressors $\mathbf{x}_1$ in the first-stage regression (7.6) of $y_2$ on both $\mathbf{x}_2$ and $\mathbf{x}_1$. If
the instruments $\mathbf{x}_2$ add little extra to explaining $y_2$ after controlling for $\mathbf{x}_1$,
then the instruments are weak.


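One commonly used diagnostic is therefore the $F$ statistic for joint
significance of the instruments $\mathbf{x}_2$ in first-stage regression of the endogenous
regressor $y_2$ on $\mathbf{x}_2$ and $\mathbf{x}_1$. This is a test that $\boldsymbol{\pi}_2 = \mathbf{0}$ in (7.6). When we
collect all observations, the first-stage model is $\mathbf{y}_2 = \mathbf{X}_1\boldsymbol{\pi}_1 + \mathbf{X}_2\boldsymbol{\pi}_2 + \mathbf{v}$.
After partialing out the effect of $\mathbf{x}_1$ on $y_2$ and of $\mathbf{x}_1$ on $\mathbf{x}_2$, we reduce the
first-stage model to


$$
\tilde{\mathbf{y}}_2 = \tilde{\mathbf{X}}_2\boldsymbol{\pi}_2 + \tilde{\mathbf{v}} \tag{7.13}
$$


where $\tilde{y}_{2i}$ are the residuals from OLS of $y_{2i}$ on $\mathbf{x}_{1i}$ and $\tilde{\mathbf{x}}_{2i}$ are the residuals from
OLS of $\mathbf{x}_{2i}$ on $\mathbf{x}_{1i}$.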


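Then with i.i.d. errors, a test of $\boldsymbol{\pi}_2 = \mathbf{0}$ when $y_2$ is scalar is based on

$$
F = \frac{\hat{\boldsymbol{\pi}}_2'\tilde{\mathbf{X}}_2'\tilde{\mathbf{X}}_2\hat{\boldsymbol{\pi}}_2 / K_2}{\hat{\sigma}_v^2} \tag{7.14}
$$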


This is called the first-stage F statistic. Section 7.6.4 presents tests of 
whether F is low enough to indicate that instruments are likely to be weak. 
Section 7.7.1 discusses the limitations of using conventional inference after 
first screening on the F statistic. 
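

In practice, the first-stage $F$ statistic under i.i.d. errors can be obtained by fitting the first-stage regression and testing the excluded instruments jointly. The sketch below uses simulated data of our own design (the variable names x2a and x2b and all parameter values are illustrative assumptions) with one exogenous regressor and two instruments.

```stata
* First-stage F statistic for joint significance of the instruments
clear
set seed 10101
set obs 1000
generate x1  = rnormal()                      // included exogenous regressor
generate x2a = rnormal()                      // first excluded instrument
generate x2b = rnormal()                      // second excluded instrument
generate y2  = 1 + 0.5*x1 + 0.1*x2a + 0.1*x2b + rnormal()
regress y2 x1 x2a x2b                         // first-stage regression
test x2a x2b                                  // H0: coefficients on x2a, x2b are 0
display "First-stage F = " r(F)
```

Note that the test is on the instruments only. The overall $F$ statistic reported in the regress header also reflects the explanatory power of x1 and so is not the relevant diagnostic.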


7.5.7 The first-stage F statistic and 2SLS bias 


The concentration parameter

$$
\tau^2 = \frac{\boldsymbol{\pi}_2'\tilde{\mathbf{X}}_2'\tilde{\mathbf{X}}_2\boldsymbol{\pi}_2}{\sigma_v^2} \tag{7.15}
$$

is the population analog of the first-stage $F$ statistic given in (7.14), aside
from division by the number of instruments ($K_2$). It can be shown that
$F - 1$ is an approximately unbiased estimate of $\tau^2/K_2$.


Theory shows that finite-sample bias of the IV estimator increases as $\tau^2$
decreases and that $\tau^2$ plays the role of sample size.


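To see this, consider the model $y_1 = \beta y_2 + u$, with $K_2$ instruments so
$y_2 = \mathbf{z}'\boldsymbol{\pi} + v$. Then the bias of the 2SLS estimator is approximately


$$
E(\hat{\beta}_{2SLS}) - \beta \simeq \mathrm{Cor}(u, v) \times \frac{\sigma_u}{\sigma_v} \times \frac{1}{(\tau^2/K_2) + 1} \tag{7.16}
$$


See Angrist and Pischke (2009, 207), where strictly speaking $E(\hat{\beta}_{2SLS})$
exists only if $K_2 \geq 2$.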


There are several key results. First, the bias of 2SLS increases with the
correlation of the two errors, but this is why we needed to do 2SLS in the first
place. Second, the bias of 2SLS is likely to get larger as more instruments are
added (unless they substantially increase the $F$ statistic). Third, the bias of
2SLS is greater the smaller the sample size $N$ because the concentration
parameter depends on $\tilde{\mathbf{X}}_2'\tilde{\mathbf{X}}_2$, which is the sum of $N$ terms. The preceding
simulations varied $N$ and $\mathrm{Cov}(u, v)$.


Result (7.16) shows the importance of the concentration parameter $\tau^2$,
which grows with sample size $N$ at rate $N$. The weak-instrument
asymptotics considered later derive the limit distribution of the 2SLS
estimator under the sequence $\pi = c/\sqrt{N}$ for $c$ a constant, rather than the usual
limit distribution of $\sqrt{N}(\hat{\beta}_{2SLS} - \beta)$ with $\pi$ fixed, because then $\tau^2$ does not
grow with the sample size.


As already noted, it can be shown that $F - 1$ is an approximately
unbiased estimate of $\tau^2/K_2$. So (7.16) implies


$$
\text{2SLS bias} \simeq \mathrm{Cor}(u, v) \times \frac{\sigma_u}{\sigma_v} \times \frac{1}{F}
$$


This shows the importance of the first-stage F statistic in signaling likely 
bias in the 2SLS estimator. 
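

To get a feel for the magnitudes, the approximation can be tabulated for a few values of $F$. The sketch below assumes, purely for illustration, an error correlation of 0.5 and equal error standard deviations; these values and the local macro names are our own choices, not estimates from any dataset.

```stata
* Approximate 2SLS bias = Cor(u,v) x (sigma_u/sigma_v) x (1/F)
local rho    0.5     // assumed Cor(u,v), illustrative
local sratio 1       // assumed sigma_u/sigma_v, illustrative
foreach F of numlist 2 5 10 25 100 {
    display "F = " %5.0f `F' "   approximate bias = " %6.4f `rho'*`sratio'/`F'
}
```

Under these assumptions, the approximate bias falls in proportion to $1/F$, so moving from $F = 2$ to $F = 10$ cuts it by a factor of five. The approximation presumes that the mean of the 2SLS estimator exists, so it should not be compared directly with simulation averages from a just-identified design.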


Finally, we note that throughout we have assumed that instruments are
uncorrelated with the structural equation error, so $\mathrm{Cor}(\mathbf{x}_2, u) = 0$ in
(7.5)-(7.6), because this is necessary for the 2SLS estimator to be consistent. If
instead there is even mild correlation between $\mathbf{x}_2$ and $u$, and additionally the
instruments are weak, so $\mathrm{Cor}(\mathbf{x}_2, y_2)$ is low, then the 2SLS estimator can be
even more inconsistent than the OLS estimator; see Bound, Jaeger, and
Baker (1995) or Cameron and Trivedi (2005, 106-107). This increased
fragility of the assumption that the instrument is valid when instruments are
weak is rarely emphasized.


7.5.8 Test size distortion 


When instruments are weak, standard asymptotic tests have size distortion 
because of the aforementioned bias of the 2SLS estimator and also because 
the asymptotic normal distribution provides a poor approximation to the 
distribution of the 2SLS estimator. 


So the standard tests on the coefficients of endogenous regressors in the 
structural model have the wrong size, and the associated confidence intervals 
have the wrong coverage probabilities. Section 7.7 presents valid inference 
methods for the coefficients when instruments are weak. Additionally, the 
usual implementations of Hausman tests for endogeneity and 
overidentification tests will also have the wrong size if instruments are weak. 


7.6 Diagnostics and tests for weak instruments 


There are several approaches for assessing whether there is a weak- 
instrument problem, based on analysis of the first-stage reduced-form 
equations and, particularly, the F statistic for the joint significance of the 
key instruments. 


Initial work, and the ivregress postestimation command estat 
firststage, presented in section 7.6.5, relies on the assumption of 
i.i.d. errors. Relaxing this assumption is the subject of ongoing research, and 
some of the methods presented here may be supplanted. 


Note that this section focuses solely on detecting whether instruments are 
weak. The subsequent section presents valid inference methods with 
potentially weak instruments. 


In particular, a 5% Wald test of structural model parameters has true size 
considerably greater than 5% if it is preceded by a screening test for weak 
instruments. If instruments are thought to be potentially weak, then current
research advises skipping the screening step and moving directly to the inference
methods presented in section 7.7.


7.6.1 Pairwise correlations 


The simplest method is to use pairwise correlations between any endogenous 
regressor and instruments. For our example, we have 


. * Correlations of endogenous regressor with instruments
. qui use mus207mepspresdrugs, clear


. correlate hi_empunion ssiratio lowincome multlc firmsz if linc!=.
(obs=10,068)

             | hi_emp~n ssiratio lowinc~e   multlc   firmsz
-------------+---------------------------------------------
 hi_empunion |   1.0000
    ssiratio |  -0.2243   1.0000
   lowincome |  -0.1164   0.2676   1.0000
      multlc |   0.1199  -0.1982  -0.0625   1.0000
      firmsz |   0.0374  -0.0464  -0.0082   0.1873   1.0000

. display "Concentration parameter = " $numobs*2*$pi/(2^2+2*2*$ecor+1)


Concentration parameter = 13.679891 


The gross correlations of instruments with the endogenous regressor
hi_empunion are low. This will lead to considerable efficiency loss using IV
compared with OLS. But the correlations, aside perhaps from 0.037 for
firmsz, are not so low as to immediately flag a problem of weak
instruments.


7.6.2 Partial R2 


We are interested in the correlation between the endogenous regressor and 
the instruments after controlling for the presence of other regressors. 


One diagnostic is the partial $R^2$ between $y_2$ and $\mathbf{x}_2$ after controlling for
$\mathbf{x}_1$. This is the $R^2$ from OLS regression (7.13). For structural equations with
more than one endogenous regressor and hence more than one first-stage
regression, a generalization called Shea’s partial $R^2$ has been proposed.
There is no consensus on how low a value indicates a problem, however,
so this diagnostic is less often used.
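

The partial $R^2$ can be computed directly from its definition as the $R^2$ of the residual-on-residual regression. The following sketch uses simulated data of our own design with a single instrument (all names and parameter values are illustrative assumptions).

```stata
* Partial R-squared of y2 on x2 after controlling for x1 (simulated data)
clear
set seed 10101
set obs 1000
generate x1 = rnormal()
generate x2 = 0.5*x1 + rnormal()             // instrument, correlated with x1
generate y2 = 1 + x1 + 0.2*x2 + rnormal()    // first-stage equation
quietly regress y2 x1
predict double y2res, residuals              // y2 with x1 partialed out
quietly regress x2 x1
predict double x2res, residuals              // x2 with x1 partialed out
quietly regress y2res x2res                  // residual-on-residual regression
display "Partial R-squared = " e(r2)
```

Both sets of residuals have mean zero by construction, so including the intercept in the last regression is harmless. After ivregress, the estat firststage command discussed in section 7.6.5 is the more convenient route to first-stage diagnostics.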


7.6.3 Tests of underidentification 


Identification requires that the instruments be correlated with the 
endogenous regressors. 


In the simplest case of a single endogenous variable, this can be tested
using the first-stage $F$ statistic defined in (7.14). The Cragg and
Donald (1993) minimum eigenvalue statistic is a generalization to multiple
endogenous variables, defined in Stock and Yogo (2005, 84) or in
[R] ivregress postestimation, that reduces to the first-stage $F$ statistic when
there is a single endogenous regressor.

Kleibergen and Paap (2006) proposed an alternative test statistic that
reduces to the Cragg-Donald test when errors are i.i.d. but can be applied to
models with robust standard errors (and to nonlinear GMM).


The problem with these tests is that they are too weak. In the simplest
case of a single endogenous regressor, single instrument, and $N$ large, they
amount to rejecting the null hypothesis of underidentification at level 0.05 if
the first-stage $F > 3.84$ or, equivalently, the single instrument has $|t| > 1.96$.


7.6.4 Tests of weak instruments 


Small values of the first-stage F statistic indicate weak instruments. There is 
no clear critical value for the F statistic defined in (7.14), because it depends 
on the criteria used to determine whether instruments are weak, on the 
number of endogenous variables, and on the number of overidentifying 
restrictions (excess instruments). 


Stock and Yogo (2005) proposed two approaches for testing whether 
instruments are weak based on the size of the first-stage F statistic when 
errors are i.i.d. Few results are available in the more common case of
heteroskedastic or clustered errors; a notable exception is the test of 
Montiel Olea and Pflueger (2013) given later in this subsection. Test 
rejection is interpreted as meaning there is a weak instrument problem. 


We focus on the 2SLS estimator. The tests have been adapted to other
estimators that control for endogenous regressors, most notably the LIML
estimator presented in section 7.9.1.


Relative bias of 2SLS to OLS test with i.i.d. errors 


The first approach considers the relative bias of 2SLS to OLS estimators of the
endogenous regressor coefficient(s) $\beta_1$, where 2SLS has finite-sample bias
that disappears asymptotically while OLS is always biased. For a single
endogenous regressor, the relative bias is


$$
B^2 = \frac{\{E(\hat{\beta}_{1,\mathrm{2SLS}}) - \beta_1\}^2}{\{E(\hat{\beta}_{1,\mathrm{OLS}}) - \beta_1\}^2}
$$


We choose a tolerable level of relative bias, say, 10% (B = 0.10), and test at
the 5% significance level the null hypothesis that instruments are weak in
the sense that $B \geq 0.10$. The critical values of F vary with B, the number
of endogenous regressors (m), and the number of instruments ($K_2$). A major
restriction is that this test can be applied only if there are at least two
overidentifying restrictions ($K_2 - m \geq 2$).


Wald test distortion with i.i.d. errors 


The second approach considers the size distortion in the two-sided Wald
test of $\beta_1 = 0$. For a single endogenous regressor, the true size of a
nominal 5% significance level Wald test is


$$
R = \Pr(|W_{\mathrm{2SLS}}| > 1.96), \quad \text{where } W_{\mathrm{2SLS}} = \hat{\beta}_{1,\mathrm{2SLS}} / \mathrm{se}(\hat{\beta}_{1,\mathrm{2SLS}})
$$


where R = 0.05 under standard asymptotics but R > 0.05 under
weak-instrument asymptotics. We choose a tolerable level of size distortion,
say, that true test size is at most 10%, and test at the 5% significance level
the null hypothesis that instruments are weak in the sense that
$R \geq 0.10$. The critical values of F vary with R, m, and $K_2 \geq m$.


Stock and Yogo critical values for tests 


Stock and Yogo (2005) provided tables of critical values for the two tests
under the assumption of i.i.d. errors. Test critical values vary with the
specified R or B, with the number of endogenous regressors (m), and with
the number of exclusion restrictions ($K_2$). Stata commands presented below
lead to output that includes these critical values.


The critical values are obtained using weak-instrument asymptotics, or
local-to-zero asymptotics, that set first-stage coefficients
$\pi = c/\sqrt{N}$, where c is a constant. Then the concentration parameter
in (7.15) remains constant as $N \to \infty$. The tests are conservative,
rejecting the null hypothesis of weak instruments less often than the nominal
level, because the distributions of both $B^2$ and R additionally depend on
$\rho = \mathrm{Cor}(u, v)$, and to avoid this complication, test critical
values are computed using the value of $\rho$ that maximizes $B^2$ or that
maximizes R.


For a single endogenous regressor, both tests reject the null hypothesis of
weak instruments if the F statistic exceeds a critical value tabulated by
Stock and Yogo (2005).


If there is more than one endogenous regressor, the relative bias $B^2$ is
defined with the quadratic forms in $\beta_1$, and
$R = \Pr\{W_{\mathrm{2SLS}} > \chi^2_{0.05}(m)\}$, where
$W_{\mathrm{2SLS}} = \hat{\beta}_1'\{\hat{V}(\hat{\beta}_1)\}^{-1}\hat{\beta}_1$
and $\hat{\beta}_1$ is the 2SLS estimate. The test statistic is then the
Cragg-Donald minimum eigenvalue statistic. As noted in section 7.6.3, the
minimum eigenvalue statistic was originally proposed by Cragg and
Donald (1993) to test nonidentification. Stock and Yogo (2005) presume
identification and interpret a low minimum eigenvalue to mean the
instruments are weak. So the null hypothesis is that the instruments are weak
against the alternative that they are strong.


The F less than 10 rule of thumb 


In the case of a single endogenous regressor, the critical values for B = 0.10
(a 10% maximum relative bias) range from 9.08 when $K_2 = 3$ to 11.32
when $K_2 = 30$. And the critical values for a 5% Wald test with R = 0.15 (a
10% maximum size distortion so true test size of at most 15%) range from
8.96 when $K_2 = 1$ to 44.78 when $K_2 = 30$.


These critical values for what may be a tolerable amount of bias or test 
size distortion are generally more than 10. For models with a single 
endogenous regressor, this has led to a rule of thumb, one suggested by 
Staiger and Stock (1997), that if F < 10, then instruments are very likely to 
be weak and standard asymptotics should not be applied. 


In practice, this rule is misinterpreted as meaning that if F > 10, then
conventional Wald tests can be applied. There are many problems with this
approach. In overidentified models, the Wald size-distortion test can have a
critical value much larger than 10. The tests for weak instruments are
restricted to models with i.i.d. errors, yet in practice microeconometrics
studies base inference on robust standard errors. And the size of a
subsequent 5% Wald test of $\beta_1$ is much higher than 5% because of
pretesting for weak instruments; see section 7.7.1.


Relative asymptotic bias of 2SLS with robust standard errors 


The preceding tests rely on the assumption of i.i.d. errors. This is very
limiting, especially with clustered data, where there can be great difference
between robust and nonrobust first-stage F statistics. Montiel Olea and
Pflueger (2013) provide a relative-bias test that is valid for first-stage tests
based on homoskedastic, heteroskedastic-robust, cluster-robust, or
HAC standard errors. The test is restricted to the case of only one endogenous
regressor and to 2SLS and LIML estimators, though this covers the majority of
IV applications. It can be applied to both just-identified and overidentified
models, whereas the Stock-Yogo relative-bias test requires at least two
overidentifying restrictions.


Define $F_{\mathrm{eff}}$ to be the "effective" F statistic that varies with
the method used to obtain the VCE of the estimator and whose formula is given in
Montiel Olea and Pflueger (2013). In the just-identified case with robust
standard errors, $F_{\mathrm{eff}}$ equals the first-stage F statistic computed
using robust standard errors, while with homoskedastic errors
$F_{\mathrm{eff}}$ reduces to the usual first-stage F statistic. More generally,
$F_{\mathrm{eff}}$ differs from both the default F and the robust F.


Define $\kappa$ to be the approximate bias of 2SLS (or of LIML), where the bias
is calculated using a higher-order asymptotic approximation due to
Nagar (1959), divided by a worst-case benchmark bias that occurs when
instruments are completely uninformative and when the errors u and v are
perfectly correlated. Then we obtain that value of $F_{\mathrm{eff}}$ for which


$$
\Pr(|\kappa| > \kappa^*) = \alpha
$$


for specified bias threshold $\kappa^*$ and probability $\alpha$. The critical
values generally vary with the dataset, so they are not available in a table.
The community-contributed weakivtest postestimation command, presented
below, provides the critical values for $\kappa^*$ = 0.05, 0.10, 0.20, and 0.30
and for user-provided $\alpha$ (with default $\alpha$ = 0.05).


In the special case of i.i.d. errors and nonrobust F, the critical values of
this test with $\kappa^*$ = 0.10 and $\alpha$ = 0.05 are similar to the
Stock-Yogo critical values for B = 0.10 as $K_2$ ranges from 3 to 30. Andrews,
Stock, and Sun (2019) advocate use of $F_{\mathrm{eff}}$ for detecting weak
instruments when errors are not i.i.d.


7.6.5 The estat firststage command 


Following ivregress, various diagnostics and tests for weak instruments are 
provided by estat firststage. The syntax for this postestimation command 
is 


    estat firststage [, all forcenonrobust]


The all option requests that all first-stage goodness-of-fit statistics be
reported regardless of whether the model contains one or more endogenous
regressors.


The forcenonrobust option is used to allow use of estat firststage
even when the preceding ivregress command computed a robust VCE. This
has the advantage of computing the correct first-stage F statistic. But the
Stock-Yogo critical values are no longer valid because they require
i.i.d. errors. A better procedure, possible when there is a single endogenous
regressor, is to use the community-contributed weakivtest postestimation
command, which implements the effective F statistic test of Montiel Olea
and Pflueger (2013); see section 7.6.7.
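As a minimal sketch of how the option behaves, the following uses simulated data rather than the chapter's dataset; the variable names, coefficients, sample size, and seed are all hypothetical. It fits 2SLS with a robust VCE, calls estat firststage with forcenonrobust, and compares the stored minimum eigenvalue statistic r(mineig) with the nonrobust first-stage F obtained from regress and test. With one endogenous regressor, the two should coincide.

```stata
* Simulated data: every name and parameter value here is hypothetical
clear
set seed 12345
set obs 1000
generate z = rnormal()              // excluded instrument
generate w = rnormal()              // included exogenous regressor
generate v = rnormal()              // first-stage error
generate u = 0.5*v + rnormal()      // structural error, correlated with v
generate x = 0.3*z + 0.2*w + v      // endogenous regressor
generate y = 1 + 0.5*x + 0.4*w + u

* 2SLS with a robust VCE; forcenonrobust is needed for Stock-Yogo output
quietly ivregress 2sls y (x = z) w, vce(robust)
estat firststage, forcenonrobust
scalar mineig = r(mineig)

* Nonrobust first-stage F for the excluded instrument
quietly regress x z w
quietly test z
display "Nonrobust first-stage F = " r(F) "   minimum eigenvalue = " mineig
```

The robust F in the summary table and the minimum eigenvalue statistic will generally differ; only the latter corresponds to the i.i.d. setting assumed by the Stock-Yogo critical values.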


7.6.6 Just-identified model 


We consider a just-identified model with one endogenous regressor, with
hi_empunion instrumented by one variable, ssiratio. For completeness, we
add the all option to print Shea's partial $R^2$, which is unnecessary here
because we have only one endogenous regressor. Because we use
vce(robust) in ivregress, we need to add the forcenonrobust option. The
output is in three parts.


. * Weak instrument tests - just-identified with heteroskedastic-robust errors
. global x2list totchr age female blhisp linc

. qui ivregress 2sls ldrugexp (hi_empunion = ssiratio) $x2list, vce(robust)

. estat firststage, forcenonrobust all

First-stage regression summary statistics

             |            Adjusted      Partial       Robust
    Variable |   R-sq.       R-sq.        R-sq.    F(1,10061)   Prob > F
-------------+----------------------------------------------------------
 hi_empunion |  0.0793      0.0787       0.0212      212.806    0.0000

Shea's partial R-squared

             |     Shea's              Shea's
    Variable |  partial R-sq.   adj. partial R-sq.
-------------+------------------------------------
 hi_empunion |     0.0212             0.0207

Minimum eigenvalue statistic = 217.963

Critical Values                       # of endogenous regressors:    1
H0: Instruments are weak              # of excluded instruments:     1
-----------------------------------+----------------------------------
                                   |     5%     10%     20%     30%
2SLS relative bias                 |         (not available)
-----------------------------------+----------------------------------
                                   |    10%     15%     20%     25%
2SLS size of nominal 5% Wald test  |   16.38    8.96    6.66    5.53
LIML size of nominal 5% Wald test  |   16.38    8.96    6.66    5.53


The first part is a summary table of key diagnostic statistics that are
useful in suggesting weak instruments. The first two statistics are the $R^2$
and adjusted $R^2$ from the first-stage regression. These are around 0.08, so
there will be considerable loss of precision because of IV estimation. They are
not low enough to flag a weak-instruments problem, although, as already noted,
there may still be a problem because ssiratio may be contributing very
little to this fit. To isolate the explanatory power of ssiratio in explaining
hi_empunion, two statistics are given. The partial $R^2$ is that between
hi_empunion and ssiratio after controlling for totchr, age, female,
blhisp, and linc. This is quite low at 0.0212, suggesting some need for
caution. The final statistic is an F statistic for the joint significance of the
instruments excluded from the structural model. For this just-identified
example, this is a test on just ssiratio, and F = 212.81, based on
heteroskedastic-robust standard errors, is simply the square of the t statistic
from the first-stage regression ($14.66^2 = 212.81$). This F statistic of 212.81
is considerably larger than the rule of thumb value of 10 that is sometimes
suggested, so ssiratio does not seem to be a weak instrument.


The second part gives Shea's partial $R^2$, which equals the previously
discussed partial $R^2$ because there is just one endogenous regressor.


The third part implements the tests of Stock and Yogo (2005). As already
noted, their use here is questionable because robust standard errors have
been used. The relative bias test is not available, because the model is just
identified, rather than overidentified by two or more restrictions. The output
for the Wald test distortion test gives critical values for both the 2SLS
estimator and the LIML estimator. We are using the 2SLS estimator, which is
equivalent to the LIML estimator in the just-identified case. If we are willing
to tolerate distortion for a 5% Wald test based on the 2SLS estimator, so that
the true size can be at most 10%, then we reject the null hypothesis of weak
instruments because F = 212.81 > 16.38.


The Stock-Yogo critical values are not applicable here. In this case,
however, default standard errors lead to a first-stage F = 217.96, which is
similar to the heteroskedastic-robust F = 212.81, and both F statistics
greatly exceed 16.38, so we feel comfortable in rejecting the null hypothesis
of weak instruments. In other cases, use of estat firststage with robust
standard errors will be much more questionable.


7.6.7 The weakivtest command 


The community-contributed weakivtest command (Pflueger and
Wang 2015) implements the relative asymptotic bias test of Montiel Olea
and Pflueger (2013) following the ivregress command (or the ivreg2
command, which is presented in section 7.6.9).


As an example, we repeat the preceding just-identified example with
heteroskedastic-robust standard errors. We obtain


. * Weak instrument tests - just-identified using weakivtest
. qui ivregress 2sls ldrugexp (hi_empunion = ssiratio)
> $x2list, vce(robust)

. weakivtest
(obs=10,068)

Montiel-Pflueger robust weak instrument test

Effective F statistic:       212.806
Confidence level alpha:           5%

Critical Values           TSLS      LIML
----------------------------------------
% of Worst Case Bias
tau=5%                  37.418    37.418
tau=10%                 23.109    23.109
tau=20%                 15.062    15.062
tau=30%                 12.039    12.039

The test is performed with alpha = 5% (labeled the confidence level in the
output); this is an option that can be changed. The effective F statistic in
this just-identified example equals the heteroskedastic-robust F statistic of
212.81. The critical value for 2SLS with 10% worst case bias is 23.109, compared
with, for example, 16.38 for Wald test size distortion of 5% using Stock-Yogo
critical values that assume i.i.d. errors. Despite this higher threshold, we
reject the null hypothesis of weak instruments.


7.6.8 Overidentified model 


For a model with a single endogenous regressor that is overidentified, the
output is of the same format as the previous example. The F statistic will
now be a joint test for the several instruments. If there are three or more
instruments, so that there are two or more overidentifying restrictions, then
the Stock-Yogo 2SLS relative-bias criterion can be used.


We consider an example with three overidentifying restrictions:


. * Weak instrument tests - overidentified with heteroskedastic-robust errors
. qui ivregress 2sls ldrugexp (hi_empunion = ssiratio lowincome multlc firmsz)
> $x2list, vce(robust)

. estat firststage, forcenonrobust

First-stage regression summary statistics

             |            Adjusted      Partial       Robust
    Variable |   R-sq.       R-sq.        R-sq.    F(4,10058)   Prob > F
-------------+----------------------------------------------------------
 hi_empunion |  0.0846      0.0838       0.0269      69.8112    0.0000

Minimum eigenvalue statistic = 69.4773

Critical Values                       # of endogenous regressors:    1
H0: Instruments are weak              # of excluded instruments:     4
-----------------------------------+----------------------------------
                                   |     5%     10%     20%     30%
2SLS relative bias                 |   16.85   10.27    6.71    5.34
-----------------------------------+----------------------------------
                                   |    10%     15%     20%     25%
2SLS size of nominal 5% Wald test  |   24.58   13.96   10.26    8.31
LIML size of nominal 5% Wald test  |    5.44    3.87    3.30    2.98


Using either F = 69.81 or the minimum eigenvalue of 69.48, we firmly
reject the null hypothesis of weak instruments. The endogenous regressor
hi_empunion, from structural-model estimates not given, has a coefficient of
-0.817 and a standard error of 0.170 compared with -0.811 and 0.188
when ssiratio is the only instrument.


The Stock-Yogo critical values for both tests are not appropriate here
because robust standard errors are used. From output not given, the weakivtest
command yields an effective F equal to 74.36 that exceeds 5% test-level
critical values of 19.92 and 12.03 for, respectively, worst-case relative bias
of 5% or 10%. The instruments do not appear to be weak.


7.6.9 The ivreg2 command 


The community-contributed ivreg2 command (see Baum, Schaffer, and
Stillman [2007]) provides additional estimators and tests to those provided
by the ivregress command and stores many results conveniently in e(). It
includes 2SLS, LIML, GMM, k-class, and continuously updating estimators. It
computes a range of robust standard errors, including two-way
cluster-robust standard errors. It provides endogeneity tests. And it provides
a wider range of weak-instrument tests and diagnostics. Their 2007 article
provides a very good overview of IV methods and of the ivreg2 command.


The syntax for the ivreg2 command is similar to that for the ivregress 
command. The first, sfirst, and ffirst options provide varying amounts 
of detail from the first-stage regressions. 


We fit the model with four instruments for the single endogenous
regressor, use the first option, and request heteroskedastic-robust
standard errors. We obtain


. * ivreg2 for overidentified model with heteroskedastic-robust standard errors
. ivreg2 ldrugexp (hi_empunion = ssiratio lowincome multlc firmsz) $x2list,
> robust first

First-stage regressions
-----------------------

First-stage regression of hi_empunion:

Statistics robust to heteroskedasticity
                                                   Number of obs =      10068
------------------------------------------------------------------------------
             |               Robust
 hi_empunion | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    ssiratio |  -.1962855   .0153985   -12.75   0.000    -.2264696   -.1661015
   lowincome |  -.0586565   .0118359    -4.96   0.000    -.0818573   -.0354556
      multlc |   .1104924   .0208033     5.31   0.000     .0697139     .151271
      firmsz |   .0035953   .0018985     1.89   0.058    -.0001261    .0073168
      totchr |   .0137979   .0036355     3.80   0.000     .0066716    .0209243
         age |  -.0078626   .0007087   -11.09   0.000    -.0092518   -.0064734
      female |    -.07128   .0096042    -7.42   0.000    -.0901062   -.0524538
      blhisp |  -.0658391   .0121769    -5.41   0.000    -.0897082   -.0419699
        linc |   .0387941   .0062042     6.25   0.000     .0266326    .0509556
       _cons |   1.000387   .0578014    17.31   0.000     .8870848    1.113689
------------------------------------------------------------------------------
F test of excluded instruments:
  F(  4, 10058) =    69.81
  Prob > F      =   0.0000
Sanderson-Windmeijer multivariate F test of excluded instruments:
  F(  4, 10058) =    69.81
  Prob > F      =   0.0000

Summary results for first-stage regressions
-------------------------------------------
                                          (Underid)             (Weak id)
Variable     | F(  4, 10058)  P-val | SW Chi-sq(  4) P-val | SW F(  4, 10058)
hi_empunion  |      69.81    0.0000 |      279.52   0.0000 |       69.81

NB: first-stage test statistics heteroskedasticity-robust

Stock-Yogo weak ID F test critical values for single endogenous regressor:
                                    5% maximal IV relative bias    16.85
                                   10% maximal IV relative bias    10.27
                                   20% maximal IV relative bias     6.71
                                   30% maximal IV relative bias     5.34
                                   10% maximal IV size             24.58
                                   15% maximal IV size             13.96
                                   20% maximal IV size             10.26
                                   25% maximal IV size              8.31
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for i.i.d. errors only.

Underidentification test
Ho: matrix of reduced form coefficients has rank=K1-1 (underidentified)
Ha: matrix has rank=K1 (identified)
Kleibergen-Paap rk LM statistic          Chi-sq(4)=234.51   P-val=0.0000

Weak identification test
Ho: equation is weakly identified
Cragg-Donald Wald F statistic                                      69.48
Kleibergen-Paap Wald rk F statistic                                69.81

Stock-Yogo weak ID test critical values for K1=1 and L1=4:
                                    5% maximal IV relative bias    16.85
                                   10% maximal IV relative bias    10.27
                                   20% maximal IV relative bias     6.71
                                   30% maximal IV relative bias     5.34
                                   10% maximal IV size             24.58
                                   15% maximal IV size             13.96
                                   20% maximal IV size             10.26
                                   25% maximal IV size              8.31
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.

Weak-instrument-robust inference
Tests of joint significance of endogenous regressors B1 in main equation
Ho: B1=0 and orthogonality conditions are valid
Anderson-Rubin Wald test           F(4,10058)=     9.74     P-val=0.0000
Anderson-Rubin Wald test           Chi-sq(4)=     38.99     P-val=0.0000
Stock-Wright LM S statistic        Chi-sq(4)=     35.94     P-val=0.0000

NB: Underidentification, weak identification and weak-identification-robust
    test statistics heteroskedasticity-robust

Number of observations               N  =      10068
Number of regressors                 K  =          7
Number of endogenous regressors      K1 =          1
Number of instruments                L  =         10
Number of excluded instruments       L1 =          4

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity

                                                      Number of obs =    10068
                                                      F(  6, 10061) =   339.04
                                                      Prob > F      =   0.0000
Total (centered) SS     =  18664.39028                Centered R2   =   0.0818
Total (uncentered) SS   =   441571.929                Uncentered R2 =   0.9612
Residual SS             =  17137.60036                Root MSE      =    1.305

------------------------------------------------------------------------------
             |               Robust
    ldrugexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
 hi_empunion |  -.8174208   .1705701    -4.79   0.000    -1.151732   -.4831095
      totchr |   .4492737   .0100416    44.74   0.000     .4295926    .4689548
         age |  -.0123008   .0026668    -4.61   0.000    -.0175276   -.0070741
      female |  -.0145257   .0302726    -0.48   0.631     -.073859    .0448075
      blhisp |  -.2093601   .0380957    -5.50   0.000    -.2840264   -.1346939
        linc |   .0820149   .0197741     4.15   0.000     .0432584    .1207714
       _cons |   6.698209   .2335101    28.68   0.000     6.240537     7.15588
------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic):            234.509
                                                   Chi-sq(4) P-val =    0.0000
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):               69.477
                         (Kleibergen-Paap rk Wald F statistic):         69.811
Stock-Yogo weak ID test critical values:  5% maximal IV relative bias    16.85
                                         10% maximal IV relative bias    10.27
                                         20% maximal IV relative bias     6.71
                                         30% maximal IV relative bias     5.34
                                         10% maximal IV size             24.58
                                         15% maximal IV size             13.96
                                         20% maximal IV size             10.26
                                         25% maximal IV size              8.31
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments):        11.027
                                                   Chi-sq(3) P-val =    0.0116
------------------------------------------------------------------------------
Instrumented:         hi_empunion
Included instruments: totchr age female blhisp linc
Excluded instruments: ssiratio lowincome multlc firmsz
------------------------------------------------------------------------------


The first set of output presents OLS estimates of the first-stage model for
the endogenous regressor hi_empunion, and the output is identical to that
obtained using ivregress with option first. If we had used option sfirst
rather than first, the output would additionally include OLS estimates of the
similar first-stage model for ldrugexp.


The next set of output gives various summary measures from the
first-stage regression. The F = 69.81 is the same as that given in
section 7.6.8 using the estat firststage command, and the Cragg-Donald
statistic of 69.48 is the F statistic if default nonrobust standard errors were
used. Critical values are available for both Stock-Yogo weak-instrument tests
because there are at least two overidentifying restrictions, though these are
for models with i.i.d. errors. There is no indication of a weak-instrument
problem. The output also gives the Anderson-Rubin Wald test for statistical
significance of the endogenous regressor hi_empunion that is robust to weak
instruments as explained in section 7.7.2.


The final set of output gives 2SLS estimates of the structural equation that
equal those from the ivregress command given in section 7.4.4. It includes
some repetition of the underidentification and weak identification tests
because the final set of output is all that would be given if the first option
was not used. The overidentification test statistic equals 11.03 with
p = 0.012, so the three overidentifying restrictions are rejected at level 0.05.


The endogenous regressor hi_empunion has z = -4.79, corresponding
to F(1, N - K) = $4.79^2$ = 22.94. By comparison, the Anderson-Rubin
statistic, which is robust to weak instruments, has F(4, N - K) = 9.74. In
both cases, p = 0.000, so the endogenous variable hi_empunion is
statistically significant at the 5% level.
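The arithmetic behind these two F statistics can be reproduced from the reported output alone. The sketch below squares the z statistic and, as an assumption about how ivreg2 rescales the statistic (not something stated in the text), converts the Anderson-Rubin chi-squared(4) value of 38.99 to an F by dividing by the number of excluded instruments L1 = 4 and multiplying by (N - L)/N with N = 10068 and L = 10.

```stata
* Squared z statistic of hi_empunion from the ivreg2 output
display %6.2f 4.79^2

* Anderson-Rubin F implied by the chi-squared(4) statistic, ASSUMING the
* rescaling F = (chi2/L1)*(N-L)/N with N = 10068, L = 10, L1 = 4
display %6.2f (38.99/4)*(10058/10068)
```

The two lines display 22.94 and 9.74, matching the values discussed above. The comparison shows that the weak-instrument-robust statistic is much smaller here, in part because it is a joint test on four instrument coefficients rather than a test on a single parameter.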


7.6.10 More than one endogenous regressor 


The preceding applications had just one endogenous regressor. With more
than one endogenous regressor, estat firststage reports weak-instruments
diagnostics that include Shea's partial $R^2$ and the Stock-Yogo tests based on
the Cragg-Donald minimum eigenvalue statistic, which generalizes the F
statistic. The all option additionally leads to reporting for each endogenous
regressor the first-stage regression and associated F statistic and partial
$R^2$.


The Cragg-Donald minimum eigenvalue statistic provides an overall test
of weak instruments. Sanderson and Windmeijer (2016) propose corrected
conditional F statistics for the first-stage regressions of the endogenous
regressors that, under the assumption of i.i.d. errors, can be compared with
Stock-Yogo critical values. This provides additional detail on the nature of
any weak-instrument problem. The ivreg2 output includes the
Sanderson-Windmeijer test.


7.6.11 Sensitivity to choice of instruments 


In the main equation, hi_empunion has a strong negative impact on 
ldrugexp. This contrasts with a small positive effect observed in the OLS 
results when hi_empunion is treated as exogenous; see section 7.4.6. If our 
instrument ssiratio is valid, then this would suggest a substantial bias in 
the OLS result. But is this result sensitive to the choice of the instrument? 


To address this question, we compare results for four just-identified
specifications, each estimated using just one of the four available
instruments. We present a table with the structural-equation estimates for OLS
and for the four IV estimations, followed by the heteroskedastic-robust
first-stage F statistic. We use the ivreg2 command because it saves the
F statistic in e(widstat). We have


. * Compare four just-identified model estimates with different instruments
. qui regress ldrugexp hi_empunion $x2list, vce(robust)
. estimates store OLSO
. qui ivreg2 ldrugexp (hi_empunion=ssiratio) $x2list, robust
. estimates store IV_INST1
. scalar f1 = e(widstat)
. qui ivreg2 ldrugexp (hi_empunion=lowincome) $x2list, robust
. estimates store IV_INST2
. scalar f2 = e(widstat)
. qui ivreg2 ldrugexp (hi_empunion=multlc) $x2list, robust
. estimates store IV_INST3
. scalar f3 = e(widstat)
. qui ivreg2 ldrugexp (hi_empunion=firmsz) $x2list, robust
. estimates store IV_INST4
. scalar f4 = e(widstat)

. estimates table OLSO IV_INST1 IV_INST2 IV_INST3 IV_INST4, b(%8.4f) se

    Variable |     OLSO    IV_INST1    IV_INST2    IV_INST3    IV_INST4
-------------+---------------------------------------------------------
 hi_empunion |    0.0732    -0.8115      0.0910     -1.3491     -2.9340
             |    0.0260     0.1884      0.3607      0.4249      1.4038
      totchr |    0.4402     0.4492      0.4400      0.4547      0.4708
             |    0.0094     0.0100      0.0101      0.0116      0.0204
         age |   -0.0034    -0.0122     -0.0032     -0.0176     -0.0335
             |    0.0019     0.0028      0.0042      0.0048      0.0144
      female |    0.0570    -0.0141      0.0584     -0.0572     -0.1844
             |    0.0254     0.0310      0.0382      0.0449      0.1201
      blhisp |   -0.1487    -0.2090     -0.1475     -0.2455     -0.3534
             |    0.0342     0.0384      0.0417      0.0490      0.1099
        linc |    0.0116     0.0815      0.0102      0.1240      0.2492
             |    0.0137     0.0208      0.0312      0.0373      0.1135
       _cons |    5.8464     6.6925      5.8295      7.2067      8.7224
             |    0.1574     0.2449      0.3835      0.4443      1.3651
-----------------------------------------------------------------------
                                                          Legend: b/se

. display "Robust first-stage F: " f1 _s(2) f2 _s(2) f3 _s(2) f4
Robust first-stage F: 212.80621  58.760623  52.412092  14.357187


The different instruments produce very different IV estimates for the
coefficient of the endogenous regressor hi_empunion, though they are within
two standard errors of each other (with the exception of that with lowincome
as the instrument). All differ greatly from OLS estimates, aside from when
lowincome is the instrument (IV_INST2). The coefficient of the most highly
statistically significant regressor, totchr, changes little with the choice of
instrument. Coefficients of some of the other exogenous regressors change
considerably, though there is no sign reversal aside from female.
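The same kind of comparison can be written as a loop. The sketch below is self-contained and uses simulated data with one strong and one weak instrument, so all variable names and parameter values are hypothetical rather than taken from the chapter's data. It relies only on official commands: the robust first-stage F is obtained from regress and test instead of from e(widstat), which is specific to ivreg2.

```stata
* Simulated data: every name and parameter value here is hypothetical
clear
set seed 10101
set obs 2000
generate z1 = rnormal()              // strong instrument by construction
generate z2 = rnormal()              // weak instrument by construction
generate v  = rnormal()              // first-stage error
generate u  = 0.6*v + rnormal()      // structural error, correlated with v
generate x  = 0.5*z1 + 0.05*z2 + v   // endogenous regressor
generate y  = 1 - 0.8*x + u          // true coefficient on x is -0.8

foreach z in z1 z2 {
    * Heteroskedastic-robust first-stage F for the single instrument
    quietly regress x `z', vce(robust)
    quietly test `z'
    scalar F_`z' = r(F)
    * Just-identified 2SLS using only this instrument
    quietly ivregress 2sls y (x = `z'), vce(robust)
    display "`z'" _col(6) "b = " %8.4f _b[x] "  se = " %8.4f _se[x] ///
        "  robust first-stage F = " %8.2f F_`z'
}
```

In this design, the estimate based on z2 would be expected to have a much larger standard error and a much smaller first-stage F than the estimate based on z1.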


The heteroskedastic-robust first-stage F statistic for the final model with
firmsz is low enough to indicate a weak-instrument problem using this
instrument. Because inference here is heteroskedastic-robust, we perform
the relative asymptotic bias test of Montiel Olea and Pflueger (2013), with
maximum relative 10% bias occurring with probability 0.05. We obtain


. * Montiel-Olea critical value for 10% relative bias with alpha = 0.05 
. qui ivregress 2sls ldrugexp (hi_empunion=firmsz) $x2list, vce(robust) 


. qui weakivtest 


. di "Effective F = " r(F_eff) " with critical value = " r(c_TSLS_10) 
Effective F = 14.357187 with critical value = 23.10851 


. clear mata 


The final instrument, firmsz, is weak. 


When we perform a similar sensitivity analysis by progressively adding
lowincome, multlc, and firmsz as instruments to an originally
just-identified model with ssiratio as the instrument, there is little change
in the 2SLS estimates; see the exercises at the end of this chapter.


There are several possible explanations for the sensitivity in the
just-identified case. The results could reflect the expected high variability
of the IV estimator in a just-identified model, variability that also reflects
the differing strength of different instruments. Some of the instruments may
not be valid instruments. If treatment effects are heterogeneous, then IV can
be given a LATE interpretation (see section 25.5), especially in the two cases
(lowincome and multlc) with binary instruments. Then the instruments may
correspond to different LATEs.


Equally, the results may reflect model misspecification: relative to a model
one would fit in a serious empirical analysis of ldrugexp, the model used in
the example is rather simple. While here we concentrate on the statistical
tools for exploring the issue, in practice a careful context-specific
investigation based on relevant theory is required to satisfactorily resolve
the issue.


7.7 Inference with weak instruments 


The preceding section presented diagnostics and tests for weak instruments. 
Here we first discuss problems that arise from the common practice of 
performing hypothesis tests and computing confidence intervals following 
pretesting for weak instruments. 


We then present inference that is valid regardless of whether instruments
are weak and does not entail pretesting for weak instruments. The leading
example is the Anderson-Rubin Wald test, the preferred method in a
just-identified model.


7.7.1 Tests following pretesting for weak instruments 


A common procedure is to first screen for weak instruments and then 
perform regular asymptotic inference on the coefficient of endogenous 
regressors if the test is passed. Such pretesting is problematic because it 
distorts test size. 


For simplicity, consider the leading case of a just-identified equation
with a single endogenous regressor. We decide that instruments are not weak
if the first-stage F exceeds some critical value F*, such as F* = 10, and
then proceed to perform a regular two-sided Wald test of $H_0: \beta_1 = 0$ and
determine statistical significance at 5% if |t| > 1.96.


Andrews, Stock, and Sun (2019), for example, provide Monte Carlo
evidence that such screening on the first-stage F can lead to large size
distortion of the subsequent Wald test. The problem is that the true test size
is not simply Pr(|t| > 1.96) but is instead
$\Pr\{(F > F^*) \cap (|t| > 1.96)\}$, because the null hypothesis is rejected
only when the screen is passed and the Wald test rejects.
For example, with i.i.d. errors, the Stock-Yogo critical value for a 5% Wald
test with size distortion of at most 10% is 8.96. So if we pass a
weak-instrument screening test of whether F > 8.96, then the true test size of
a nominal 5% test of whether $\beta_1 = 0$ could be as high as 15%!
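The effect of screening can be explored by simulation. The following is a minimal Monte Carlo sketch, not the design used by Andrews, Stock, and Sun (2019): the sample size, first-stage coefficient, error correlation, number of replications, and the program name pretestsim are all hypothetical choices. Each replication generates a just-identified model with true coefficient zero, records the nonrobust first-stage F and the 2SLS Wald statistic, and the rejection rate of the nominal 5% test is then compared with and without the F > 10 screen.

```stata
* Monte Carlo sketch: all design values and the program name are hypothetical
clear all
set seed 10101

program define pretestsim, rclass
    drop _all
    set obs 200
    generate z = rnormal()                            // single instrument
    generate v = rnormal()                            // first-stage error
    generate u = 0.9*v + sqrt(1 - 0.9^2)*rnormal()    // Cor(u,v) = 0.9
    generate x = 0.2*z + v                            // fairly weak first stage
    generate y = u                                    // true beta_1 = 0
    regress x z
    return scalar F = (_b[z]/_se[z])^2                // first-stage F = t^2
    ivregress 2sls y (x = z)
    return scalar t = _b[x]/_se[x]                    // Wald statistic for beta_1 = 0
end

simulate F=r(F) t=r(t), reps(1000) nodots: pretestsim

generate byte pass = F > 10             // sample passes the weak-instrument screen
generate byte rej  = abs(t) > 1.96      // nominal 5% Wald test rejects
summarize rej                           // rejection rate ignoring the screen
summarize rej if pass                   // rejection rate among samples that pass
```

Because the null hypothesis is true in every replication, the means reported by summarize are estimated test sizes; how far they depart from 0.05 depends on the hypothetical design values chosen here.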


Lee et al. (2021) argue that weak instruments can be a problem in some
applications that have F far larger than 10. They address this issue
in detail for the model with single endogenous regressor and single
instrument. They use weak-instrument asymptotics that permit
non-i.i.d. errors, such as clustered errors, provided the F statistic and Wald
test are based on the same robust standard error method. The authors find that
a nominal 5% Wald test with critical value 1.96 that follows a pretest that
F > 10 has true size that could be as high as 11.3%. Even more problematic
is that it is very difficult to attain a test with true size of 5%. Without other
restrictions, it is possible if F > 104.7 and |t| > 1.96, which requires a very
strong first stage, and it is possible if F > 10 and |t| > 3.43, which requires
a highly statistically significant structural parameter estimate. The authors
provide tables of selected values of adjusted Wald test critical values, called
tF critical values, that decrease as the first-stage F statistic increases.


These limitations become even greater when effective sample sizes are 
not so large that asymptotic theory provides a good approximation; see 
section 7.8. 


If it is felt that instruments are likely to be weak, it is best to not first 
screen on the first-stage F statistic, though it remains informative to report 
F. Instead, one should use inference methods that are robust to weak 
instruments, methods that we now present. 


7.7.2 Anderson-Rubin Wald test


Consider the model with structural equation (7.5) and first-stage equations
(7.6). In matrix terms, $y_1 = Y_2\beta_1 + X_1\beta_2 + u$ and
$Y_2 = X_1\Pi_1 + X_2\Pi_2 + V$, where $X_2$ are the additional instruments for
the endogenous regressors $Y_2$.


The reduced-form equation for $y_1$ is obtained by substituting out $Y_2$. We
have


$$
\begin{aligned}
y_1 &= Y_2\beta_1 + X_1\beta_2 + u \\
    &= (X_1\Pi_1 + X_2\Pi_2 + V)\beta_1 + X_1\beta_2 + u \\
    &= X_1(\Pi_1\beta_1 + \beta_2) + X_2(\Pi_2\beta_1) + (V\beta_1 + u) \\
    &= X_1\gamma + X_2\alpha + w
\end{aligned}
$$


where $\alpha = \Pi_2\beta_1$. For a just-identified single endogenous regressor
model, $\alpha = \pi_2\beta_1$.
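In the just-identified case with a single endogenous regressor, this relation can be inverted whenever the first-stage coefficient is nonzero. The following worked step is a standard result (indirect least squares) rather than something taken from the text; here $\alpha$ denotes the reduced-form coefficient of the instrument, $\pi_2$ its first-stage coefficient, and $\beta_1$ the structural coefficient of the endogenous regressor.

```latex
\alpha = \pi_2 \beta_1
\quad\Longrightarrow\quad
\beta_1 = \frac{\alpha}{\pi_2} \ \ (\pi_2 \neq 0),
\qquad
\hat{\beta}_{1,\mathrm{IV}} = \frac{\hat{\alpha}}{\hat{\pi}_2}
```

The ratio form makes the weak-instrument problem visible: when $\hat{\pi}_2$ is close to 0, the estimator divides by a noisy quantity near 0, whereas a test of $\alpha = 0$ uses only the numerator and involves no such division.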


It follows that $\beta_1 = 0$ implies that $\alpha = 0$. So we can test
$H_0: \beta_1 = 0$ by performing a Wald test of whether the coefficients of the
instruments $x_{2i}$ are 0 in the reduced-form regression of $y_{1i}$ on the
exogenous regressors $x_{1i}$ and the instruments $x_{2i}$.


The attraction of this approach is that the right-hand side regressors do
not involve the endogenous regressors $y_{2i}$. This leads to a denominator in
the formula for the estimator that does not have the randomness property that
causes problems for the usual 2SLS estimator. So there is no problem of weak
instruments, and one can use regular asymptotics. This test due to Anderson
and Rubin (1949) is called the AR test. As pointed out by Chernozhukov and
Hansen (2008), one can also base inference on heteroskedastic-robust,
cluster-robust, or HAC standard errors.


The limitation of this approach is that it loses power in overidentified
models. One way to see this is to note that in the common case of a single
endogenous regressor, we want to test only the scalar $\beta_1$, but the test is a
joint test of $K_2$ parameters, where $K_2$ is the number of instruments $x_{2i}$. A
second way to see this is to note that $\alpha$ also equals 0 if $\Pi_2 = 0$, so the
test is also one of identification.


Anderson-Rubin test example of test of statistical significance


To illustrate the method, we consider an overidentified example, with
instruments ssiratio and firmsz for the single endogenous regressor
hi_empunion, so $\beta_1$ is a scalar, in the structural equation with dependent
variable ldrugexp. Heteroskedastic-robust inference is performed. We have


. * Anderson-Rubin test for overidentified sample with robust standard errors
. regress ldrugexp ssiratio firmsz $x2list, vce(robust)

Linear regression                               Number of obs     =     10,068
                                                F(7, 10060)       =     326.22
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1788
                                                Root MSE          =     1.2343

------------------------------------------------------------------------------
             |               Robust
    ldrugexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    ssiratio |   .1740745   .0395217     4.40   0.000      .096604     .251545
      firmsz |  -.0191037   .0059784    -3.20   0.001    -.0308226   -.0073847
      totchr |   .4382028   .0093954    46.64   0.000     .4197859    .4566196
         age |  -.0055112   .0019412    -2.84   0.005    -.0093163   -.0017061
      female |   .0451868   .0253462     1.78   0.075    -.0044968    .0948704
      blhisp |  -.1572397   .0341391    -4.61   0.000    -.2241591   -.0903202
        linc |   .0449053   .0148784     3.02   0.003     .0157407    .0740699
       _cons |   5.865519   .1557835    37.65   0.000     5.560152    6.170886
------------------------------------------------------------------------------

. test ssiratio firmsz

 ( 1)  ssiratio = 0
 ( 2)  firmsz = 0

       F(  2, 10060) =   14.98
            Prob > F =    0.0000


The Anderson-Rubin Wald test statistic for statistical significance of
hi_empunion is F = 14.98 with p = 0.0000, so the endogenous regressor is
highly statistically significant. This coincides with the value found using the
ivreg2 command in section 7.6.9.


Anderson-Rubin test for general test


The AR procedure can be used to test values of $\beta_1$ other than zero.


Suppose we wish to test $H_0$: $\beta_1 = \beta_{10}$. If we subtract $Y_2\beta_{10}$ from both
sides of the original structural equation and obtain a reduced form as before,
we have

$$
\begin{aligned}
y_1 - Y_2\beta_{10} &= Y_2(\beta_1 - \beta_{10}) + X_1\beta_2 + u \\
  &= (X_1\Pi_1 + X_2\Pi_2 + V)(\beta_1 - \beta_{10}) + X_1\beta_2 + u \\
  &= X_1\{\Pi_1(\beta_1 - \beta_{10}) + \beta_2\} + X_2\{\Pi_2(\beta_1 - \beta_{10})\}
     + \{V(\beta_1 - \beta_{10}) + u\} \\
  &= X_1\gamma + X_2\alpha + w
\end{aligned}
$$

where now $\alpha = \Pi_2 \times (\beta_1 - \beta_{10})$. Thus, we perform the same reduced-form
regression as before, except the dependent variable is now
$y_{1i} - y_{2i}'\beta_{10}$.


The next example does so for a test of $H_0$: $\beta_1 = -0.6$.


. * Anderson-Rubin test for beta = -0.6 rather than beta = 0
. qui generate yfordiffbeta = ldrugexp - (-0.6)*hi_empunion

. regress yfordiffbeta ssiratio firmsz $x2list, vce(robust)

Linear regression                               Number of obs     =     10,068
                                                F(7, 10060)       =     312.39
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1724
                                                Root MSE          =     1.2754

------------------------------------------------------------------------------
             |               Robust
yfordiffbeta | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    ssiratio |   .0426252   .0406361     1.05   0.294    -.0370298    .1222802
      firmsz |  -.0157014   .0065527    -2.40   0.017    -.0285459   -.0028569
      totchr |   .4462441   .0097066    45.97   0.000     .4272172    .4652709
         age |   -.010578   .0020055    -5.27   0.000    -.0145091   -.0066469
      female |   .0017949   .0262003     0.07   0.945     -.049563    .0531528
      blhisp |  -.1942316   .0352794    -5.51   0.000    -.2633862   -.1250769
        linc |   .0721462   .0154753     4.66   0.000     .0418116    .1024809
       _cons |   6.486113   .1611032    40.26   0.000     6.170319    6.801908
------------------------------------------------------------------------------

. test ssiratio firmsz

 ( 1)  ssiratio = 0
 ( 2)  firmsz = 0

       F(  2, 10060) =    3.44
            Prob > F =    0.0320


The hypothesis that $\beta_1 = -0.6$ is rejected at level 0.05 because
p = 0.032 < 0.05.


7.7.3 Anderson-Rubin Wald confidence regions


From section 11.3.13, a confidence interval, or more generally a confidence
region, can be obtained by inverting a test. Specifically, to obtain a 95%
confidence set for a parameter $\theta$, we perform a two-sided test of
$\theta = \theta^*$ for a range of values of $\theta^*$. The confidence set is then those
values of $\theta^*$ for which the test has p > 0.05 because the 95% confidence
interval includes those values that we do not reject at level 0.05.


In the single endogenous regressor case, we fit by OLS the model
$y_{1i} - \beta_1^* y_{2i} = x_{1i}'\gamma + x_{2i}'\alpha + w_i$ for a range of values of
$\beta_1^*$. A 95% Wald AR confidence region for $\beta_1$ is those values of $\beta_1^*$ for
which a Wald test at level 5% does not lead to rejection of the hypothesis
that $\alpha = 0$.
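

The test inversion just described can be sketched as a simple grid search. The
following is only an illustration under stated assumptions: the chapter's data
are in memory, the global x2list is already defined, and the grid limits
(-1.5 to 0), the step (0.01), and the variable name yar are arbitrary choices
that are not part of the original analysis.

```stata
* Sketch: invert the robust AR Wald test over a grid of candidate values of beta1
* (grid limits, step, and the variable name yar are illustrative assumptions)
capture drop yar
quietly generate double yar = .
forvalues i = 0/150 {
    local b0 = -1.5 + 0.01*`i'
    * Dependent variable under the null beta1 = b0
    quietly replace yar = ldrugexp - (`b0')*hi_empunion
    quietly regress yar ssiratio firmsz $x2list, vce(robust)
    quietly test ssiratio firmsz
    * List the candidate values that are not rejected at the 5% level
    if r(p) > 0.05 {
        display "beta1 = " %6.2f `b0' "   AR p-value = " %6.4f r(p)
    }
}
```

The listed values approximate the AR confidence region on the chosen grid. A
finer grid or wider limits may be needed in other applications, particularly
when the region is unbounded or disjoint.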


For the current example, we know from the preceding AR 5% tests that a
95% confidence interval does not include $\beta_1 = 0.0$, because $H_0$: $\beta_1 = 0.0$
was rejected at level 0.05. Similarly, it does not include $\beta_1 = -0.6$, because
in section 7.7.2 $H_0$: $\beta_1 = -0.6$ was rejected at level 0.05. From output
given below, the community-contributed condivreg command yields an
AR 95% confidence interval [-1.045, -0.791], computed under i.i.d. errors, for
the current overidentified example.


In general, inversion of tests can yield confidence intervals that are
disjoint, such as $(a, b) \cup (c, d)$, or even empty. Thus, the confidence
intervals are more correctly referred to as confidence regions or confidence
sets.


The AR confidence regions could be any of the forms 1) $(b_1, b_2)$; 2)
$(-\infty, b_1) \cup (b_2, \infty)$; 3) $(-\infty, \infty)$; or 4) $\emptyset$, where $\emptyset$ denotes
the empty set. The third case can arise when the instruments are so weak that
the data provide very little information on $\beta_1$. The fourth case can arise
because the AR test is also one of identification.


Practitioners can find this unsettling because the standard Wald test
based on $\widehat{\beta}_{1,\mathrm{2SLS}}$ always yields confidence regions of the first form.
The AR Wald test based on $\widehat{\alpha}$ has the added complication that the
estimate of $\alpha$ varies with $\beta_1^*$. In the just-identified case, we denote the
estimate $\widehat{\alpha}(\beta_1^*)$. Then the nonrejection region is
$[\widehat{\alpha}(\beta_1^*)/\mathrm{se}\{\widehat{\alpha}(\beta_1^*)\}]^2 < 1.96^2$.
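

To see why the shapes listed above arise, it may help to write the
nonrejection region as a quadratic inequality. The following is a sketch for
the just-identified case with one instrument; the symbols $\widehat{\delta}$,
$\widehat{\pi}$, $v_{\delta\delta}$, $v_{\delta\pi}$, $v_{\pi\pi}$, and $c$ are introduced here for
illustration and do not appear in the original text. Let $\widehat{\delta}$ and
$\widehat{\pi}$ be the OLS coefficients of the instrument in the reduced forms for
$y_1$ and $y_2$, with estimated variances $v_{\delta\delta}$ and $v_{\pi\pi}$ and
covariance $v_{\delta\pi}$. Because OLS is linear in the dependent variable,

$$
\widehat{\alpha}(\beta_1^*) = \widehat{\delta} - \beta_1^*\widehat{\pi}, \qquad
\mathrm{se}\{\widehat{\alpha}(\beta_1^*)\}^2 = v_{\delta\delta} - 2\beta_1^* v_{\delta\pi} + \beta_1^{*2} v_{\pi\pi}
$$

so with $c = 1.96^2$ the nonrejection region is the set of $\beta_1^*$ satisfying

$$
(\widehat{\pi}^2 - c\,v_{\pi\pi})\,\beta_1^{*2}
 - 2(\widehat{\delta}\widehat{\pi} - c\,v_{\delta\pi})\,\beta_1^*
 + (\widehat{\delta}^2 - c\,v_{\delta\delta}) < 0
$$

If $\widehat{\pi}^2/v_{\pi\pi} > c$, that is, if the instrument is statistically
significant at 5% in the first stage, the parabola opens upward and the region
is a bounded interval. Otherwise, the parabola opens downward and the region is
either the union of two rays or the whole real line. In this just-identified
sketch, the region always contains $\widehat{\delta}/\widehat{\pi}$, where
$\widehat{\alpha} = 0$, so it cannot be empty.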


7.7.4 Tests under weak-instrument asymptotics 


The Anderson-Rubin approach does not require the use of weak-instrument
asymptotics. It has low power in overidentified models, however, leading to
alternative procedures that do use weak asymptotics.


We focus on inference on the coefficients $\beta_1$ of the endogenous
regressors in the structural model. Then weak-instrument asymptotics yield
tests with asymptotically correct size and confidence intervals with
asymptotically correct coverage even if instruments are weak.


Specifically, the asymptotics assume that $\Pi_2 = C/\sqrt{N}$, where $\Pi_2$ are
the coefficients of the instruments in the first-stage model for the
endogenous regressors: $Y_2 = X_1\Pi_1 + X_2\Pi_2 + V$. Specifying $\Pi_2$ to
decline at rate $\sqrt{N}$ has the consequence that the concentration parameter in
(7.15) is constant as the sample size grows, rather than declining at rate 1/N.

The analysis is complicated by the fact that, unlike conventional 
asymptotics, the Wald, LR, and LM weak-instrument tests are not 
asymptotically equivalent under local alternatives if the model is 
overidentified. 


Conditional likelihood-ratio test 


For models with i.i.d. errors, Moreira (2003) proposed the conditional
likelihood-ratio (CLR) test. The test is the LR test for $H_0$: $\beta_1 = \beta_{10}$ based
on the assumptions that errors in the reduced form for $(y_1, y_2)$ are i.i.d.
$N(0, \Omega)$ and that $\Omega$ is known; the term "conditional" arises because it is
the LR test conditional on $\Omega$ being known. The resulting properties of the test
require only the i.i.d. assumption; they do not require normality or that
$\Omega$ be known.


With a single endogenous regressor and a just-identified model, the test 
reduces to the AR Wald test. In overidentified models, the CLR test has 
optimal power and has better power than an LM test, while the AR test can 
have very poor power. 


The condivreg command 


The community-contributed condivreg command, developed by Mikusheva
and Poi (2006), implements these tests that are surveyed and further
developed in Andrews, Moreira, and Stock (2007) and Mikusheva (2010). It
also reports confidence intervals that are obtained by inverting the test
statistics.


The condivreg command has the same basic syntax as ivregress,
except that the specific estimator used (2sls or liml) is passed as an option.
The default is to report a corrected p-value for the test of statistical
significance and a corrected 95% confidence region based on the CLR test
statistic (Moreira 2003). The lm option computes the LM test and associated
confidence interval, and the ar option computes the Anderson-Rubin test
statistic. The test(#) option is used to get p-values for tests of values other
than zero for the coefficient of the endogenous regressor.


For the current overidentified example with two instruments, we have 


. * Condivreg: Weak instrument robust inference for i.i.d. errors
. condivreg ldrugexp (hi_empunion=ssiratio firmsz) $x2list, 2sls lm ar test(0)

Instrumental variables (2SLS) regression

First-stage results                              Number of obs =     10068
-----------------------                          F(  6, 10061) =    319.93
F(  2, 10060) =   112.55                         Prob > F      =    0.0000
Prob > F      =   0.0000                         R-squared     =    0.0654
R-squared     =   0.0799                         Adj R-squared =    0.0649
Adj R-squared =   0.0793                         Root MSE      =     1.317

------------------------------------------------------------------------------
    ldrugexp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
 hi_empunion |  -.8910603   .1881943    -4.73   0.000    -1.259959   -.5221619
      totchr |   .4500233   .0103835    43.34   0.000     .4296695    .4703771
         age |  -.0130392   .0027449    -4.75   0.000    -.0184197   -.0076586
      female |  -.0204353    .030708    -0.67   0.506    -.0806291    .0397584
      blhisp |   -.214372   .0382284    -5.61   0.000    -.2893072   -.1394368
        linc |   .0878329   .0209322     4.20   0.000     .0468016    .1288642
       _cons |   6.768635   .2416588    28.01   0.000     6.294936    7.242335
------------------------------------------------------------------------------
Instrumented:  hi_empunion
Instruments:   totchr age female blhisp linc ssiratio firmsz
Confidence set and p-value for hi_empunion are based on normal approximation

Coverage-corrected confidence sets and p-values
for Ho: _b[hi_empunion] = 0
LIML estimate of _b[hi_empunion] = -.91599

------------------------------------------------------------------------------
           Test |  Confidence Set                                     p-value
----------------+-------------------------------------------------------------
 Conditional LR |  [-1.313162, -.5549692]                              0.0000
 Anderson-Rubin |  [-1.045006, -.7910382]                              0.0000
     Score (LM) |  [-1.317044, -.5517649] U [ 6.650289,  7.420294]     0.0000
------------------------------------------------------------------------------


All tests have p = 0.0000 and strongly reject the null hypothesis that
hi_empunion is statistically insignificant.


Mikusheva (2010) develops fast algorithms for computing the tests and
summarizes the possible shapes of the AR, CLR, and LM confidence regions
when errors are i.i.d.


The AR confidence regions could be any of the forms 1) $(b_1, b_2)$; 2)
$(-\infty, b_1) \cup (b_2, \infty)$; 3) $(-\infty, \infty)$; or 4) $\emptyset$, where $\emptyset$ denotes
the empty set. The CLR confidence regions could be any of the first three forms
just given.


The LM confidence region could be any of 1) $(b_1, b_2) \cup (b_3, b_4)$; 2)
$(-\infty, b_1) \cup (b_2, b_3) \cup (b_4, \infty)$; or 3) $(-\infty, \infty)$.


Mikusheva (2010) provides strong reasons for not using the LM test. 
While the CLR confidence regions are theoretically better than those for AR in 
overidentified models, they are not guaranteed to be shorter in a given 
application. 


7.7.5 Minimum distance-based tests and confidence regions 


Stock and Wright (2000) considered weak-instrument inference for optimal 
GMM estimators, including nonlinear GMM. They proposed so-called s tests 
and confidence intervals, where s denotes a suitable objective function, and 
considered homoskedastic and heteroskedastic errors. In the simplest case of 
linear simultaneous equations the s confidence intervals reduce to AR 
confidence intervals. Kleibergen (2005) proposed extensions of the CLR and 
LM tests to the continuous updating GMM estimator. 


Magnusson (2010) proposed tests based on minimum distance (MD)
estimation rather than GMM estimation. This bases estimation of the key
structural parameter $\beta_1$ on only the reduced-form parameter estimates. The
advantage is that weak instruments pose no problem for estimating the
first-stage model parameters. And various estimators can be used for the
variance of reduced-form estimators, allowing for heteroskedastic-robust and
cluster-robust standard errors; the theory is not limited to i.i.d. errors.


As an example, consider the model of section 7.7.2. Then the reduced
form for $y_1$ had the restriction $\alpha = \Pi_2 \times \beta_1$, which implies
$\alpha - \Pi_2 \times \beta_1 = 0$. This suggests estimating the structural parameters
$\beta_1$ as the solution to $\widehat{\alpha} - \widehat{\Pi}_2 \times \beta_1 = 0$, where
$\widehat{\alpha}$ and $\widehat{\Pi}_2$ are obtained from estimation of the reduced forms
for, respectively, $y_1$ and $y_2$. In the just-identified case, there is a
solution for $\widehat{\beta}_1$. In the overidentified case, we use the MD estimator,
which minimizes a quadratic form in $(\widehat{\alpha} - \widehat{\Pi}_2 \times \beta_1)$.
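

For a single endogenous regressor, the minimization has a simple closed form
when the weighting matrix is held fixed. The following sketch assumes a fixed
symmetric positive-definite weighting matrix $W$ (a symbol introduced here for
illustration; in practice the efficient weight depends on $\beta_1$, which this
sketch ignores). With $\widehat{\alpha}$ and $\widehat{\Pi}_2$ both $K_2 \times 1$ vectors,

$$
\widehat{\beta}_{1,\mathrm{MD}}
 = \arg\min_{\beta_1}\,(\widehat{\alpha} - \widehat{\Pi}_2\beta_1)' W (\widehat{\alpha} - \widehat{\Pi}_2\beta_1)
 = (\widehat{\Pi}_2' W \widehat{\Pi}_2)^{-1}\widehat{\Pi}_2' W \widehat{\alpha}
$$

which follows from the first-order condition
$\widehat{\Pi}_2' W (\widehat{\alpha} - \widehat{\Pi}_2\beta_1) = 0$. In the just-identified case
($K_2 = 1$), this reduces to $\widehat{\beta}_{1,\mathrm{MD}} = \widehat{\alpha}/\widehat{\Pi}_2$
for any choice of $W$.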


More generally, let $\beta$ denote a vector of structural-form parameters we
wish to estimate, let $\pi$ denote a vector of reduced-form parameters, and
suppose that the model implies $h(\beta, \pi) = 0$. Then the MD estimator
minimizes the quadratic form

$$
Q(\beta) = h(\beta, \widehat{\pi})'\,\widehat{A}^{-1}\,h(\beta, \widehat{\pi}) \qquad (7.17)
$$

where $\widehat{A}$ is a consistent estimate of
$\{\partial h(\beta, \pi)/\partial\pi'\}\,\mathrm{Var}(\widehat{\pi})\,\{\partial h(\beta, \pi)/\partial\pi'\}'$.
See, for example, Cameron and Trivedi (2005, 202) or Wooldridge (2010, 545).


Magnusson (2010) proposed MD analogs of the AR, CLR, and LM tests and 
confidence intervals. 


The rivtest command 


The MD method can be applied to both linear and nonlinear models. The
community-contributed rivtest command (Finlay and Magnusson 2009)
enables weak-instrument inference on a single endogenous regressor
following regression using the ivregress, ivreg2, ivprobit, and ivtobit
commands. The option null() allows specification of the null hypothesis;
the default is $\beta = 0$. The option ci yields confidence intervals.


Continuing the current example, we obtain 


. * rivtest: Weak instrument robust inference for non-i.i.d. errors
. qui ivregress 2sls ldrugexp (hi_empunion=ssiratio firmsz) $x2list, vce(robust)

. rivtest, ci null(0)
Estimating confidence sets over grid points

Weak instrument robust tests and confidence sets for linear IV with robust VCE
H0: beta[ldrugexp:hi_empunion] = 0

------------------------------------------------------------------------------
    Test |      Statistic            p-value             95% Confidence Set
---------+--------------------------------------------------------------------
     CLR |  stat(.) =    26.36   Prob > stat = 0.0000   [-1.25987,-.537304]
      AR |  chi2(2) =    29.99   Prob > chi2 = 0.0000   [-1.12439,-.657732]
      LM |  chi2(1) =    25.97   Prob > chi2 = 0.0000   [-1.27492,-.537304]
       J |  chi2(1) =     4.02   Prob > chi2 = 0.0450
    LM-J |          H0 rejected at 5% level             [-1.28998,-.522251]
---------+--------------------------------------------------------------------
    Wald |  chi2(1) =    21.97   Prob > chi2 = 0.0000   [-1.26363,-.518488]
------------------------------------------------------------------------------
Note: Wald test not robust to weak instruments. Confidence sets estimated for 100
points in [-1.63621,-.145915].


The first three tests are heteroskedastic-robust versions of the
weak-instrument CLR, AR, and LM tests. In overidentified models, the AR test
that $\alpha = \Pi_2(\beta_1 - \beta_{10}) = 0$ decomposes into the sum of the LM test and the
overidentification J test; here 29.99 = 25.97 + 4.02. The LM-J test is a
weighted combination of the two tests. The default is to put 80% on the LM
test, so the LM test is performed at level 0.04, and the J test at level 0.01.
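

As a quick check on the reported statistics, the p-values can be recomputed
from the chi-squared statistics with Stata's chi2tail() function. This is only
an illustrative sketch that retypes the rounded statistics from the rivtest
output above, so agreement is up to rounding.

```stata
* p-values implied by the chi-squared statistics reported by rivtest
display "AR p-value: " %6.4f chi2tail(2, 29.99)
display "LM p-value: " %6.4f chi2tail(1, 25.97)
display "J  p-value: " %6.4f chi2tail(1, 4.02)
* Decomposition noted in the text: AR = LM + J
display "LM + J    = " %5.2f (25.97 + 4.02)
```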


Unlike for the condivreg command, there are no disjoint confidence
regions, though the grid search using program defaults is only over the range
(-1.636, -0.146).


The weakiv command 


The community-contributed command weakiv (Finlay, Magnusson, and
Schaffer 2014) is an updated and expanded version of the rivtest
command. It allows for multiple endogenous regressors, covers a wide range
of robust variance-covariance estimators, including two-way clustering and
HAC, and covers some panel commands.


The weakiv command can be implemented as a postestimation command
for linear IV estimation after ivregress, xtivreg, xtabond, ivtobit, and
ivprobit, as well as after the community-contributed commands ivreg2 and
xtivreg2. Alternatively, it can be used as a stand-alone command, with the
user providing the specification of the model.


We apply the weakiv command to the current example following
ivregress with heteroskedastic-robust standard errors.


. * weakiv: Weak instrument robust inference for non-i.i.d. errors
. qui ivregress 2sls ldrugexp (hi_empunion=ssiratio firmsz) $x2list,
> vce(robust)

. weakiv, null(0)
Estimating confidence sets over 100 grid points

Weak instrument robust tests and confidence sets for linear IV
H0: beta[ldrugexp:hi_empunion] = 0

------------------------------------------------------------------------------
 Test |      Statistic        p-value | Conf. level          Conf. Set
------+-------------------------------+---------------------------------------
  CLR |  stat(.) =    23.26    0.0000 | 95%               [-1.25987,-.507198]
    K |  chi2(1) =    22.77    0.0000 | 95%               [-1.25987,-.507198]
    J |  chi2(1) =     4.50    0.0338 | 95%               null set
  K-J |       <n.a.>           0.0000 | 95% (96%,99%)     [-1.28998,-.492144]
   AR |  chi2(2) =    27.28    0.0000 | 95%               [-1.12439,-.627625]
 Wald |  chi2(1) =    21.97    0.0000 | 95%               [-1.26363,-.518488]
------------------------------------------------------------------------------
Confidence sets estimated for 100 points in [-1.63621,-.145915].
Number of obs N = 10068.
Method = lagrange multiplier (LM). Weight on K in K-J test = 0.800.
Tests robust to heteroskedasticity.
Wald statistic in last row is based on ivregress estimation and is not robust to
weak instruments.


The form of the output is similar to that for rivtest. The LM test is labeled
the K test after Kleibergen (2005), who proposed the test. Surprisingly, there
is some numerical difference in the test statistic values (aside from Wald)
compared with rivtest.


The rivtest and weakiv commands implement versions of weak-instrument
tests that can be robustified to non-i.i.d. errors. But they are not
optimal when errors are not i.i.d., and there may be appreciable loss of test
power, especially in models that have many excess instruments. Developing
more powerful tests when model errors are not i.i.d. is a current area of
research.


7.8 Finite sample inference with weak instruments 


The preceding weak-instrument inference relies on an alternative
local-to-zero asymptotic theory. Nonetheless, this is still an asymptotic
theory, and much of the theory is restricted to i.i.d. errors.


Young (Forthcoming) performs a Monte Carlo analysis of a sample of
30 leading applications of 2SLS estimators with a single endogenous
regressor to evaluate the effect of non-i.i.d. errors and highly leveraged
observations. Here the leverage (see section 3.6.3) of the instruments for the
$i$th independent observation is defined to equal the $i$th diagonal entry in
$\widetilde{X}_2(\widetilde{X}_2'\widetilde{X}_2)^{-1}\widetilde{X}_2'$, where $\widetilde{x}_{2i}$ are the
residuals from regressing the instruments $x_{2i}$ on the exogenous regressors
$x_{1i}$; for clustered errors, a corresponding quantity is computed for the
$g$th cluster. In the applications studied by Young (Forthcoming), some
observations have very high leverage, which is not surprising because almost
half had fewer than 80 clusters, or fewer than 80 observations if not clustered.


Young (Forthcoming) finds that using standard robust inference
methods, the combination of non-i.i.d. errors and high leverage leads to
first-stage F tests that greatly overstate the statistical significance of
instruments (less so if the effective F statistic of Montiel Olea and Pflueger
[2013] is used), tests on the coefficient of the endogenous regressor that
greatly overstate statistical significance, and an increase in the
finite-sample bias of 2SLS.


Simulations find that inference can be improved by using the jackknife 
or by performing either a pairs bootstrap or a wild bootstrap (see 
section 12.6). Under strong instrument asymptotics, percentile-t bootstraps 
provide an asymptotic refinement, while percentile bootstraps do not. Under 
weak-instrument asymptotics, however, none of these bootstraps provide an 
asymptotic refinement and in theory break down (as does the standard 
nonbootstrapped test) if in fact the model is unidentified. Here percentile 
bootstraps perform as well as percentile-t bootstraps. 
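

As a rough illustration of the simplest of these resampling methods, the
following sketch applies a pairs bootstrap with a percentile confidence
interval to the 2SLS coefficient of hi_empunion in the running example. The
number of replications, the seed, and the name b_hi are arbitrary assumptions,
and the sketch assumes independent (nonclustered) observations; it is not the
wild bootstrap implemented in section 12.6.5.

```stata
* Sketch: pairs bootstrap of the 2SLS coefficient of hi_empunion
* (reps(), seed(), and the name b_hi are illustrative choices)
bootstrap b_hi=_b[hi_empunion], reps(999) seed(10101) nodots: ///
    ivregress 2sls ldrugexp (hi_empunion = ssiratio firmsz) $x2list
* Percentile confidence interval computed from the bootstrap replicates
estat bootstrap, percentile
```

The /// line continuation requires running the commands from a do-file.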


Section 12.6.5 implements a wild bootstrap of the Wald test for the 2SLS
estimator using the community-contributed boottest command.
Additionally, it provides a percentile-t wild bootstrap of the AR test, a
bootstrap that does provide an asymptotic refinement because it relies only
on strong-instrument asymptotic theory. Young (Forthcoming) did not
consider the AR test, because the articles he studied did not use the AR test.


7.9 Other estimators 


The literature suggests several alternative estimators that are asymptotically
equivalent to 2SLS under strong instrument asymptotics, asymptotically
different under weak-instrument asymptotics, and may have better
finite-sample properties than 2SLS when instruments are weak.


We present some of these estimators, as well as the two-sample 2SLS 
estimator that enables estimation given data on the dependent variable from 
one sample and data on the endogenous regressor from a second sample, 
provided both samples have data on common instruments and exogenous 
regressors. Some additional estimators are available in the community- 
contributed ivreg2 command (Baum, Schaffer, and Stillman 2007). 


7.9.1 LIML estimator 


The leading alternative to 2SLS is the LIML estimator. This estimator is based
on the assumption of joint normality of errors in the structural and first-stage
equations. It is an ML estimator for obvious reasons and is a
limited-information estimator when compared with a full-information approach
that specifies structural equations (rather than first-stage equations) for all
endogenous variables in the model.


The LIML estimator preceded 2SLS but has been less widely used because
it is known to be asymptotically equivalent to 2SLS. Both are special cases of
the k-class estimators. The two estimators differ in finite samples, however,
because of differences in the weights placed on instruments. Research has
found that LIML has some desirable finite-sample properties, especially if the
instruments are not strong. For example, several studies have shown that
LIML has a smaller bias than either 2SLS or GMM.


The LIML estimator is a special case of the so-called k-class estimator,
defined as

$$
\widehat{\beta}_{k\text{-class}} = \{X'(I - kM_Z)X\}^{-1} X'(I - kM_Z)y
$$

where the structural equation is denoted here as $y = X\beta + u$. The LIML
estimator sets $k$ equal to the minimum eigenvalue of
$(Y'M_Z Y)^{-1/2}\,Y'M_{X_1}Y\,(Y'M_Z Y)^{-1/2}$, where
$M_{X_1} = I - X_1(X_1'X_1)^{-1}X_1'$, $M_Z = I - Z(Z'Z)^{-1}Z'$, and the first-stage
equations are $Y = Z\Pi + V$.


The estimator has a default VCE of

$$
\widehat{V}(\widehat{\beta}_{k\text{-class}}) = s^2\{X'(I - kM_Z)X\}^{-1}
$$

where $s^2 = \widehat{u}'\widehat{u}/N$ under the assumption that the errors $u$ and $V$ are
homoskedastic. A leading k-class estimator is the 2SLS estimator, when $k = 1$.
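

Two special cases follow directly from the k-class formula and may help fix
ideas. The projection matrix $P_Z = Z(Z'Z)^{-1}Z'$ is notation introduced here
for this worked step. Setting $k = 0$ gives OLS, and setting $k = 1$ gives 2SLS
because $I - M_Z = P_Z$:

$$
k = 0:\quad \widehat{\beta} = (X'X)^{-1}X'y
\qquad\qquad
k = 1:\quad \widehat{\beta} = (X'P_Z X)^{-1}X'P_Z y
$$

The LIML value of $k$ is data dependent, so LIML can be viewed as a particular
compromise within this family.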


The LIML estimator is obtained by using the ivregress liml command
rather than ivregress 2sls. The vce(robust) option provides a robust
estimate of the VCE for LIML when errors are heteroskedastic. In that case, the
LIML estimator remains asymptotically equivalent to 2SLS. But in finite
samples, studies suggest LIML may be better.


An alternative command that is equivalent to the ivregress liml
command is eregress, introduced in section 7.4.9, which can be used if we
assume that the model is linear and recursive (see section 7.2.3 for a
definition of this concept). For example, in a two-equation model in which
$y_1$ depends upon a second endogenous variable $y_2$ and an exogenous
variable $x$, and an exogenous instrumental variable $z$ is also available, the
command eregress y1 x, endogenous(y2 = x z) produces output identical
to that from executing the ivregress liml command.


The community-contributed mivreg command (Anatolyev and 
Skolkova 2019) considers a number of variations to LIML, including an 
adjustment due to Fuller (1977), that are more robust when there are many 
(possibly weak) instruments. 


7.9.2 Jackknife IV estimator 


The jackknife IV estimator (JIVE) eliminates the correlation between the
first-stage fitted values and the structural-equation error term that is one
source of bias of the traditional 2SLS estimator. The hope is that this may
lead to smaller bias in the estimator.


Let the subscript $(-i)$ denote the leave-one-out operation that drops the
$i$th observation. Denote the structural equation by $y_i = x_i'\beta + u_i$, and
consider first-stage equations for both endogenous and exogenous
regressors, so $x_i' = z_i'\Pi + v_i'$. Then, for each $i = 1, \dots, N$, we estimate
the parameters of the first-stage model with the $i$th observation deleted,
regressing $X_{(-i)}$ on $Z_{(-i)}$, and, given estimate $\widehat{\Pi}_{(-i)}$, construct
the instrument for observation $i$ as $\widetilde{x}_i' = z_i'\widehat{\Pi}_{(-i)}$. Combining
for $i = 1, \dots, N$ yields an instrument matrix denoted by $\widetilde{X}_{(-i)}$ with
the $i$th row $\widetilde{x}_i'$, leading to the JIVE

$$
\widehat{\beta}_{\mathrm{JIVE}} = (\widetilde{X}_{(-i)}'X)^{-1}\widetilde{X}_{(-i)}'y
$$
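

The leave-one-out construction does not require N separate regressions. The
following sketch uses the standard OLS leave-one-out identity, under which the
prediction for observation i from a regression that omits i equals
(xhat_i - h_i*x_i)/(1 - h_i), where h_i is the leverage of observation i. It
assumes the four-instrument specification used in section 7.9.3; the variable
names xhat, h, and xloo are hypothetical, and the result has not been checked
against the community-contributed jive command, whose variants and standard
errors may differ.

```stata
* Sketch: leave-one-out first-stage fitted values without an explicit loop
* (variable names xhat, h, and xloo are hypothetical)
quietly regress hi_empunion ssiratio lowincome multlc firmsz $x2list
predict double xhat, xb
predict double h, hat
* OLS leave-one-out identity for the fitted value of observation i
generate double xloo = (xhat - h*hi_empunion)/(1 - h)
* Just-identified IV with the leave-one-out prediction as the instrument;
* exogenous regressors serve as their own instruments
ivregress 2sls ldrugexp $x2list (hi_empunion = xloo), vce(robust)
```

Exogenous regressors are fit perfectly by the first stage, so their
leave-one-out predictions equal the regressors themselves; only the endogenous
regressor needs the adjustment.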


The community-contributed jive command (Poi 2006) has syntax that is
similar to ivregress. The variants, specified as a command option, are
ujive1 and ujive2 (Angrist, Imbens, and Krueger 1999) and jive1 and
jive2 (Blomquist and Dahlberg 1999). The default is ujive1. The robust
option gives heteroskedasticity-robust standard errors.


There is mixed evidence to date on the benefits of using JIVE; see the 
articles cited above and Davidson and MacKinnon (2006). Caution should 
be exercised in its use. 


7.9.3 Comparison of 2SLS, LIML, JIVE, and GMM 


We compare several estimators for an overidentified model with four
instruments for hi_empunion. For the JIVE estimator, we use the
community-contributed jive command, and for optimal GMM with heteroskedastic
errors, we use both ivregress gmm and the community-contributed ivreg2
command (Baum, Schaffer, and Stillman 2007). We have


. * Variants of IV estimators: 2SLS, LIML, JIVE, GMM_het, GMM-het using IVREG2
. global ivmodel "ldrugexp $x2list
> (hi_empunion = ssiratio lowincome multlc firmsz)"

. qui ivregress 2sls $ivmodel, vce(robust)

. estimates store TWOSLS

. qui ivregress liml $ivmodel, vce(robust)

. estimates store LIML

. qui jive $ivmodel, robust

. estimates store JIVE

. qui ivregress gmm $ivmodel, wmatrix(robust)

. estimates store GMM_het

. qui ivreg2 $ivmodel, gmm2s robust

. estimates store IVREG2

. estimates table TWOSLS LIML JIVE GMM_het IVREG2, b(%7.4f) se

--------------------------------------------------------------------
    Variable |  TWOSLS      LIML       JIVE     GMM_het    IVREG2
-------------+------------------------------------------------------
 hi_empunion |  -0.8174   -0.8612    -0.8550    -0.7809    -0.7809
             |   0.1706    0.1797     0.1783     0.1691     0.1699
      totchr |   0.4493    0.4497     0.4497     0.4489     0.4489
             |   0.0100    0.0101     0.0101     0.0100     0.0100
         age |  -0.0123   -0.0127    -0.0127    -0.0120    -0.0120
             |   0.0027    0.0027     0.0027     0.0026     0.0027
      female |  -0.0145   -0.0180    -0.0175    -0.0087    -0.0087
             |   0.0303    0.0308     0.0307     0.0301     0.0302
      blhisp |  -0.2094   -0.2123    -0.2119    -0.2015    -0.2015
             |   0.0381    0.0385     0.0384     0.0378     0.0380
        linc |   0.0820    0.0855     0.0850     0.0783     0.0783
             |   0.0198    0.0204     0.0203     0.0196     0.0197
       _cons |   6.6982    6.7401     6.7341     6.6695     6.6695
             |   0.2335    0.2403     0.2392     0.2321     0.2330
--------------------------------------------------------------------
                                                        Legend: b/se


Here there is little variation across estimators in estimated coefficients and 
standard errors. As expected, the last two columns give exactly the same 
coefficient estimates, though the standard errors differ slightly. 


7.9.4 Two-sample 2SLS 


Two-sample 2SLS is a variation of 2SLS that can be used when the complete
data necessary for regular 2SLS are unavailable. As usual, we wish to fit the
structural model $y_{1i} = \beta_1 y_{2i} + x_{1i}'\beta_2 + u_i$, where the single
endogenous regressor $y_{2i}$ has first-stage equation
$y_{2i} = x_{1i}'\pi_1 + x_{2i}'\pi_2 + v_i$.


The two-sample IV method enables consistent estimation when one
dataset has data on $y_{1i}$, $x_{1i}$, and $x_{2i}$, but not on the endogenous
regressor $y_{2i}$, and a second dataset has data on $y_{2i}$, $x_{1i}$, and $x_{2i}$,
but not on $y_{1i}$, the dependent variable in the structural equation.


The two-sample 2SLS estimator is obtained as follows. First, OLS
regression in the second sample is used to obtain estimates $\widehat{\pi}_1$ and
$\widehat{\pi}_2$. Second, these estimates are used to obtain prediction
$\widehat{y}_{2i}$ using the first sample data. Third, using the first sample, we
perform OLS regression of $y_{1i}$ on $\widehat{y}_{2i}$ and $x_{1i}$. This method uses
the two-stage interpretation of the 2SLS estimator.
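

The three steps can be sketched as follows. All dataset and variable names
here are hypothetical: sample2.dta is assumed to contain y2, x1, z1, and z2,
and sample1.dta is assumed to contain y1, x1, z1, and z2, with identical
variable names in both files. The sketch produces the point estimate only; the
standard errors reported by the final regress command do not account for the
generated regressor and should not be used for inference.

```stata
* Sketch of two-sample 2SLS; dataset and variable names are hypothetical
use sample2, clear
regress y2 x1 z1 z2            // step 1: first stage in the second sample
use sample1, clear
predict double y2hat, xb       // step 2: prediction in the first sample
regress y1 y2hat x1            // step 3: second stage in the first sample
```

The predict step relies on the first-stage estimates remaining in memory after
the first sample is loaded, which is why the variable names must match across
the two datasets.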


Inference needs to control for the final regression involving a generated 
regressor. Initial research did so for homoskedastic errors. Pacini and 
Windmeijer (2016) extend this to heteroskedastic errors, and the appendix to 
their article provides Stata code for implementation. 


In general, a generated regressor such as $\widehat{y}_{2i}$ leads to attenuation
bias. This becomes substantial for the two-sample 2SLS estimator when the
instruments are weak and there are many instruments. Choi, Gu, and
Shen (2018) provide weak-instrument inference in this case when errors are
homoskedastic or heteroskedastic.


7.10 Three-stage least-squares systems estimation 


The preceding estimators are asymmetric in that they specify a structural
equation for only one variable, rather than for all endogenous variables. For
example, we specified a structural model for ldrugexp but not one for
hi_empunion. A more complete model specifies structural equations for all
endogenous variables.


Consider a multiequation model with $m$ ($\geq 2$) linear structural equations,
each of the form

$$
y_{ji} = Y_{ji}'\beta_{j1} + x_{ji}'\beta_{j2} + u_{ji}, \qquad j = 1, \dots, m
$$

For each of the $m$ endogenous variables $y_j$, we specify a structural equation
with the endogenous regressors $Y_j$, the subset of endogenous variables that
determine $y_j$, and the exogenous regressors $x_j$, the subset of exogenous
variables that determine $y_j$. Model identification is secured by rank and
order conditions, given in standard graduate texts, requiring that some of the
endogenous or exogenous regressors be excluded from each $y_j$ equation.


The preceding IV estimators remain valid in this system. And
specification of the full system can aid in providing instruments because any
exogenous regressors in the system that do not appear in $x_j$ can be used as
instruments for $Y_j$.


Under the strong assumptions of error independence across $i$ and that for
the $i$th observation $\mathrm{Cov}(u_{ji}, u_{ki}) = \sigma_{jk}$, more efficient estimation is
possible by exploiting cross-equation correlation of errors, just as for the
seemingly unrelated regressions model discussed in section 6.8. This estimator
is called the three-stage least-squares (3SLS) estimator. Note that the
assumptions include within-equation homoskedasticity and that the reg3 command
does not provide robust standard errors as an option.


For the example below, we need to provide a structural model for
hi_empunion in addition to the structural model already specified for
ldrugexp. We suppose that hi_empunion depends on the single instrument
ssiratio, on ldrugexp, and on totchr, female, and blhisp. This means that we
are (arbitrarily) excluding two regressors, age and linc. This ensures that the
hi_empunion equation is overidentified. If instead it was just identified, then
the system would be just identified because the ldrugexp equation is just
identified, and 3SLS would reduce to equation-by-equation 2SLS.


The syntax for the reg3 command is similar to that for sureg, with each 
equation specified in a separate set of parentheses. The endogenous variables 
in the system are simply determined because they are given as the first 
variable in each set of parentheses. We have 


. * 35LS estimation requires errors to be homoskedastic 
. reg3 (ldrugexp hi_empunion totchr age female blhisp linc) 
> (hi_empunion ldrugexp totchr female blhisp ssiratio) 


Three-stage least-squares regression

Equation            Obs   Params       RMSE   "R-squared"      chi2   P>chi2
ldrugexp         10,068        6    .300469       0.0877    1955.02   0.0000
hi_empunion      10,068        5     1.6448     -10.4552      68.94   0.0000

               Coefficient   Std. err.       z    P>|z|    [95% conf. interval]
ldrugexp
  hi_empunion    -.7890506    .1874758    -4.21   0.000   -1.156496   -.4216048
       totchr     .4491568     .010287    43.66   0.000    .4289948    .4693189
          age    -.0131018     .002546    -5.15   0.000   -.0180919   -.0081118
       female    -.0126266    .0304786    -0.41   0.679   -.0723634    .0471103
       blhisp    -.2114008    .0378127    -5.59   0.000   -.2855124   -.1372892
         linc     .0724828    .0179903     4.03   0.000    .0372225    .1077431
        _cons       6.7731    .2221739    30.49   0.000    6.337647    7.208553

hi_empunion
     ldrugexp     1.290877    .3169658     4.07   0.000    .6696355    1.912119
       totchr    -.5529128    .1388413    -3.98   0.000   -.8250366   -.2807889
       female    -.1292064    .0354309    -3.65   0.000   -.1986498    -.059763
       blhisp      .150607    .0683069     2.20   0.027    .0167279    .2844861
     ssiratio    -.4440772    .0594459    -7.47   0.000    -.560589   -.3275654
        _cons    -6.668402     1.78005    -3.75   0.000   -10.15724   -3.179569

Endogenous variables: ldrugexp hi_empunion
Exogenous variables:  totchr age female blhisp linc ssiratio
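The claim that 3SLS reduces to equation-by-equation 2SLS in a just-identified
system can be checked on simulated data. The following is a minimal sketch
under our own assumptions, not the example used in the text: the variables
y1, y2, x1, and x2, the coefficient values, and the sample size are all
hypothetical. Each structural equation excludes exactly one exogenous
regressor, so both equations are just identified.

```stata
* Hypothetical two-equation system (all names and values are illustrative):
*   y1 = 1 + 0.5*y2 + x1 + u1      (excludes x2: just identified)
*   y2 = 2 - 0.5*y1 + x2 + u2      (excludes x1: just identified)
clear
set seed 10101
set obs 1000
generate x1 = rnormal()
generate x2 = rnormal()
generate u1 = rnormal()
generate u2 = 0.5*u1 + rnormal()          // errors correlated across equations
* Reduced form for y1, obtained by solving the two structural equations
generate y1 = (2 + x1 + 0.5*x2 + u1 + 0.5*u2)/1.25
generate y2 = 2 - 0.5*y1 + x2 + u2

* 3SLS on the full system
reg3 (y1 y2 x1) (y2 y1 x2)
matrix b_3sls = e(b)

* Equation-by-equation 2SLS using the same command
reg3 (y1 y2 x1) (y2 y1 x2), 2sls
matrix b_2sls = e(b)

* Largest relative difference between the two coefficient vectors
display "max relative difference = " mreldif(b_3sls, b_2sls)
```

If one of the equations is instead made overidentified, for example, by adding
a further exogenous variable that appears in only one equation, the two sets of
coefficient estimates will generally differ.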


7.11 Additional resources 


The ivregress command is the key command for linear IV models. The
community-contributed ivreg2 command (Baum, Schaffer, and
Stillman 2007) has many additional features, including two-way
cluster–robust standard errors and additional tests of endogeneity. The
approach generalizes to nonlinear 2SLS and GMM; an example is the
ivpoisson gmm command.


A major concern is that asymptotic theory performs poorly when
instruments are weak or there are many instruments, an area that is still one
of active research. One approach uses alternative inferential methods, the
simplest of which is based on the Anderson–Rubin test. For a just-identified
single endogenous regressor, one need only use the AR test. Otherwise, a
range of methods, including the AR test, have been proposed. The
community-contributed condivreg command (Mikusheva and Poi 2006)
enables inference with weak instruments assuming i.i.d. errors. The
community-contributed command weakiv (Finlay and Magnusson 2009)
extends the scope of the condivreg command by allowing non-i.i.d. errors.
A second approach uses alternative estimators to 2SLS or LIML. The
community-contributed jive command (Poi 2006) performs JIVE
estimation. The community-contributed mivreg command (Anatolyev and
Skolkova 2019) considers a number of variations to LIML. For many
instruments, machine-learning regularization methods such as the lasso can
be used; see section 28.8.


For many nonlinear models with endogenous regressors, attention is 
restricted to ML estimation of models with a recursive structure and normal 
errors. The extended regression model commands presented in sections 23.7 
and 25.3 fit these models, and special cases of these commands overlap 
with, for example, the ivprobit and ivtobit commands. The related 
endogenous treatment commands etregress, eteffects, and etpoisson
presented in section 25.4 explicitly control for a binary endogenous 
regressor. 


7.12 Exercises 


1. Estimate by 2SLS the same regression model as in section 7.4.4, with the
instruments multlc and firmsz. Compare the 2SLS estimates with OLS
estimates. Perform a test of endogeneity of hi_empunion. Perform a test of
overidentification. State what you conclude. Throughout this exercise,
perform inference that is robust to heteroskedasticity.

2. Repeat exercise 1 using optimal GMM.

3. Use the model and instruments of exercise 1. Compare the following
estimators: 2SLS, LIML, and optimal GMM given heteroskedastic errors. For the
last model, estimate the parameters by using the community-contributed
ivreg2 command in addition to ivregress.

4. Use the model of exercise 1. Compare 2SLS estimates as the instruments
ssiratio, lowincome, multlc, and firmsz are progressively added.

5. Use the model and instruments of exercise 1. Perform appropriate
diagnostics and tests for weak instruments using the 2SLS estimator. State
what you conclude. Throughout this exercise, perform inference assuming
errors are i.i.d.

6. Use the model and instruments of exercise 1. Use the
community-contributed condivreg command to perform inference for the 2SLS
estimator. Compare the results with those using conventional asymptotics.

7. Use the model and instruments of exercise 1. Use the
community-contributed jive command, and compare estimates and standard
errors from the four different variants of JIVE and from optimal GMM.
Throughout this exercise, perform inference that is robust to
heteroskedasticity.

8. Fit the 3SLS model of section 7.10, and compare the 3SLS coefficient estimates
and standard errors in the ldrugexp equation with those from 2SLS estimation
(with default standard errors).

9. This question considers the same earnings–schooling dataset as that
analyzed in Cameron and Trivedi (2005, 111). The data are in
mus207klingdata.dta. The describe command provides descriptions of the
regressors. There are three endogenous regressors (years of schooling, years
of work experience, and experience-squared) and three instruments (a
college proximity indicator, age, and age-squared). Interest lies in the
coefficient of schooling. Perform appropriate diagnostics and tests for weak
instruments for the following model. State what you conclude. The
following commands yield the IV estimator:

. use mus207klingdata.dta, clear
. global x2list black south76 smsa76 reg2-reg9 smsa66
>     sinmom14 nodaded nomomed daded momed famed1-famed8
. ivregress 2sls wage76 (grade76 exp76 expsq76 = col4 age76 agesq76) $x2list,
>     vce(robust) perfect
. estat firststage

10. Use the same dataset as the previous question. Treat only grade76 as
endogenous, let exp76 and expsq76 be exogenous, and use col4 as the only
instrument. Perform appropriate diagnostics and tests for a weak instrument,
and state what you conclude. Then, use the community-contributed
condivreg command to perform inference, and compare the results with
those using conventional asymptotics.

11. When an endogenous variable enters the regression nonlinearly, the obvious
IV estimator is inconsistent and a modification is needed. Specifically,
suppose $y_1 = \beta y_2^2 + u$ and the first-stage equation for $y_2$ is
$y_2 = \pi_2 z + v$, where the zero-mean errors $u$ and $v$ are correlated.
Here the endogenous regressor appears in the structural equation as $y_2^2$
rather than $y_2$. The IV estimator is
$\hat{\beta}_{IV} = (\sum_{i=1}^N z_i y_{2i}^2)^{-1} \sum_{i=1}^N z_i y_{1i}$.
This can be implemented by a regular IV regression of $y_1$ on $y_2^2$ with
the instrument $z$: regress $y_2^2$ on $z$ and then regress $y_1$ on the
first-stage prediction $\widehat{y_2^2}$. If instead we regress $y_2$ on $z$
at the first stage, giving $\hat{y}_2$, and then regress $y_1$ on
$(\hat{y}_2)^2$, an inconsistent estimate is obtained. Generate a simulation
sample to demonstrate these points. Consider whether this example can be
generalized to other nonlinear models where the nonlinearity is in regressors
only, so that $y_1 = \mathbf{g}(y_2)'\boldsymbol{\beta} + u$, where
$\mathbf{g}(y_2)$ is a nonlinear function of $y_2$.

12. This exercise is based on artificially generated data. Set sample size to
1,000. Set seed to 10101. Generate a sample using the following DGP:
$u \sim N(0,1)$; $e \sim 0.6u + N(0,1)$; $z \sim \mathrm{uniform}(0,1)$;
$y_2 = z + 0.5z^2 + 0.0625z^3 + e$; $x \sim N(0,9)$;
$y_1 = y_2 + 0.5y_2 + e + u$.
Estimate the $y_1$ equation by OLS. Theoretically, the estimates of the equation
are inconsistent. Compare all the slope coefficients with their true values,
and discuss whether the direction of the bias is as you might expect. OLS is
inconsistent in this case, yet some coefficients are estimated close to their
true values; explain why. Run the same regression again with the intercept
constrained to zero, and comment on any differences that you might observe.

13. Continuing the previous exercise, estimate three variants of the $y_1$
equation by 2SLS using as instruments i) $x$, $z$, and $z^2$; ii) $x$, $z$,
$z^2$, and $z^3$; iii) $x$ and $\hat{y}_2$, where $\hat{y}_2$ is the fitted
value of $y_2$ from the reduced form of $y_2$. Are the instruments chosen in
each case valid and relevant? Which of the three estimates would you prefer
on theoretical grounds? Explain your answer. With reference to the 2SLS
estimates used earlier, apply the test of overidentification where
appropriate, and interpret your results.

14. This question continues with the model presented in question 12 above.
Modify the specification of the $y_1$ equation such that the equation errors are
heteroskedastic, with skedasticity being a quadratic function of $x$; that is,
Var($u$) $= h(x) \times N(0,1)$. Reestimate the $y_1$ equation by 2SLS and
two-step GMM using as instruments $x$, $z$, $z^2$, $z^3$. In both cases, apply
the test of overidentifying restrictions. Compare the results, and comment on
the outcome of the test.


Chapter 8 
Linear panel-data models: Basics 


8.1 Introduction 


Panel data or longitudinal data are repeated measurements at different 
points in time on the same individual unit, such as person, firm, state, or 
country. Regressions can then capture both variation over units, similar to 
regression on cross-sectional data, and variation over time. 


Panel-data methods are more complicated than cross-sectional–data
methods. The standard errors of panel-data estimators need to be adjusted
because each additional time period of data in general is not independent of
previous periods. Panel data require the use of much richer models and
estimation methods. Also, different areas of applied statistics use different
methods for essentially the same data. The Stata xt commands, where xt is
an acronym for cross-sectional time series, cover many of these methods.


We focus on methods for a short panel, meaning data on many 
individual units and few time periods. Examples include longitudinal 
surveys of many individuals and panel datasets on many firms. And we 
emphasize microeconometrics methods that attempt to estimate key 
marginal effects that can be given a causative interpretation. 


The essential panel-data methods are given in this chapter, most notably,
the important distinction between fixed-effects (FE) and random-effects (RE)
models. The panel methods overlap considerably with those presented in
sections 6.4–6.7.


Chapter 9 presents many other panel-data methods for the linear model,
including those for instrumental-variables (IV) estimation, estimation when
lagged dependent variables are regressors, estimation when panels are long
rather than short, and estimation of mixed models with slope parameters
that vary across individuals. Nonlinear panel models are presented in
chapter 22.


8.2 Panel-data methods overview 


There are many types of panel data and goals of panel-data analysis, leading 
to different models and estimators for panel data. We provide an overview in 
this section, with subsequent sections illustrating many of the various 
models and estimation methods. 


8.2.1 Some basic considerations 


First, panel data are usually observed at regular time intervals, as is the case 
for most time-series data. A common exception is growth curve analysis 
where, for example, children are observed at several irregularly spaced 
intervals in time and a measure such as height or IQ is regressed on a 
polynomial in age. 


Second, panel data can be balanced, meaning all individual units are
observed in all time periods ($T_i = T$ for all $i$), or unbalanced
($T_i \neq T$ for some $i$). Most xt commands can be applied to both balanced
and unbalanced data. In either case, however, estimator consistency requires
that the sample-selection process not lead to errors being correlated with
regressors. Loosely speaking, the missingness is for random reasons rather
than systematic reasons; see section 19.10.


Third, the dataset may be a short panel (few time periods and many 
individuals); a long panel (many time periods and few individuals); or both 
(many time periods and many individuals). This distinction has 
consequences for both estimation and inference. 


Fourth, model errors are very likely correlated. Microeconometrics 
methods emphasize correlation (or clustering) over time for a given 
individual, with independence over individual units. For some panel 
datasets, such as country panels, there additionally may be correlation across 
individuals. Regardless of the assumptions made, some correction to default 
ordinary least-squares (OLS) standard errors is usually necessary, and 
efficiency gains using generalized least squares (GLS) may be possible. 


Fifth, regression coefficient identification for some estimators can
depend on regressor type. Some regressors, such as gender, may be time
invariant with $x_{it} = x_i$ for all $t$. Some regressors, such as an overall
time trend, may be individual invariant with $x_{it} = x_t$ for all $i$. And
some may vary over both time and individuals.


Sixth, some or all model coefficients may vary across individuals or over 
time. 


Seventh, the microeconometrics literature emphasizes the FE model. This 
model, explained in the next section, permits regressors to be endogenous 
provided that they are correlated only with a time-invariant component of 
the error. Most other branches of applied statistics instead emphasize the RE 
model, which assumes that regressors are completely exogenous. 


Finally, panel data permit estimation of dynamic models where lagged 
dependent variables may be regressors. Most panel-data analyses use models 
without this complication. 


In this chapter, we focus on short panels ($T$ fixed and $N \to \infty$) with
model errors assumed to be independent over individuals. We consider linear
models with and without fixed effects: static models in this chapter and
dynamic models in the subsequent chapter. Long panels are treated
separately in section 9.5.


Most applications in this chapter use balanced panels. Unbalanced panels
arise from missing data, often because of panel attrition, which simply
means that respondents drop out of the panel altogether or have gaps in their
participation in the survey. Most commands can also be applied to
unbalanced panels. However, panel attrition can lead to inconsistent
parameter estimates if it is not random after controlling for observable
variables. Methods that correct for any bias that arises because of panel
attrition are presented in section 19.11.


8.2.2 Some basic panel models 


There are several different linear models for panel data. 


The fundamental distinction is that between FE and RE models. The term 
“fixed effects” is misleading because in both types of models, individual- 
level effects are random. FE models have the added complication that 
regressors may be correlated with the individual-level effects so that 
consistent estimation of regression parameters requires eliminating or 
controlling for the fixed effects. 


Individual-effects model 


The individual-specific effects model for the scalar dependent variable
$y_{it}$ specifies that

$$ y_{it} = \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}, \quad t = 1,\dots,T_i, \quad i = 1,\dots,N \tag{8.1} $$

where $\mathbf{x}_{it}$ are regressors, $\alpha_i$ are random
individual-specific effects, and $\varepsilon_{it}$ is an idiosyncratic error.
For simplicity, we mostly present results with $T_i = T$.


Two quite different models for the $\alpha_i$ are the FE and RE models.


Fixed-effects model 


In the FE model, the $\alpha_i$ in (8.1) are permitted to be correlated with the
regressors $\mathbf{x}_{it}$. This allows a limited form of endogeneity. We
view the error in (8.1) as $u_{it} = \alpha_i + \varepsilon_{it}$ and permit
$\mathbf{x}_{it}$ to be correlated with the time-invariant component of the
error ($\alpha_i$), while continuing to assume that $\mathbf{x}_{it}$ is
uncorrelated with the idiosyncratic error $\varepsilon_{it}$. For example, we
assume that if regressors in an earnings regression are correlated with
unobserved ability, they are correlated only with the time-invariant
component of ability, captured by $\alpha_i$.


One possible estimation method is to jointly estimate
$\alpha_1,\dots,\alpha_N$ and $\boldsymbol{\beta}$. But for a short panel,
asymptotic theory relies on $N \to \infty$, and here as $N \to \infty$ so too
does the number of fixed effects to estimate. This problem is called the
incidental-parameters problem. Interest lies in estimating
$\boldsymbol{\beta}$, but first we need to control for the nuisance or
incidental parameters, $\alpha_i$.


Instead, we can still consistently estimate $\boldsymbol{\beta}$, for
time-varying regressors, by appropriate differencing transformations detailed
in sections 8.5 and 8.9 that eliminate $\alpha_i$.
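As a brief sketch of why differencing works, consider two standard
transformations of (8.1) with $T_i = T$: subtracting the individual mean and
subtracting the previous period's value. The notation
$\bar{y}_i = T^{-1}\sum_t y_{it}$ (and similarly for $\bar{\mathbf{x}}_i$ and
$\bar{\varepsilon}_i$) is introduced here for illustration. In both cases,
$\alpha_i$ cancels because it does not vary over $t$:

```latex
\begin{align*}
\bar{y}_i &= \alpha_i + \bar{\mathbf{x}}_i'\boldsymbol{\beta} + \bar{\varepsilon}_i \\
y_{it} - \bar{y}_i &= (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)'\boldsymbol{\beta}
  + (\varepsilon_{it} - \bar{\varepsilon}_i) \\
y_{it} - y_{i,t-1} &= (\mathbf{x}_{it} - \mathbf{x}_{i,t-1})'\boldsymbol{\beta}
  + (\varepsilon_{it} - \varepsilon_{i,t-1})
\end{align*}
```

The same algebra shows why only coefficients of time-varying regressors are
identified: for a time-invariant regressor,
$x_{it} - \bar{x}_i = x_{it} - x_{i,t-1} = 0$, so the regressor drops out of
the transformed model.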


The FE model implies that
$E(y_{it}|\alpha_i, \mathbf{x}_{it}) = \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}$,
assuming $E(\varepsilon_{it}|\alpha_i, \mathbf{x}_{it}) = 0$, so
$\beta_j = \partial E(y_{it}|\alpha_i, \mathbf{x}_{it})/\partial x_{j,it}$. The
attraction of the FE model is that we can obtain a consistent estimate of the
marginal effect of the $j$th regressor on $E(y_{it}|\alpha_i, \mathbf{x}_{it})$,
provided $x_{j,it}$ is time varying, even if the regressors are endogenous
(albeit, a limited form of endogeneity).


At the same time, knowledge of $\boldsymbol{\beta}$ does not give complete
information on the process generating $y_{it}$. In particular for prediction,
we need an estimate of
$E(y_{it}|\mathbf{x}_{it}) = E(\alpha_i|\mathbf{x}_{it}) + \mathbf{x}_{it}'\boldsymbol{\beta}$,
and $E(\alpha_i|\mathbf{x}_{it})$ cannot be consistently estimated in short
panels.


In nonlinear FE models, these results need to be tempered. We cannot
always eliminate $\alpha_i$, which is shown in section 22.2. And even if we
can, consistent estimation of $\boldsymbol{\beta}$ may still not lead to a
consistent estimate of the marginal effect
$\partial E(y_{it}|\alpha_i, \mathbf{x}_{it})/\partial x_{j,it}$.


Random-effects model 


In the RE model, it is assumed that $\alpha_i$ in (8.1) is purely random, a
stronger assumption implying that $\alpha_i$ is uncorrelated with the
regressors.


Estimation is then by a feasible generalized least-squares (FGLS)
estimator, given in section 8.6. Advantages of the RE model are that it yields
estimates of all coefficients and hence marginal effects, even those of
time-invariant regressors, and that $E(y_{it}|\mathbf{x}_{it})$ can be
estimated. The big disadvantage is that these estimates are inconsistent if
the FE model is appropriate.
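The practical consequence of this distinction can be seen in a small
simulation. The following is a minimal sketch under our own assumptions (the
variable names, the true slope of 2, and the design with N = 500 and T = 5 are
hypothetical): the regressor is constructed to be correlated with the
individual effect, so the FE model is appropriate and the RE estimator should
be inconsistent.

```stata
* Simulated short panel in which x is correlated with the individual effect
clear
set seed 10101
set obs 500                               // N = 500 individuals
generate id = _n
generate alpha = rnormal()                // individual-specific effect
expand 5                                  // T = 5 periods per individual
bysort id: generate t = _n
generate x = alpha + rnormal()            // x is correlated with alpha
generate y = 1 + 2*x + alpha + rnormal()  // true slope coefficient is 2
xtset id t

* FE estimator: eliminates alpha before estimation
xtreg y x, fe vce(robust)
scalar b_fe = _b[x]

* RE estimator: assumes alpha is uncorrelated with x
xtreg y x, re vce(robust)
scalar b_re = _b[x]

display "FE slope = " b_fe "   RE slope = " b_re
```

In this design, the FE estimate should be close to 2, whereas the RE estimate
should be noticeably larger because it attributes part of the effect of alpha
to x.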


Correlated RE model 


A variation of the RE model controls for fixed effects by adding individual- 
specific means of time-varying regressors as additional regressors; see 
section 8.7.4. 


Pooled model or population-averaged model 


Pooled models assume that regressors are exogenous and simply write the
error as $u_{it}$ rather than using the decomposition
$\alpha_i + \varepsilon_{it}$. Then,

$$ y_{it} = \alpha + \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it} \tag{8.2} $$

Note that $\mathbf{x}_{it}$ here does not include a constant, whereas in
cross-sectional chapters, $\mathbf{x}_i$ additionally included a constant term.


OLS estimation of the parameters of this model is straightforward, but
inference needs to control for likely correlation of the error $u_{it}$ over
time for a given individual (within correlation) and possible correlation over
individuals (between correlation). FGLS estimation of (8.2) given an assumed
model for the within correlation of $u_{it}$ is presented in section 8.4. In the
statistics literature, this is called a population-averaged (PA) model. Like RE
estimators, consistency of the estimators requires that regressors be
uncorrelated with $u_{it}$.


Two-way effects model 


A standard extension of the individual-effects model is a two-way effects
model that allows the intercept to vary over individuals and over time:

$$ y_{it} = \alpha_i + \gamma_t + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it} \tag{8.3} $$

For short panels, it is common to let the time effects $\gamma_t$ be fixed
effects. Then (8.3) reduces to (8.1), if the regressors in (8.1) include a set
of time dummies (with one time dummy dropped to avoid the dummy-variable trap).
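In Stata, the time dummies can be added with factor-variable notation. The
following is a minimal sketch on simulated data; the variable names, the linear
form of the time effect, and the true slope of 2 are our own hypothetical
choices. The regressor trends over time, so omitting the time effects should
bias the FE estimate.

```stata
* Simulated two-way effects model: y = alpha_i + gamma_t + 2*x + e
clear
set seed 10101
set obs 500                               // N = 500 individuals
generate id = _n
generate alpha = rnormal()                // individual effect
expand 5                                  // T = 5 periods
bysort id: generate t = _n
generate gamma = 0.3*t                    // time effect common to all individuals
generate x = alpha + 0.5*t + rnormal()    // x correlated with alpha and with time
generate y = 1 + 2*x + alpha + gamma + rnormal()
xtset id t

* One-way FE: controls for alpha_i only
xtreg y x, fe vce(robust)
scalar b_oneway = _b[x]

* Two-way effects: i.t adds time dummies, with one dropped automatically
xtreg y x i.t, fe vce(robust)
scalar b_twoway = _b[x]

display "one-way = " b_oneway "   two-way = " b_twoway
```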


Mixed linear models 


If the RE model is appropriate, richer models can permit slope parameters to
also vary over individuals or time. The mixed linear model (see section 6.7)
is a hierarchical linear model that is quite flexible and permits random
parameter variation to depend on observable variables. The
random-coefficients model is a special case that specifies

$$ y_{it} = \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}_i + \varepsilon_{it} $$

where $(\alpha_i \; \boldsymbol{\beta}_i')' \sim (\boldsymbol{\beta}, \boldsymbol{\Sigma})$.
For a long panel with few individuals, $\alpha_i$ and $\boldsymbol{\beta}_i$
can instead be parameters that can be estimated by running separate
regressions for each individual.


8.2.3 Cluster–robust inference


Various estimators for the preceding models are given in subsequent
sections. These estimators are usually based on the assumption that the
idiosyncratic error $\varepsilon_{it} \sim (0, \sigma_\varepsilon^2)$. This
assumption is often not satisfied in panel applications. Then many panel
estimators still retain consistency, provided that $\varepsilon_{it}$ are
independent over $i$, but reported standard errors are incorrect.


For short panels, we can obtain cluster–robust standard errors under the
weaker assumptions that errors are independent across individuals and that
$N \to \infty$. Specifically, $E(\varepsilon_{it}\varepsilon_{js}) = 0$ for
$i \neq j$, $E(\varepsilon_{it}\varepsilon_{is})$ is unrestricted, and
$\varepsilon_{it}$ may be heteroskedastic.


Where applicable, we use cluster–robust standard errors rather than the
Stata defaults. In particular, while the inclusion of random or fixed effects
$\alpha_i$ can partly account for within-individual error correlation over
time, in practice it is often insufficient, and failure to additionally
cluster on the individual typically leads to underestimation of standard
errors and overstatement of estimator precision.


For most, but not all, xt commands, the vce(robust) option is available.
When this option is available, it produces cluster–robust standard errors,
rather than heteroskedastic-robust standard errors, with clustering on the
individual unit that is defined in the xtset command. In some applications,
one should cluster at a higher level than the individual. For example, with
individual-level panel data and the key regressor a policy variable that varies
at the region level, such as state, one should use the vce(cluster
clustvar) option of the xt command, if it is available. Care is needed in
using cluster–robust standard errors for panel estimators because in some
cases, including some xt commands that have a vce(robust) option,
consistent estimation may require that there be no within-individual error
correlation. In some cases the vce(bootstrap) or vce(jackknife) option
can be used to obtain cluster–robust standard errors because, for xt
commands, these usually resample over clusters. But again, within-cluster
correlation may lead to the more serious problem of inconsistent parameter
estimation.


In a seminal article, Bertrand, Duflo, and Mullainathan (2004) analyzed
state-year panel data with a state-level policy variable. They had two major
findings. First, even if individual-level data are available, clustering should
be on state, rather than on state-year pair. Second, when the number of states
is few and clustering is on states, then the standard cluster–robust Wald test
overrejects substantially.


Sections 6.2.4 and 6.4 provide considerable detail on cluster–robust
inference for OLS, and Cameron and Miller (2015), MacKinnon and
Webb (2020), and MacKinnon, Nielsen, and Webb (Forthcoming) provide
surveys. The problem of inference with few clusters is detailed in
section 6.4.6. A general method for inference with few clusters applies the
percentile-t method to a particular bootstrap, the wild cluster bootstrap; see
section 12.6. The community-contributed boottest command (Roodman
et al. 2019) presented in section 12.6.2 implements this bootstrap following a
wide range of estimation commands, including the commands regress,
areg, and xtreg, fe used in linear model analysis of panel data.


8.2.4 The xtreg command 


The key command for estimation of the parameters of a linear panel-data
model is the xtreg command, also used for clustered cross-sectional data;
see chapter 6. The command syntax is


xtreg depvar [indepvars] [if] [in] [weight] [, options]


The individual identifier must first be declared with the xtset command.


The key model options are PA model (pa), FE model (fe), RE model (re
and mle), and between-effects model (be). The individual models are
discussed in detail in subsequent sections. The weight modifier is available
only for fe, mle, and pa.


The vce(robust) option provides cluster–robust estimates of the
standard errors for all models but be. Stata labels the estimated VCE as
simply “Robust” because the use of xtreg implies that we are in a clustered
setting.
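That the "Robust" label refers to clustering on the individual can be verified
directly. The following is a minimal sketch on simulated data, with
hypothetical variable names and an AR(1) idiosyncratic error of our own
choosing; it compares the standard error from vce(robust) with that from
explicitly requesting vce(cluster id).

```stata
* Simulated panel with serially correlated idiosyncratic errors
clear
set seed 10101
set obs 300                               // N = 300 individuals
generate id = _n
generate alpha = rnormal()
expand 6                                  // T = 6 periods
bysort id (t0): generate t = _n if 0      // placeholder overwritten below
drop t
bysort id: generate t = _n
* AR(1) error within individual: e_t = 0.8*e_{t-1} + innovation
bysort id (t): generate e = rnormal() if _n == 1
by id: replace e = 0.8*e[_n-1] + rnormal() if _n > 1
* Serially correlated regressor, also correlated with alpha
bysort id (t): generate x = alpha + rnormal() if _n == 1
by id: replace x = 0.5*x[_n-1] + 0.5*alpha + rnormal() if _n > 1
generate y = 1 + 2*x + alpha + e
xtset id t

xtreg y x, fe                             // default standard errors
scalar se_default = _se[x]
xtreg y x, fe vce(robust)                 // labeled "Robust" in the output
scalar se_robust = _se[x]
xtreg y x, fe vce(cluster id)             // explicit clustering on individual
scalar se_cluster = _se[x]

display "default = " se_default "  robust = " se_robust "  cluster = " se_cluster
```

The last two standard errors should coincide, whereas the default standard
error will generally differ when errors are serially correlated.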


8.2.5 Stata linear panel-data commands 


Table 8.1 summarizes xt commands for viewing panel data and estimating 
the parameters of linear panel-data models. 


Table 8.1. Summary of xt commands for linear panel models 


Data summary                  xtset; xtdescribe; xtsum; xtdata; xtline; xttab;
                              xttrans
Pooled OLS                    regress
Pooled FGLS                   xtgee, family(gaussian); xtgls; xtpcse
RE                            xtreg, re; xtregar, re
FE                            xtreg, fe; xtregar, fe
Random slopes                 mixed; xtrc
First-differences             regress (with differenced data)
Differences in differences    xtdidregress
Static IV                     xtivreg; xthtaylor
Dynamic IV                    xtabond; xtdpdsys; xtdpd
Unit-root tests               xtunitroot
Cointegration tests           xtcointtest


The core methods for short panels, notably the data summary commands
and the xtreg and xtgee commands, are presented in this chapter, with more
specialized commands for panel IV estimation and for long panels presented
in chapter 9. Readers with long panels should look at section 9.5 (xtgls,
xtpcse, xtregar), and data input may require first reading section 8.10.
Some additional panel commands for censored regression and for nonlinear
models are given in table 13.1.


8.3 Summary of panel data 


In this section, we present various ways to summarize and view panel data
and estimate a pooled OLS regression. The dataset used is a panel on log
hourly wages and other variables for 595 people over the seven years
1976–1982.


8.3.1 Data description and summary statistics


The data, from Baltagi and Khanti-Akom (1990), were drawn from the Panel
Study of Income Dynamics and are a corrected version of data originally
used by Cornwell and Rupert (1988).


The dataset has the following data: 


. * Read in dataset and describe
. qui use mus208psid

. describe

Contains data from mus208psid.dta
 Observations:         4,165                  A.C.Cameron & P.K.Trivedi
                                                (2022): Microeconometrics Using
                                                Stata, 2e
    Variables:            22                  1 Sep 2020 16:38
                                              (_dta has notes)

Variable      Storage   Display    Value
    name         type    format    label      Variable label
exp             float   %9.0g                 Years of full-time work experience
wks             float   %9.0g                 Weeks worked
occ             float   %9.0g                 Occupation; occ==1 if in a
                                                blue-collar occupation
ind             float   %9.0g                 Industry; ind==1 if working in a
                                                manufacturing industry
south           float   %9.0g                 Residence; south==1 if in the
                                                South area
smsa            float   %9.0g                 smsa==1 if in the standard
                                                metropolitan statistical area
ms              float   %9.0g                 Marital status
fem             float   %9.0g                 Female or male
union           float   %9.0g                 If wage set be a union contract
ed              float   %9.0g                 Years of education
blk             float   %9.0g                 Black
lwage           float   %9.0g                 log wage
id              float   %9.0g                 Person ID
t               float   %9.0g                 Year index
tdum1           byte    %8.0g                 t==1.0000
tdum2           byte    %8.0g                 t==2.0000
tdum3           byte    %8.0g                 t==3.0000
tdum4           byte    %8.0g                 t==4.0000
tdum5           byte    %8.0g                 t==5.0000
tdum6           byte    %8.0g                 t==6.0000
tdum7           byte    %8.0g                 t==7.0000
exp2            float   %9.0g                 Years of experience squared

Sorted by: id  t


There are 4,165 individual–year pair observations. The variable labels
describe the variables fairly clearly, though note that lwage is the log of
hourly wage in cents, the indicator fem is 1 if female, id is the individual
identifier, t is the year, and exp2 is the square of exp.


Descriptive statistics can be obtained by using the command summarize: 


. * Summary of dataset
. summarize

    Variable         Obs        Mean    Std. dev.       Min        Max
         exp       4,165    19.85378    10.96637          1         51
         wks       4,165    46.81152    5.129098          5         52
         occ       4,165    .5111645    .4999354          0          1
         ind       4,165    .3954382    .4890033          0          1
       south       4,165    .2902761    .4539442          0          1
        smsa       4,165    .6537815     .475821          0          1
          ms       4,165    .8144058    .3888256          0          1
         fem       4,165     .112605    .3161473          0          1
       union       4,165    .3639856    .4812023          0          1
          ed       4,165    12.84538    2.787995          4         17
         blk       4,165    .0722689    .2589637          0          1
       lwage       4,165    6.676346    .4615122    4.60517      8.537
          id       4,165         298    171.7821          1        595
           t       4,165           4     2.00024          1          7
       tdum1       4,165    .1428571    .3499691          0          1
       tdum2       4,165    .1428571    .3499691          0          1
       tdum3       4,165    .1428571    .3499691          0          1
       tdum4       4,165    .1428571    .3499691          0          1
       tdum5       4,165    .1428571    .3499691          0          1
       tdum6       4,165    .1428571    .3499691          0          1
       tdum7       4,165    .1428571    .3499691          0          1
        exp2       4,165     514.405    496.9962          1       2601


The variables take on values that are within the expected ranges, and there 
are no missing values. Both men and women are included, though from the 
mean of fem, only 11% of the sample is female. Wages data are nonmissing 
in all years, and weeks worked are always positive, so the sample is restricted 
to individuals who work in all seven years. 


8.3.2 Panel-data organization 


The xt commands require that panel data be organized in so-called long
form, with each observation a distinct individual–time pair, here an
individual–year pair. Data may instead be organized in wide form, with a
single observation combining data from all years for a given individual or
combining data on all individuals for a given year. Then, the data need to be
converted from wide form to long form by using the reshape command
presented in section 8.10.
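To make the distinction concrete, the following minimal sketch converts a tiny
wide-form dataset to long form. The data values and the variable names lwage1
to lwage3 are hypothetical and are typed in directly, so the example is
self-contained; full details of reshape are left to section 8.10.

```stata
* Wide form: one observation per individual, one wage variable per period
clear
input id lwage1 lwage2 lwage3
1 5.5 5.7 6.0
2 6.1 6.2 6.4
end
list, clean

* Long form: one observation per individual-period pair
* i() names the panel identifier; j() names the new time variable
reshape long lwage, i(id) j(t)
list, clean

* The long-form data can now be declared as panel data
xtset id t
```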


Data organization can often be clear from listing the first few 
observations. For brevity, we list the first three observations for a few 
variables: 


. * Organization of dataset 
. list id t exp wks occ in 1/3, clean 


id t exp wks occ 


1. 1 1 3 32 0 
2. 1 2 4 43 0 
3. 1 3 5 40 0 


The first observation is for individual 1 in year 1, the second observation is
for individual 1 in year 2, and so on. These data are thus in long form. From
summarize, the panel identifier id takes on the values 1–595, and the time
variable t takes on the values 1–7. In general, the panel identifier need just be
a unique identifier, and the time variable could take on values of, for
example, 76–82.


The panel-data xt commands require that, at a minimum, the panel
identifier be declared. Many xt commands also require that the time
identifier be declared. This is done by using the xtset command. Here we
declare both identifiers:


. * Declare individual identifier and time identifier 
. xtset id t 


Panel variable: id (strongly balanced) 
Time variable: t, 1 to 7 
Delta: 1 unit 


The panel identifier is given first, followed by the optional time identifier. 
The output indicates that data are available for all individuals in all time 
periods (strongly balanced), and the time variable increments uniformly by 
one. 


When a Stata dataset is saved, the current settings, if any, from xtset are
also saved. In this particular case, the original dataset mus208psid.dta
already contained this information, so the preceding xtset command was
actually unnecessary. The xtset command without any arguments reveals the
current settings, if any.


8.3.3 Panel-data description 


Once the panel data are xtset, the xtdescribe command provides 
information about the extent to which the panel is unbalanced. 


. * Panel description of dataset
. xtdescribe

      id:  1, 2, ..., 595                                    n =        595
       t:  1, 2, ..., 7                                      T =          7
           Delta(t) = 1 unit
           Span(t)  = 7 periods
           (id*t uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         7       7       7         7         7       7       7

     Freq.  Percent    Cum.    Pattern

      595    100.00  100.00    1111111

      595    100.00            XXXXXXX


In this case, all 595 individuals have exactly 7 years of data. The data are 
therefore balanced because, additionally, the earlier summarize command 
showed that there are no missing values. Section 22.3 provides an example of 
xtdescribe with unbalanced data and covers ways of balancing an 
unbalanced sample if so desired. 
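For contrast with the balanced case above, the following minimal sketch
constructs a small unbalanced panel and describes it. The dataset is
hypothetical: four individuals over three periods, with one individual leaving
early and another having a gap.

```stata
* Build a 4 x 3 panel and then remove two observations
clear
set obs 4
generate id = _n
expand 3
bysort id: generate t = _n
drop if id == 2 & t == 3                  // individual 2 leaves after period 2
drop if id == 4 & t == 2                  // individual 4 has a gap in period 2

* xtset now reports that the panel is unbalanced
xtset id t
local panel_status "`r(balanced)'"
display "Panel status: `panel_status'"

* xtdescribe lists each participation pattern, with a dot for a missing period
xtdescribe
```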


8.3.4 Within and between variation 


Dependent variables and regressors can potentially vary over both time and
individuals. Variation over time for a given individual is called within
variation, and variation across individuals is called between variation. This
distinction is important because estimators differ in their use of within and
between variation. In particular, in the FE model, the coefficient of a regressor
with little within variation will be imprecisely estimated and will not be
identified if there is no within variation at all.


The xtsum, xttab, and xttrans commands provide information on the 
relative importance of within variation and between variation of a variable. 


We begin with xtsum. The total variation (around grand mean 
$\bar{x} = 1/(NT)\sum_i\sum_t x_{it}$) can be decomposed into within variation 
over time for each individual (around individual mean 
$\bar{x}_i = 1/T\sum_t x_{it}$) and between variation across individuals (for 
$\bar{x}_i$ around $\bar{x}$). The corresponding decomposition for the 
variance is 

Within variance: $s_W^2 = \frac{1}{NT-1}\sum_i\sum_t (x_{it}-\bar{x}_i)^2 = \frac{1}{NT-1}\sum_i\sum_t \{(x_{it}-\bar{x}_i+\bar{x})-\bar{x}\}^2$ 

Between variance: $s_B^2 = \frac{1}{N-1}\sum_i (\bar{x}_i-\bar{x})^2$ 

Overall variance: $s_O^2 = \frac{1}{NT-1}\sum_i\sum_t (x_{it}-\bar{x})^2$ 

The second expression for $s_W^2$ is equivalent to the first, because adding a 
constant does not change the variance, and is used at times because 
$x_{it}-\bar{x}_i+\bar{x}$ is centered on $\bar{x}$, providing a sense of scale, 
whereas $x_{it}-\bar{x}_i$ is centered on zero. For unbalanced data, replace 
$NT$ in the formulas with $\sum_i T_i$. It can be shown that 
$s_O^2 \simeq s_W^2 + s_B^2$. 


The xtsum command provides this variance decomposition. We do this 
for selected regressors and obtain 


. * Panel summary statistics: Within and between variation 
. xtsum id t lwage ed exp exp2 wks south tdum1 

Variable         |      Mean   Std. dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
id       overall |       298   171.7821          1        595 |     N =    4165
         between |              171.906          1        595 |     n =     595
         within  |                    0        298        298 |     T =       7
                 |                                            |
t        overall |         4    2.00024          1          7 |     N =    4165
         between |                    0          4          4 |     n =     595
         within  |              2.00024          1          7 |     T =       7
                 |                                            |
lwage    overall |  6.676346   .4615122    4.60517      8.537 |     N =    4165
         between |             .3942387     5.3364   7.813596 |     n =     595
         within  |             .2404023   4.781808   8.621092 |     T =       7
                 |                                            |
ed       overall |  12.84538   2.787995          4         17 |     N =    4165
         between |             2.790006          4         17 |     n =     595
         within  |                    0   12.84538   12.84538 |     T =       7
                 |                                            |
exp      overall |  19.85378   10.96637          1         51 |     N =    4165
         between |             10.79018          4         48 |     n =     595
         within  |              2.00024   16.85378   22.85378 |     T =       7
                 |                                            |
exp2     overall |   514.405   496.9962          1       2601 |     N =    4165
         between |             489.0495         20       2308 |     n =     595
         within  |             90.44581    231.405    807.405 |     T =       7
                 |                                            |
wks      overall |  46.81152   5.129098          5         52 |     N =    4165
         between |             3.284016   31.57143   51.57143 |     n =     595
         within  |             3.941881    12.2401   63.66867 |     T =       7
                 |                                            |
south    overall |  .2902761   .4539442          0          1 |     N =    4165
         between |             .4489462          0          1 |     n =     595
         within  |             .0693042  -.5668667   1.147419 |     T =       7
                 |                                            |
tdum1    overall |  .1428571   .3499691          0          1 |     N =    4165
         between |                    0   .1428571   .1428571 |     n =     595
         within  |             .3499691          0          1 |     T =       7


Time-invariant regressors have zero within variation, so the individual 
identifier id and the variable ed are time invariant. Individual-invariant 
regressors have zero between variation, so the time identifier t and the time 
dummy tdum1 are individual invariant. For all other variables but wks, there 
is more variation across individuals (between variation) than over time 
(within variation), so within estimation may lead to considerable efficiency 
loss. What is not clear from the output from xtsum is that while variable exp 
has nonzero within variation, it evolves deterministically because for this 
sample, exp increments by one with each additional period. The min and max 
columns give the minimums and maximums of $x_{it}$ for overall, $\bar{x}_i$ for 
between, and $x_{it}-\bar{x}_i+\bar{x}$ for within. 


In the xtsum output, Stata uses lowercase n to denote the number of 
individuals and uppercase N to denote the total number of individual-time 
observations. In our notation, these quantities are, respectively, $N$ and 
$\sum_i T_i$. 
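
The decomposition reported by xtsum can be reproduced by hand. The following is a minimal sketch for lwage that assumes the balanced panel used here. The names xbar, xbar_i, x_within, and one_per_id are illustrative and are not part of the dataset.

```stata
* Sketch: manual within and between decomposition for lwage
use mus208psid, clear
quietly summarize lwage
scalar xbar = r(mean)                          // grand mean
bysort id: egen double xbar_i = mean(lwage)    // individual means
generate double x_within = lwage - xbar_i + xbar
summarize x_within                             // compare with the "within" row of xtsum
egen byte one_per_id = tag(id)                 // flags one observation per individual
summarize xbar_i if one_per_id                 // compare with the "between" row of xtsum
```

Because summarize divides by the number of observations minus one, the first summary uses the divisor $NT-1$ and the second uses $N-1$, matching the formulas for $s_W^2$ and $s_B^2$.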


The xttab command tabulates data in a way that provides additional 
details on the within and between variation of a variable. For example, 


. * Panel tabulation for a variable 
. xttab south 


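                  Overall             Between            Within
    south |    Freq.  Percent      Freq.  Percent        Percent
----------+-----------------------------------------------------
        0 |    2956     70.97       428     71.93          98.66
        1 |    1209     29.03       182     30.59          94.90
----------+-----------------------------------------------------
    Total |    4165    100.00       610    102.52          97.54
                               (n = 595)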


The overall summary shows that 71% of the 4,165 individual-year 
observations had south = 0 and 29% had south = 1. The between 
summary indicates that of the 595 people, 72% had south = 0 at least once 
and 31% had south = 1 at least once. The between total percentage is 
102.52 because 2.52% of the sampled individuals (15 persons) lived some of 
the time in the south and some not in the south and hence are double counted. 
The within summary indicates that 95% of people who ever lived in the south 
always lived in the south during the time period covered by the panel and 
99% who lived outside the south always lived outside the south. The south 
variable is close to time invariant. 
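
The double counting behind the between total can be checked directly. The sketch below counts individuals whose value of south changes during the panel. The variable names south_min, south_max, and one_per_id are illustrative.

```stata
* Sketch: count individuals who are observed both in and out of the south
use mus208psid, clear
bysort id: egen byte south_min = min(south)
bysort id: egen byte south_max = max(south)
egen byte one_per_id = tag(id)                 // flags one observation per individual
count if one_per_id & south_min != south_max   // movers, counted once each
```

The count should correspond to the difference between the between total (610) and the number of individuals (595) in the xttab output.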


The xttab command is most useful when the variable takes on few values 
because then there are few values to tabulate and interpret. 


The xttrans command provides transition probabilities from one period 
to the next. For example, 


. * Transition probabilities for a variable 
. xttrans south, freq 


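Residence; |
  south==1 |  Residence; south==1
 if in the |  if in the South area
South area |         0          1 |     Total
-----------+----------------------+----------
         0 |     2,527          8 |     2,535
           |     99.68       0.32 |    100.00
-----------+----------------------+----------
         1 |         8      1,027 |     1,035
           |      0.77      99.23 |    100.00
-----------+----------------------+----------
     Total |     2,535      1,035 |     3,570
           |     71.01      28.99 |    100.00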


One time period is lost in calculating transitions, so 3,570 observations are 
used. For time-invariant data, the diagonal entries will be 100%, and the off- 
diagonal entries will be 0%. For south, 99.2% of the observations ever in the 
south for one period remain in the south for the next period. And for those 
who did not live in the south for one period, 99.7% remained outside the 
south for the next period. The south variable is close to time invariant. 
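
A transition table of this kind can also be built by hand, which makes clear where the lost period goes. This sketch assumes the data are already xtset and uses an illustrative variable name, south_lag.

```stata
* Sketch: transition frequencies and row percentages from a lagged variable
use mus208psid, clear
xtset id t
generate byte south_lag = L.south      // missing in each individual's first period
tabulate south_lag south, row          // rows: previous period; columns: current period
```

Observations with a missing lag are dropped by tabulate, so one period per individual is excluded, as in the xttrans output.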


The xttrans command is most useful when the variable takes on few 
values. 


8.3.5 Time-series plots for each individual 


It can be useful to provide separate time-series plots for some or all 
individual units. 


Separate time-series plots of a variable for one or more individuals can be 
obtained by using the xtline command. The overlay option overlays the 
plots for each individual on the same graph. For example, 


. quietly xtline lwage if id<=20, overlay 


produces overlaid time-series plots of lwage for the first 20 individuals in the 
sample. 


We provide time-series plots for the first 20 individuals in the sample. 
By default, the graph includes a legend that identifies each individual, and 
this legend takes up much of the graph if data from many individuals are 
used. The legend can be suppressed by using the legend(off) option. 
Separate plots are obtained for lwage and for wks, and these are then 
combined by using the graph combine command. We have 


. * Simple time-series plot for each of 20 individuals 
. qui xtline lwage if id<=20, overlay legend(off) saving(graph1.gph, replace) 
. qui xtline wks if id<=20, overlay legend(off) saving(graph2.gph, replace) 
. graph combine graph1.gph graph2.gph, iscale(1.4) ysize(2.5) xsize(6.0) 


Figure 8.1 shows that the wage rate increases roughly linearly over time, 
aside from two individuals with large increases from years 1 to 2, and that 
weeks worked show no discernible trend over time. 

Figure 8.1. Time-series plots of log wage against year and weeks 
worked against year for each of the first 20 observations 


8.3.6 Overall scatterplot 


In cases where there is one key regressor, we can begin with a scatterplot of 
the dependent variable on the key regressor, using data from all panel 
observations. 


The following command adds fitted quadratic regression and lowess 
regression curves to the scatterplot. 


. graph twoway (scatter lwage exp) (qfit lwage exp) (lowess lwage exp) 


This produces a graph that is difficult to read because the scatterplot points 
are very large, making it hard to then see the regression curves. 


The following code presents a better-looking scatterplot of lwage on 
exp, along with the fitted regression lines. It uses the same graph options as 
those explained in section 2.6.6. We have 


. * Scatterplot, quadratic fit, and nonparametric regression (lowess) 
. graph twoway (scatter lwage exp, msize(small) msymbol(o))                ///
>     (qfit lwage exp, clstyle(p3) lwidth(medthick))                       ///
>     (lowess lwage exp, bwidth(0.4) clstyle(p1) lwidth(medthick)),        ///
>     plotregion(style(none)) scale(1.2)                                   ///
>     title("Overall variation: log wage versus experience")               ///
>     xtitle("Years of experience", size(medlarge)) xscale(titlegap(*5))   ///
>     ytitle("log hourly wage", size(medlarge)) yscale(titlegap(*5))       ///
>     legend(pos(4) ring(0) col(1)) legend(size(small))                    ///
>     legend(label(1 "Actual data") label(2 "Quadratic fit") label(3 "Lowess")) 

Each point in figure 8.2 represents an individual-year pair. The dashed 
smooth curve is fit by OLS of lwage on a quadratic in exp (using qfit), 
and the solid line is fit by nonparametric regression (using lowess). Log 
wage increases until 30 or so years of experience and then declines. 


Figure 8.2. Overall scatterplot of log wage against experience 
using all observations 


8.3.7 Within scatterplot 


The xtdata command can be used to obtain similar plots for within variation, 
using option fe; between variation, using option be; and RE variation (the 
default), using option re. The xtdata command replaces the data in memory 
with the specified transform, so you should first preserve the data and then 
restore the data when you are finished with the transformed data. 


For example, the fe option creates deviations from means, so that 
$(y_{it}-\bar{y}_i+\bar{y})$ is plotted against $(x_{it}-\bar{x}_i+\bar{x})$. 
For lwage plotted against exp, we obtain 


. * Scatterplot for within variation 
. preserve 

. xtdata, fe 

. graph twoway (scatter lwage exp, msize(small) msymbol(o))                ///
>     (qfit lwage exp, clstyle(p3) lwidth(medthick))                       ///
>     (lowess lwage exp, bwidth(0.4) clstyle(p1) lwidth(medthick)),        ///
>     plotregion(style(none)) scale(1.2)                                   ///
>     title("Within variation: log wage versus experience")                ///
>     xtitle("Years of experience", size(medlarge)) xscale(titlegap(*5))   ///
>     ytitle("log hourly wage", size(medlarge)) yscale(titlegap(*5))       ///
>     legend(pos(11) ring(0) col(1)) legend(size(small))                   ///
>     legend(label(1 "Actual data") label(2 "Quadratic fit") label(3 "Lowess")) 

. restore 


The result is given in figure 8.3. At first glance, this figure is puzzling 
because only seven distinct values of exp appear. But the panel is balanced, 
and exp (years of work experience) is increasing by exactly one each period 
for each individual in this sample of people who worked every year. So 
$(x_{it}-\bar{x}_i)$ increases by one each period, as does $(x_{it}-\bar{x}_i+\bar{x})$. The latter 
quantity is centered on $\bar{x} = 19.85$ (see section 8.3.1), which is the value in 
the middle year with t = 4. Clearly, it can be very useful to plot a figure such 
as this. 


Figure 8.3. Within scatterplot of log-wage deviations from 
individual means against experience deviations from 
individual means 


8.3.8 Pooled OLS regression with cluster-robust standard errors 


A natural starting point is a pooled OLS regression for log wage using data for 
all individuals in all years. 


We include as regressors education, weeks worked, and a quadratic in 
experience. Education is a time-invariant regressor, taking the same value 
each year for a given individual. Weeks worked is an example of a 
time-varying regressor. Experience is also time-varying, though it is so 
deterministically because the sample comprises people who work full time in 
all years, so experience increases by one year as $t$ increments by one. 


Regressing $y_{it}$ on $\mathbf{x}_{it}$ yields consistent estimates of 
$\boldsymbol{\beta}$ if the composite error $u_{it}$ in the pooled model of 
(8.2) is uncorrelated with $\mathbf{x}_{it}$. As explained in section 8.2, the 
error $u_{it}$ is likely to be correlated over time for a given individual, so 
we use cluster-robust standard errors that cluster on the individual. We have 


. * Pooled OLS with cluster-robust standard errors 
. use mus208psid, clear 
(A.C.Cameron & P.K.Trivedi (2022): Microeconometrics Using Stata, 2e) 

. regress lwage exp exp2 wks ed, vce(cluster id) 

Linear regression                               Number of obs     =      4,165
                                                F(4, 594)         =      72.58
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2836
                                                Root MSE          =     .39082

                                   (Std. err. adjusted for 595 clusters in id)
------------------------------------------------------------------------------
             |               Robust
       lwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         exp |    .044675   .0054385     8.21   0.000     .0339941     .055356
        exp2 |  -.0007156   .0001285    -5.57   0.000    -.0009679   -.0004633
         wks |    .005827   .0019284     3.02   0.003     .0020396    .0096144
          ed |   .0760407   .0052122    14.59   0.000     .0658042    .0862772
       _cons |   4.907961   .1399887    35.06   0.000     4.633028    5.182894
------------------------------------------------------------------------------


The output shows that $R^2 = 0.28$, and the estimates imply that wages 
increase with experience until a peak at 31 years [$= 0.0447/(2 \times 0.00072)$] 
and then decline. Wages increase by 0.6% with each additional week worked. 
And wages increase by 7.6% with each additional year of education. 
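
The turning point can also be computed from the stored coefficients rather than from rounded values. This is a minimal sketch using nlcom, which additionally reports a delta-method standard error. The label peak is an illustrative name.

```stata
* Sketch: experience level at which predicted log wage peaks
use mus208psid, clear
quietly regress lwage exp exp2 wks ed, vce(cluster id)
nlcom (peak: -_b[exp]/(2*_b[exp2]))
```

Setting the derivative of the quadratic to zero, $\beta_{exp} + 2\beta_{exp2}\,exp = 0$, gives the expression passed to nlcom.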


For panel data, it is essential that OLS standard errors be corrected for 
clustering on the individual. In contrast, the default standard errors assume 
that the regression errors are independent and identically distributed (i.i.d.). 
Using the default standard errors, we obtain 


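. * Pooled OLS with incorrect default standard errors 
. regress lwage exp exp2 wks ed 

      Source |       SS           df       MS      Number of obs   =     4,165
-------------+----------------------------------   F(4, 4160)      =    411.62
       Model |  251.491445         4  62.8728613   Prob > F        =    0.0000
    Residual |  635.413457     4,160  .152743619   R-squared       =    0.2836
-------------+----------------------------------   Adj R-squared   =    0.2829
       Total |  886.904902     4,164  .212993492   Root MSE        =    .39082

------------------------------------------------------------------------------
       lwage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         exp |    .044675   .0023929    18.67   0.000     .0399838    .0493663
        exp2 |  -.0007156   .0000528   -13.56   0.000    -.0008191   -.0006121
         wks |    .005827   .0011827     4.93   0.000     .0035084    .0081456
          ed |   .0760407   .0022266    34.15   0.000     .0716754     .080406
       _cons |   4.907961   .0673297    72.89   0.000     4.775959    5.039963
------------------------------------------------------------------------------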


These standard errors are misleadingly small; the cluster—robust standard 
errors are, respectively, 0.0054, 0.0001, 0.0019, and 0.0052. 


If log wage is overpredicted in one year for a given person, then it is 
likely to be overpredicted in other years as well. Failure to control for this 
error correlation leads to underestimation of standard errors because, 
intuitively, each additional observation for a given person actually provides 
less than an independent piece of new information. 


The difference between default and cluster-robust standard errors for 
pooled OLS can be very large. The difference increases with increasing $T$, 
increasing autocorrelation in model errors, and increasing autocorrelation of 
the regressor of interest. Specifically, from section 3.4.6, the standard error 
inflation factor is $\tau \simeq \sqrt{1 + \rho_u\rho_x(T-1)}$, where $\rho_u$ is 
the intraclass correlation of the error, defined below in (8.4), and $\rho_x$ is 
the intraclass correlation of the regressor. Here $\rho_u \simeq 0.80$, shown 
below, and for time-invariant regressor ed, $\rho_x = 1$, so 
$\tau \simeq \sqrt{1 + 0.80 \times 1 \times 6} = \sqrt{5.8} \simeq 2.41$ 
for ed. Similarly, the regressor exp has $\rho_x$ close to 1 because for this 
sample, experience increases by one year as $t$ increments by 1. 
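
The approximation can be set against the ratio of the two sets of standard errors actually obtained. The sketch below does this for ed. The value 0.80 for the error intraclass correlation is taken from the text, and the scalar name se_default is illustrative.

```stata
* Sketch: observed versus approximate standard-error inflation for ed
use mus208psid, clear
quietly regress lwage exp exp2 wks ed
scalar se_default = _se[ed]                    // default (i.i.d.) standard error
quietly regress lwage exp exp2 wks ed, vce(cluster id)
display "Observed ratio for ed:        " %5.2f _se[ed]/se_default
display "Approximation sqrt(1+.8*1*6): " %5.2f sqrt(1 + 0.80*1*(7 - 1))
```

From the two tables printed above, the observed ratio for ed is 0.0052122/0.0022266, or about 2.3, which is close to the approximate factor.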


Cluster-robust standard errors require that $N \to \infty$ and that errors are 
independent over $i$. The assumption of independence over $i$ can be relaxed to 
independence at a more aggregated level, provided that the number of units is 
still large and the units nest the individual. For example, the Panel Study of 
Income Dynamics is a household survey, and errors for individuals from the 
same household may be correlated. If, say, houseid is available as a 
household identifier, then we would use the vce(cluster houseid) option. 
As a second example, if the regressor of interest is aggregated at the state 
level, such as a state policy variable, and there are many states, then it may be 
better to use the vce(cluster state) option. 


8.3.9 Time-series autocorrelations for panel data 


The Stata time-series operators can be applied to panel data when both panel 
and time identifiers are set with the xtset command. Examples include 
L.lwage or L1.lwage for lwage lagged once, L2.lwage for lwage lagged 
twice, D.lwage for the difference in lwage (equals lwage - L.lwage), 
LD.lwage for this difference lagged once, and L2D.lwage for this difference 
lagged twice. 


Use of these operators is the best way to create lagged variables because 
relevant missing values are automatically and correctly created. For example, 
regress lwage L2.lwage will use $(7 - 2) \times 595$ observations because forming 
L2.lwage leads to a loss of the first 2 years of data for each of the 595 
individuals. 
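
To illustrate why the operators are preferable, the following sketch contrasts L.lwage with a lag built from explicit subscripting. The variable names lwage_lag and lwage_bad are illustrative.

```stata
* Sketch: lag operator versus naive subscripting in a panel
use mus208psid, clear
xtset id t
generate double lwage_lag = L.lwage        // respects panel boundaries
sort id t
generate double lwage_bad = lwage[_n-1]    // naive: ignores panel boundaries
count if missing(lwage_lag)                // one missing value per individual
count if missing(lwage_bad)                // only the first observation in the data
list id t lwage lwage_lag lwage_bad if id <= 2, sepby(id)
```

In the listing, the naive lag for the first year of individual 2 picks up the last year of individual 1, whereas the operator-based lag is missing there.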


The corrgram command for computing autocorrelations of time-series 
data does not work for panel data. Instead, autocorrelations can be obtained 
by using the correlate command. For example, 


. * First-order autocorrelation in a variable 
. sort id t 
. correlate lwage L.lwage 
(obs=3,570) 

             |                 L.
             |    lwage    lwage
-------------+------------------
       lwage |
         --. |   1.0000
         L1. |   0.9189   1.0000


calculates the first-order autocorrelation coefficient for lwage to be 0.92. 


We now calculate autocorrelations at all lags (here up to 6 periods). 
Rather than doing so for lwage, we do so for the residuals from the previous 
pooled OLS regression for lwage. We have 


. * Autocorrelations of residual 
. qui regress lwage exp exp2 wks ed, vce(cluster id) 

. predict uhat, residuals 

. forvalues j = 1/6 { 
  2.     qui corr uhat L`j'.uhat 
  3.     display "Autocorrelation at lag `j' = " %6.3f r(rho) 
  4. } 
Autocorrelation at lag 1 =  0.884 
Autocorrelation at lag 2 =  0.838 
Autocorrelation at lag 3 =  0.811 
Autocorrelation at lag 4 =  0.786 
Autocorrelation at lag 5 =  0.750 
Autocorrelation at lag 6 =  0.729 


The forvalues loop leads to separate computation of each autocorrelation to 
maximize the number of observations used. If instead we gave a one-line 
command to compute the autocorrelations of uhat through L6.uhat, then 
only 595 observations would have been used. Here $6 \times 595$ observations are 
used to compute the autocorrelation at lag 1, $5 \times 595$ observations are used to 
compute the autocorrelation at lag 2, and so on. The average of the 
autocorrelations, 0.80, provides a rough estimate of the intraclass correlation 
coefficient of the residuals. 


Clearly, the errors are serially correlated, and cluster-robust standard 
errors after pooled OLS are required. The individual-effects model provides an 
explanation for this correlation. If the error $u_{it} = \alpha_i + \varepsilon_{it}$, 
then even if $\varepsilon_{it}$ is i.i.d. $(0, \sigma_\varepsilon^2)$, we have 
$\mathrm{Cor}(u_{it}, u_{is}) \neq 0$ for $t \neq s$ if $\alpha_i \neq 0$. The individual 
effect $\alpha_i$ induces correlation over time for a given individual. 


The preceding estimated autocorrelations are constant across years. For 
example, the correlation of uhat with L.uhat across years 1 and 2 is assumed 
to be the same as that across years 2 and 3, years 3 and 4, ..., years 6 and 7. 
This presumes that the errors are stationary. 


In the nonstationary case, the autocorrelations will differ across pairs of 
years. For example, we consider the autocorrelations one year apart and allow 
these to differ across the year pairs. We have 


. * First-order autocorrelation differs in different year pairs 
. forvalues s = 2/7 { 
  2.     qui corr uhat L1.uhat if t == `s' 
  3.     display "Autocorrelation at lag 1 in year `s' = " %6.3f r(rho) 
  4. } 
Autocorrelation at lag 1 in year 2 =  0.915 
Autocorrelation at lag 1 in year 3 =  0.799 
Autocorrelation at lag 1 in year 4 =  0.855 
Autocorrelation at lag 1 in year 5 =  0.867 
Autocorrelation at lag 1 in year 6 =  0.894 
Autocorrelation at lag 1 in year 7 =  0.893 


The lag-1 autocorrelations for individual—year pairs range from 0.80 to 0.92, 
and their average is 0.87. From the earlier output, the lag-1 autocorrelation 
equals 0.88 when it is constrained to be equal across all year pairs. It is 
common to impose equality for simplicity. 


8.3.10 Error correlation in the RE model 


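For the individual-effects model (8.1), the combined error is 
$u_{it} = \alpha_i + \varepsilon_{it}$. The RE model assumes that $\alpha_i$ is 
i.i.d. with a variance of $\sigma_\alpha^2$ and that $\varepsilon_{it}$ is 
i.i.d. with a variance of $\sigma_\varepsilon^2$. 


Then $u_{it}$ has a variance of 
$\mathrm{Var}(u_{it}) = \sigma_\alpha^2 + \sigma_\varepsilon^2$ and a covariance of 
$\mathrm{Cov}(u_{it}, u_{is}) = \sigma_\alpha^2$, $s \neq t$. It follows that in the 
RE model, 


$$\rho_u = \mathrm{Cor}(u_{it}, u_{is}) = \sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2), \quad \text{for all } s \neq t \tag{8.4}$$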


This constant correlation is called the intraclass correlation of the error. 
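
The steps behind (8.4) can be written out as follows. This is a sketch that assumes, in addition to the i.i.d. assumptions of the RE model, that $\alpha_i$ is independent of $\varepsilon_{it}$ for all $t$.

```latex
\begin{align*}
\mathrm{Var}(u_{it})
  &= \mathrm{Var}(\alpha_i) + \mathrm{Var}(\varepsilon_{it})
   = \sigma_\alpha^2 + \sigma_\varepsilon^2 \\
\mathrm{Cov}(u_{it}, u_{is})
  &= \mathrm{Cov}(\alpha_i + \varepsilon_{it},\; \alpha_i + \varepsilon_{is})
   = \mathrm{Var}(\alpha_i) = \sigma_\alpha^2, \qquad s \neq t \\
\mathrm{Cor}(u_{it}, u_{is})
  &= \frac{\mathrm{Cov}(u_{it}, u_{is})}
          {\sqrt{\mathrm{Var}(u_{it})\,\mathrm{Var}(u_{is})}}
   = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\varepsilon^2}
\end{align*}
```

The cross terms vanish because $\varepsilon_{it}$ and $\varepsilon_{is}$ are uncorrelated with each other and with $\alpha_i$, so only the shared component $\alpha_i$ contributes to the covariance.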


The RE model therefore permits serial correlation in the model error. This 
correlation can approach 1 if the random effect is large relative to the 
idiosyncratic error, so that $\sigma_\alpha^2$ is large relative to 
$\sigma_\varepsilon^2$. 


This serial correlation is restricted to be the same at all lags, and the 
errors $u_{it}$ are then called equicorrelated or exchangeable. From section 8.3.9, 
the error correlations were, respectively, 0.88, 0.84, 0.81, 0.79, 0.75, and 
0.73, so a better model may be one that allows the error correlation to 
decrease with the lag length. 


8.4 Pooled or population-averaged estimators 


Pooled estimators simply regress $y_{it}$ on an intercept and $\mathbf{x}_{it}$, using both 
between (cross-section) and within (time-series) variation in the data. 
Standard errors need to adjust for any error correlation and, given a model 
for error correlation, more efficient FGLS estimation is possible. Pooled 
estimators, called PA estimators in the statistics literature, are consistent if the 
RE model is appropriate and are inconsistent if the FE model is appropriate. 


8.4.1 Pooled OLS estimator 


The pooled OLS estimator can be motivated from the individual-effects model 
by rewriting (8.1) as the pooled model 


$$y_{it} = \alpha + \mathbf{x}_{it}'\boldsymbol{\beta} + (\alpha_i - \alpha + \varepsilon_{it}) \tag{8.5}$$


Any time-specific effects are assumed to be fixed and already included as 
time dummies in the regressors $\mathbf{x}_{it}$. The model (8.5) explicitly 
includes a common intercept, and the individual effects $\alpha_i - \alpha$ are 
now centered on zero. 


Consistency of OLS requires that the error term 
$(\alpha_i - \alpha + \varepsilon_{it})$ be uncorrelated with $\mathbf{x}_{it}$. So 
pooled OLS is consistent in the RE model but is inconsistent in the FE model 
because then $\alpha_i$ is correlated with $\mathbf{x}_{it}$. 


The pooled OLS estimator for our data example has already been 
presented in section 8.3.8. As emphasized there, cluster-robust standard 
errors are necessary in the common case of a short panel with independence 
across individuals. 


8.4.2 Pooled FGLS estimator or PA estimator 


Pooled feasible generalized least-squares (PFGLS) estimation can lead to 
estimators of the parameters of the pooled model (8.5) that are more efficient 
than OLS estimation. Again, we assume that any individual-level effects are 
uncorrelated with regressors, so PFGLS is consistent. 


Different assumptions about the correlation structure for the errors $u_{it}$ 
lead to different PFGLS estimators. In section 9.5, we present some estimators 
for long panels, using the xtgls and xtregar commands. 


Here we consider only short panels with errors independent across 
individuals. We need to model the $T \times T$ matrix of error correlations. An 
assumed correlation structure, called a working matrix, is specified, and the 
appropriate PFGLS estimator is obtained. To guard against the working matrix 
being a misspecified model of the error correlation, we compute 
cluster-robust standard errors. Better models for the error correlation lead 
to more efficient estimators, but the use of robust standard errors means that 
the estimators are not presumed to be fully efficient. 


In the statistics literature, the pooled approach is called a PA approach, 
because any individual effects are assumed to be random and are averaged 
out. The PFGLS estimator is then called the PA estimator. 


8.4.3 The xtreg, pa command 


The pooled estimator, or PA estimator, is obtained by using the xtreg 
command (see section 8.2.4) with the pa option. The two key additional 
options are corr(), to place different restrictions on the error correlations, 
and vce(robust), to obtain cluster-robust standard errors that are valid even 
if corr() does not specify the correct correlation model, provided that 
observations are independent over $i$ and $N \to \infty$. 


Let $\rho_{ts} = \mathrm{Cor}(u_{it}, u_{is})$, the error correlation over time for 
individual $i$, and note the restriction that $\rho_{ts}$ does not vary with $i$. 
The corr() options all set $\rho_{tt} = 1$ but differ in the model for 
$\rho_{ts}$ for $t \neq s$. With $T$ time periods, the correlation matrix is 
$T \times T$, and there are potentially as many as $T(T-1)/2$ unique 
off-diagonal entries. 


The corr(independent) option sets $\rho_{ts} = 0$ for $s \neq t$. Then the PA 
estimator equals the pooled OLS estimator. 


The corr(exchangeable) option sets $\rho_{ts} = \rho$ for all $s \neq t$ so that 
errors are assumed to be equicorrelated. This assumption is imposed by the RE 
model (see section 8.3.10), and as a result, xtreg, pa with this option is 
asymptotically equivalent to xtreg, re. 
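
These two equivalences can be checked side by side. The following is a minimal sketch that stores four sets of estimates under the illustrative names OLS, PA_IND, PA_EXCH, and RE and then tabulates them.

```stata
* Sketch: compare pooled OLS, PA (independent), PA (exchangeable), and RE
use mus208psid, clear
xtset id t
quietly regress lwage exp exp2 wks ed, vce(cluster id)
estimates store OLS
quietly xtreg lwage exp exp2 wks ed, pa corr(independent) vce(robust)
estimates store PA_IND
quietly xtreg lwage exp exp2 wks ed, pa corr(exchangeable) vce(robust)
estimates store PA_EXCH
quietly xtreg lwage exp exp2 wks ed, re vce(cluster id)
estimates store RE
estimates table OLS PA_IND PA_EXCH RE, b(%9.5f) se(%9.5f)
```

The first two columns should have identical coefficients. The last two need only be close because the equivalence is asymptotic, and reported standard errors may differ slightly across commands because of different finite-sample adjustments.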


For panel data, it is often the case that the error correlation $\rho_{ts}$ 
declines as the time difference $|t - s|$ increases; the application in 
section 8.3.9 provided an example. The corr(ar k) option models this 
dampening by assuming an autoregressive process of order $k$, or AR($k$) 
process, for $u_{it}$. For example, corr(ar 1) assumes that 
$u_{it} = \rho_1 u_{i,t-1} + \varepsilon_{it}$, which implies that 
$\rho_{ts} = \rho_1^{|t-s|}$. The corr(stationary g) option instead uses a 
moving-average process, or MA($g$) process. This sets 
$\rho_{ts} = \rho_{|t-s|}$ if $|t - s| \leq g$ and $\rho_{ts} = 0$ if 
$|t - s| > g$. 
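
For concreteness, the working correlation matrices implied by the exchangeable and AR(1) structures can be written out for a hypothetical panel with $T = 3$. This is an illustrative sketch rather than Stata output.

```latex
\[
\mathbf{R}_{\text{exchangeable}} =
\begin{pmatrix}
1 & \rho & \rho \\
\rho & 1 & \rho \\
\rho & \rho & 1
\end{pmatrix},
\qquad
\mathbf{R}_{\text{AR(1)}} =
\begin{pmatrix}
1 & \rho_1 & \rho_1^2 \\
\rho_1 & 1 & \rho_1 \\
\rho_1^2 & \rho_1 & 1
\end{pmatrix}
\]
```

Each structure has a single free parameter, but only the AR(1) matrix lets the correlation decay with the distance $|t - s|$ from the main diagonal.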


The corr(unstructured) option places no restrictions on $\rho_{ts}$, aside from 
equality of $\rho_{i,ts}$ across individuals. Then 
$\widehat{\mathrm{Cov}}(u_{it}, u_{is}) = (1/N)\sum_i (\hat{u}_{it} - \bar{\hat{u}}_t)(\hat{u}_{is} - \bar{\hat{u}}_s)$. 
For small $T$, this may be the best model, but for larger $T$, the method can 
fail numerically because there are $T(T-1)/2$ unique parameters $\rho_{ts}$ 
to estimate. The corr(nonstationary g) option allows $\rho_{ts}$ to be 
unrestricted if $|t - s| \leq g$ and sets $\rho_{ts} = 0$ if $|t - s| > g$, so 
there are fewer correlation parameters to estimate. 


The PA estimator is also called the generalized estimating equations 
estimator in the statistics literature. The xtreg, pa command is the special 
case of xtgee with the family(gaussian) option. The more general xtgee 
command, presented in section 22.4.4, has other options that permit 
application to a wide range of nonlinear panel models. 
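
As a sketch of that correspondence, the command below requests a linear PA model through xtgee by specifying the Gaussian family together with the identity link. The AR(2) working correlation and the regressors are chosen only for illustration.

```stata
* Sketch: linear PA model fit with xtgee instead of xtreg, pa
use mus208psid, clear
xtset id t
xtgee lwage exp exp2 wks ed, family(gaussian) link(identity) ///
    corr(ar 2) vce(robust) nolog
```

The same model can be requested with xtreg lwage exp exp2 wks ed, pa corr(ar 2) vce(robust), and the two sets of estimates can then be compared.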


8.4.4 Application of the xtreg, pa command 


As an example, we specify an AR(2) error process. We have 


. * Population-averaged or pooled FGLS estimator with AR(2) error 
. xtreg lwage exp exp2 wks ed, pa corr(ar 2) vce(robust) nolog 

GEE population-averaged model                   Number of obs     =      4,165
Group and time vars: id t                       Number of groups  =        595
Family: Gaussian                                Obs per group:
Link:   Identity                                              min =          7
Correlation: AR(2)                                            avg =        7.0
                                                              max =          7
                                                Wald chi2(4)
Scale parameter = .1966639                      Prob > chi2

                                     (Std. err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |               Robust
       lwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         exp |   .0718915    .003999    17.98   0.000     .0640535    .0797294
        exp2 |  -.0008966   .0000933    -9.61   0.000    -.0010794   -.0007137
         wks |   .0002964   .0010553     0.28   0.779     -.001772    .0023647
          ed |   .0905069   .0060161    15.04   0.000     .0787156    .1022982
       _cons |   4.526381   .1056897    42.83   0.000     4.319233    4.733529
------------------------------------------------------------------------------


The coefficients change considerably compared with the coefficients from 
pooled OLS. The cluster-robust standard errors are smaller than those from 
pooled OLS for all regressors except ed, illustrating the desired improved 
efficiency because of better modeling of the error correlations. Note that 
unlike the pure time-series case, controlling for autocorrelation does not lead 
to the loss of initial observations. 


The estimated correlation matrix is stored in e(R). We have 


. * Estimated error correlation matrix after xtreg, pa 
. matrix list e(R) 

symmetric e(R)[7,7]
           c1         c2         c3         c4         c5         c6         c7
r1          1
r2  .89722058          1
r3  .84308581  .89722058          1
r4  .78392846  .84308581  .89722058          1
r5  .73064474  .78392846  .84308581  .89722058          1
r6   .6806209  .73064474  .78392846  .84308581  .89722058          1
r7  .63409777   .6806209  .73064474  .78392846  .84308581  .89722058          1

By comparison, from section 8.3.9, the autocorrelations of the errors after 
pooled OLS estimation were 0.88, 0.84, 0.81, 0.79, 0.75, and 0.73. 


In an end-of-chapter exercise, we compare estimates obtained using 
different error-correlation structures. 


8.5 Fixed-effects or within estimator 


Estimators of the parameters $\boldsymbol{\beta}$ of the FE model (8.1) must 
remove the fixed effects $\alpha_i$. The within transform does so by 
mean-differencing. The within estimator performs OLS on the mean-differenced 
data. Because all the observations of the mean-difference of a time-invariant 
variable are zero, we cannot estimate the coefficient on a time-invariant 
variable. 


Because the within estimator provides a consistent estimate of the FE 
model, it is often called the FE estimator, though the first-difference (FD) 
estimator given in section 8.9 also provides consistent estimates in the FE 
model. The within estimator is also consistent under the RE model, but 
alternative estimators are more efficient in the RE model. 


8.5.1 Within estimator 


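The fixed effects $\alpha_i$ in the model (8.1) can be eliminated by subtraction 
of the corresponding model for individual means 
$\bar{y}_i = \alpha_i + \bar{\mathbf{x}}_i'\boldsymbol{\beta} + \bar{\varepsilon}_i$, 
leading to the within model or mean-difference model 


$$(y_{it} - \bar{y}_i) = (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)'\boldsymbol{\beta} + (\varepsilon_{it} - \bar{\varepsilon}_i) \tag{8.6}$$


where, for example, $\bar{\mathbf{x}}_i = T_i^{-1}\sum_{t=1}^{T_i}\mathbf{x}_{it}$, 
and only regressors that have within variation are included in (8.6) because 
otherwise $x_{it} - \bar{x}_i = 0$. The 
within estimator is the OLS estimator of this model. 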
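
The within transform can be carried out by hand to see what the estimator does. The sketch below mean-differences lwage and the time-varying regressors and then applies OLS. The m_ and d_ variable prefixes are illustrative.

```stata
* Sketch: within estimator via explicit mean-differencing
use mus208psid, clear
foreach v of varlist lwage exp exp2 wks {
    bysort id: egen double m_`v' = mean(`v')   // individual mean
    generate double d_`v' = `v' - m_`v'        // deviation from individual mean
}
regress d_lwage d_exp d_exp2 d_wks, noconstant vce(cluster id)
```

The slope estimates should coincide with those from xtreg lwage exp exp2 wks, fe. The standard errors need not match exactly because regress does not account for the $N$ individual means that were estimated when forming its degrees-of-freedom adjustment.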


Because $\alpha_i$ has been eliminated, OLS leads to consistent estimates of 
$\boldsymbol{\beta}$ even if $\alpha_i$ is correlated with $\mathbf{x}_{it}$, as is 
the case in the FE model. This result is a great advantage of panel data. 
Consistent estimation is possible even with endogenous regressors 
$\mathbf{x}_{it}$, provided that $\mathbf{x}_{it}$ is correlated only with the 
time-invariant component of the error, $\alpha_i$, and not with the 
time-varying component of the error, $\varepsilon_{it}$. 


This desirable property of consistent parameter estimation in the FE 
model is tempered, however, by the inability to estimate the coefficient of a 
time-invariant regressor. Also the within estimator will be relatively 
imprecise for time-varying regressors that vary little over time. 


Stata actually fits the model 


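$$(y_{it} - \bar{y}_i + \bar{y}) = \alpha + (\mathbf{x}_{it} - \bar{\mathbf{x}}_i + \bar{\mathbf{x}})'\boldsymbol{\beta} + (\varepsilon_{it} - \bar{\varepsilon}_i + \bar{\varepsilon}) \tag{8.7}$$


where, for example, $\bar{y} = (1/N)\sum_i \bar{y}_i$ is the grand mean of 
$y_{it}$. This parameterization has the advantage of providing an intercept 
estimate, the average of the individual effects $\alpha_i$, while yielding the 
same slope estimate $\hat{\boldsymbol{\beta}}$ as that from the within model. 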


8.5.2 The xtreg, fe command 


The within estimator is computed by using the xtreg command (see 
section 8.2.4) with the fe option. The default standard errors assume that 
after one controls for $\alpha_i$, the error $\varepsilon_{it}$ is i.i.d. The 
vce(robust) option, equivalent to vce(cluster id), relaxes this assumption and 
provides cluster-robust standard errors, provided that observations are 
independent over $i$ and $N \to \infty$. 


8.5.3 Application of the xtreg, fe command 


For our data, we obtain 


. * Within or FE estimator with cluster-robust standard errors 
. xtreg lwage exp exp2 wks ed, fe vce(cluster id) 
note: ed omitted because of collinearity. 

Fixed-effects (within) regression               Number of obs     =      4,165
Group variable: id                              Number of groups  =        595

R-squared:                                      Obs per group:
     Within                                                   min =          7
     Between                                                  avg =        7.0
     Overall                                                  max =          7

                                                F(3,594)
corr(u_i, Xb) = -0.9107                         Prob > F

                                   (Std. err. adjusted for 595 clusters in id)
------------------------------------------------------------------------------
             |               Robust
       lwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         exp |   .1137879   .0040289    28.24   0.000     .1058753    .1217004
        exp2 |  -.0004244   .0000822    -5.16   0.000    -.0005858   -.0002629
         wks |   .0008359   .0008697     0.96   0.337    -.0008721    .0025439
          ed |          0  (omitted)
       _cons |   4.596396   .0600887    76.49   0.000     4.478384    4.714408
-------------+----------------------------------------------------------------
     sigma_u |  1.0362039
     sigma_e |  .15220316
         rho |  .97888036   (fraction of variance due to u_i)
------------------------------------------------------------------------------


Compared with the standard errors of pooled OLS, the standard errors here
have roughly tripled because only within variation of the data is being used.
The sigma_u and sigma_e entries are explained in section 8.8.1, and the $R^2$
measures are explained in section 8.8.2.


It is imperative to note that inclusion of fixed effects controls for only
some of the within-cluster correlation. From output not included, the
corresponding default standard errors are roughly 40% too small. They
equal, respectively, 0.00247, 0.00005, 0.00061, and 0.03891.


A striking result is that the coefficient for education is not identified. 
This is because the data on education is time invariant. In fact, given that we 
knew from the xtsum output in section 8.3.4 that ed had zero within standard 
deviation, we should not have included it as one of the regressors in the 
xtreg, fe command. 


This is unfortunate because how wages depend on education is of great
policy interest. Education is certainly endogenous because people with high
ability are likely to have on average both high education and high wages.
Alternative panel-data methods to control for endogeneity of the ed variable
are presented in chapter 9. In other panel applications, endogenous
regressors may be time varying, and the within estimator will suffice.


8.5.4 Least-squares dummy-variable regression 


It is best to use the xtreg, fe command to fit the FE model. For
completeness, we present an alternative method for obtaining consistent
estimates of $\boldsymbol{\beta}$ in the FE model.


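The Frisch-Waugh-Lovell theorem states that for the model
$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \mathbf{u}$, the estimate $\hat{\boldsymbol{\beta}}_2$ obtained from OLS regression of $\mathbf{y}$
on both $\mathbf{X}_1$ and $\mathbf{X}_2$ is equivalent to the estimate $\hat{\boldsymbol{\beta}}_2$ obtained by OLS
regression of $\mathbf{y}$ on just $\tilde{\mathbf{X}}_2$, where $\tilde{\mathbf{X}}_2$ is a matrix of residuals obtained from
OLS regression of each column of $\mathbf{X}_2$ on $\mathbf{X}_1$.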


The least-squares dummy-variable (LSDV) model introduces $N$
individual-specific indicator variables (or dummy variables) $d_{j,it}$,
$j = 1, \ldots, N$, where $d_{j,it} = 1$ for the $it$th observation if $j = i$ and $d_{j,it} = 0$
otherwise. Then, fit by OLS the model


$$y_{it} = \left(\sum_{j=1}^{N} \alpha_j d_{j,it}\right) + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it} \tag{8.8}$$


By the Frisch-Waugh-Lovell theorem, the resulting OLS estimate of $\boldsymbol{\beta}$ can
equivalently be found by first OLS regressing each component of $\mathbf{x}_{it}$ on
$d_{1,it}, \ldots, d_{N,it}$ and then obtaining $\hat{\boldsymbol{\beta}}$ by OLS regression of $y_{it}$ on these
residuals. But these residuals can be shown to equal $\mathbf{x}_{it} - \bar{\mathbf{x}}_i$, so the LSDV
estimate of $\boldsymbol{\beta}$ is equivalent to the within estimator. (The terminology "FE
estimator" arises because in the LSDV method, the individual-specific effects
$\alpha_i$ are treated as fixed quantities to be estimated.)


Direct estimation of (8.8) is unnecessarily computationally expensive
because an $(N + K) \times (N + K)$ matrix needs to be inverted. The areg
command with option absorb() avoids this by implementing the
Frisch-Waugh-Lovell method, so it reports only the estimates of the
parameters $\boldsymbol{\beta}$. We have


. * LSDV model fit using areg with cluster-robust standard errors
. areg lwage exp exp2 wks ed, absorb(id) vce(cluster id)
note: ed omitted because of collinearity.

Linear regression, absorbing indicators         Number of obs     =      4,165
Absorbed variable: id                           No. of categories =        595
                                                F(3, 594)         =     908.44
                                                Prob > F          =     0.0000
                                                R-squared         =     0.9068
                                                Adj R-squared     =     0.8912
                                                Root MSE          =     0.1522

                                   (Std. err. adjusted for 595 clusters in id)

             |               Robust
       lwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
         exp |   .1137879   .0043514    26.15   0.000     .1052418    .1223339
        exp2 |  -.0004244   .0000888    -4.78   0.000    -.0005988     -.00025
         wks |   .0008359   .0009393     0.89   0.374    -.0010089    .0026806
          ed |          0  (omitted)
       _cons |   4.596396   .0648993    70.82   0.000     4.468936    4.723856


The coefficient estimates are the same as those from xtreg, fe. Because of
a different small-sample correction (see section 6.6.4), the cluster-robust
standard errors following areg are approximately $\sqrt{T/(T-1)} = \sqrt{7/6}$
times larger than those from xtreg, fe, and the xtreg, fe standard errors
should be used. This difference arises because inference for areg is designed
for the case where $N$ is fixed and $T \to \infty$, whereas we are considering the
short-panel case, where $T$ is fixed and $N \to \infty$.
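The size of this correction can be checked against the reported output. The sketch below compares the theoretical ratio for T = 7 with the ratio of the two cluster-robust standard errors reported for exp (.0043514 from areg and .0040289 from xtreg, fe); the scalar names are hypothetical.

```stata
* Ratio implied by the small-sample corrections with T = 7
scalar ratio_theory = sqrt(7/6)
* Ratio of the reported cluster-robust standard errors for exp (areg / xtreg, fe)
scalar ratio_output = .0043514/.0040289
display "theory: " %6.4f ratio_theory "   output: " %6.4f ratio_output
```

Both values are approximately 1.08.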


For completeness, we use the regress command to directly fit the
LSDV model (8.8) by including a set of indicator variables for each individual
by inserting the i. operator before the categorical variable id. Because there
are $N + K$ regressors, the default setting of matsize may need to be
increased if Stata 15 or earlier is used, and the output from regress is very
long because it includes coefficients for all the dummy variables. We instead
suppress the output and use estimates table to list results for just the
coefficients of interest.


. * LSDV model fit using factor variables with cluster-robust standard errors
. qui regress lwage exp exp2 wks ed i.id, vce(cluster id)

. estimates table, keep(exp exp2 wks ed _cons) b se b(%12.7f)

    Variable |     Active
         exp |   0.1137879
             |   0.0043514
        exp2 |  -0.0004244
             |   0.0000888
         wks |   0.0008359
             |   0.0009393
          ed |   0.1022134
             |   0.0046744
       _cons |   4.3476807
             |   0.0443191

              Legend: b/se


The coefficient estimates and standard errors are exactly the same as those
obtained from areg, aside from the constant and the coefficient of the
time-invariant regressor ed, which is not separately identified from the
individual indicators. For areg (and xtreg, fe), the intercept is fit so that
$\bar{y} - \bar{\mathbf{x}}'\hat{\boldsymbol{\beta}} = 0$, whereas this is not the case using
regress. The cluster-robust standard errors are the same as those from areg,
and as already noted, those from xtreg, fe should be used.


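The complete output includes estimates $\hat{\alpha}_i = \bar{y}_i - \bar{\mathbf{x}}_i'\hat{\boldsymbol{\beta}}$, $i = 1, \ldots, N$.
Note that $\alpha_i$ is not consistently estimated in short panels, because it
essentially relies on only $T_i$ observations used to form $\bar{y}_i$ and $\bar{\mathbf{x}}_i$, but $\boldsymbol{\beta}$ is
nonetheless consistently estimated.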


For a short panel, it is much slower to use the LSDV method because it
requires inverting an $(N + K) \times (N + K)$ matrix, whereas xtreg, fe or
areg inverts a much smaller $K \times K$ matrix. And, as noted, xtreg, fe gives
correct cluster-robust standard errors.


8.5.5 Two-way fixed effects 


For short panels, the most common case of two-way fixed effects is to have
individual fixed effects and time fixed effects. Then simply give a command
such as xtreg y x i.time, fe, where time is the time variable. Then a
$(K + T) \times (K + T)$ matrix needs to be inverted because the FE estimator
eliminates the $N$ individual fixed effects.
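As an illustration of this command form, a two-way FE regression for the chapter's wage data might look as follows. This is a sketch under assumptions: the time variable is taken to be t (the variable used for sorting in section 8.9.1), and exp is left out because it appears to rise by exactly one each period for every individual in these data, which would make it collinear with the individual and time effects.

```stata
* Sketch: individual fixed effects plus a full set of time indicators
xtreg lwage exp2 wks i.t, fe vce(cluster id)
* Joint test that the time effects are zero
testparm i.t
```

Stata drops one time indicator as the base category, so the reported time coefficients are contrasts with the first period.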


In some cases, there may be a need to add many fixed effects in addition
to the individual fixed effects. Then the methods and community-contributed
commands such as reg2hdfe and felsdvreg presented in section 6.6.6 may
be used.


8.6 Between estimator 


The between estimator uses only between or cross-sectional variation in the
data and is the OLS estimator from the regression of $\bar{y}_i$ on $\bar{\mathbf{x}}_i$. Because only
cross-sectional variation in the data is used, the coefficients of any
individual-invariant regressors, such as time dummies, cannot be identified.
We provide the estimator for completeness, even though it is seldom used
because pooled estimators and the RE estimator are more efficient.


8.6.1 Between estimator


The between estimator is inconsistent in the FE model and is consistent in the
RE model. To see this, average the individual-effects model (8.1) to obtain
the between model


$$\bar{y}_i = \alpha + \bar{\mathbf{x}}_i'\boldsymbol{\beta} + (\alpha_i - \alpha + \bar{\varepsilon}_i)$$


The between estimator is the OLS estimator in this model. Consistency
requires that the error term $(\alpha_i - \alpha + \bar{\varepsilon}_i)$ be uncorrelated with $\mathbf{x}_{it}$. This is
the case if $\alpha_i$ is a random effect but not if $\alpha_i$ is a fixed effect.


8.6.2 Application of the xtreg, be command 


The between estimator is obtained by specifying the be option of the xtreg
command. There is no explicit option to obtain heteroskedasticity-robust
standard errors, but these can be obtained by using the vce(bootstrap)
option.
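A minimal sketch of the bootstrap route follows. The number of replications and the seed are arbitrary choices for illustration and are not taken from the text. The second part is a swapped-in alternative that the text does not mention: in a balanced panel, the between estimator is OLS on the individual means, so heteroskedasticity-robust standard errors can also be obtained by collapsing the data and using regress with vce(robust).

```stata
* Between estimator with bootstrap standard errors
* (reps() and seed() values are arbitrary choices for illustration)
xtreg lwage exp exp2 wks ed, be vce(bootstrap, reps(400) seed(10101))

* Alternative for a balanced panel: OLS on individual means with robust SEs
preserve
collapse (mean) lwage exp exp2 wks ed, by(id)
regress lwage exp exp2 wks ed, vce(robust)
restore
```

The coefficient estimates from both commands should match those from xtreg, be with default standard errors; only the standard errors differ.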


For our data, the bootstrap standard errors differ from the default by only 
10% because averages are used so that the complication is one of 
heteroskedastic errors rather than clustered errors. We report the default 
standard errors that are much more quickly computed. We have 


. * Between estimator with default standard errors
. xtreg lwage exp exp2 wks ed, be

Between regression (regression on group means)  Number of obs     =      4,165
Group variable: id                              Number of groups  =        595

R-squared:                                      Obs per group:
     Within  = 0.1357                                         min =          7
     Between = 0.3264                                         avg =        7.0
     Overall = 0.2723                                         max =          7

                                                F(4,590)          =      71.48
sd(u_i + avg(e_i.)) = .324656                   Prob > F          =     0.0000

       lwage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
         exp |    .038153   .0056967     6.70   0.000     .0269647    .0493412
        exp2 |  -.0006313   .0001257    -5.02   0.000    -.0008781   -.0003844
         wks |   .0130903   .0040659     3.22   0.001     .0051048    .0210757
          ed |   .0737838   .0048985    15.06   0.000     .0641632    .0834044
       _cons |   4.683039   .2100989    22.29   0.000     4.270407    5.095672


The estimates and standard errors are closer to those obtained from pooled
OLS than those obtained from within estimation.


8.7 Random-effects estimator


The RE estimator is the FGLS estimator in the RE model (8.1) under the assumption
that the random effect $\alpha_i$ is i.i.d. and the idiosyncratic error $\varepsilon_{it}$ is i.i.d. The RE
estimator is consistent if the RE model is appropriate and is inconsistent if the FE
model is appropriate.


8.7.1 RE estimator 


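The RE model is the individual-effects model (8.1)


$$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + (\alpha_i + \varepsilon_{it}) \tag{8.9}$$


with $\alpha_i \sim (\alpha, \sigma_\alpha^2)$ and $\varepsilon_{it} \sim (0, \sigma_\varepsilon^2)$. Then from (8.4), the combined error
$u_{it} = \alpha_i + \varepsilon_{it}$ is correlated over $t$ for given $i$ with


$$\mathrm{Cor}(u_{it}, u_{is}) = \sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2), \quad \text{for all } s \neq t \tag{8.10}$$


The RE estimator is the FGLS estimator of $\boldsymbol{\beta}$ in (8.9) given (8.10) for the error
correlations.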
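

In several different settings, such as heteroskedastic errors and AR(1) errors, the
FGLS estimator can be calculated as the OLS estimator in a model transformed to
have homoskedastic uncorrelated errors. This is also possible here. Some
considerable algebra shows that the RE estimator can be obtained by OLS estimation
in the transformed model


$$(y_{it} - \hat{\theta}_i \bar{y}_i) = (1 - \hat{\theta}_i)\alpha + (\mathbf{x}_{it} - \hat{\theta}_i \bar{\mathbf{x}}_i)'\boldsymbol{\beta} + \{(1 - \hat{\theta}_i)\alpha_i + (\varepsilon_{it} - \hat{\theta}_i \bar{\varepsilon}_i)\} \tag{8.11}$$


where $\hat{\theta}_i$ is a consistent estimate of


$$\theta_i = 1 - \sqrt{\sigma_\varepsilon^2/(T_i \sigma_\alpha^2 + \sigma_\varepsilon^2)}$$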


The RE estimator is consistent and fully efficient if the RE model is appropriate.
It is inconsistent if the FE model is appropriate because then correlation between $\mathbf{x}_{it}$
and $\alpha_i$ implies correlation between the regressors and the error in (8.11). Also, if
there are no fixed effects but the errors exhibit within-panel correlation, then the RE
estimator is consistent but inefficient, and cluster-robust standard errors should be
obtained.


The RE estimator uses both between and within variation in the data and has
special cases of pooled OLS ($\hat{\theta}_i = 0$) and within estimation ($\hat{\theta}_i = 1$). The RE
estimator approaches the within estimator as $T$ gets large and as $\sigma_\alpha^2$ gets large
relative to $\sigma_\varepsilon^2$ because in those cases $\hat{\theta}_i \to 1$.


8.7.2 The xtreg, re command 


Three closely related and asymptotically equivalent RE estimators can be obtained
by using the xtreg command (see section 8.2.4) with the re, mle, or pa option.
These estimators use different estimates of the variance components $\sigma_\varepsilon^2$ and $\sigma_\alpha^2$ and
hence different estimates $\hat{\theta}_i$ in the RE regression; see [XT] xtreg for the formulas.


The RE estimator uses unbiased estimates of the variance components and is
obtained by using the re option. The maximum likelihood estimator, under the
additional assumption of normally distributed $\alpha_i$ and $\varepsilon_{it}$, is computed by using the
mle option. The RE model implies the errors are equicorrelated or exchangeable (see
section 8.3.10), so xtreg with the pa and corr(exchangeable) options yields
asymptotically equivalent results.


For panel data, the RE estimator assumption of equicorrelated errors is usually
too strong. At the least, use the vce(cluster id) option to obtain cluster-robust
standard errors. And more efficient estimates can be obtained with xtreg, pa with a
better error structure than those obtained with the corr(exchangeable) option.


8.7.3 Application of the xtreg, re command 


For our data, xtreg, re yields


. * RE estimator with cluster-robust standard errors
. xtreg lwage exp exp2 wks ed, re vce(cluster id) theta

Random-effects GLS regression                   Number of obs     =      4,165
Group variable: id                              Number of groups  =        595

R-squared:                                      Obs per group:
     Within  = 0.6340                                         min =          7
     Between = 0.1716                                         avg =        7.0
     Overall = 0.1830                                         max =          7

                                                Wald chi2(4)      =    1598.50
corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000
theta        = .82280511

                                   (Std. err. adjusted for 595 clusters in id)

             |               Robust
       lwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
         exp |   .0888609   .0039992    22.22   0.000     .0810227    .0966992
        exp2 |  -.0007726   .0000896    -8.62   0.000    -.0009481    -.000597
         wks |   .0009658   .0009259     1.04   0.297     -.000849    .0027806
          ed |   .1117099   .0083954    13.31   0.000     .0952552    .1281647
       _cons |   3.829366   .1333931    28.71   0.000     3.567921    4.090812

     sigma_u |  .31951859
     sigma_e |  .15220316
         rho |  .81505521   (fraction of variance due to u_i)


Unlike the within estimator, the coefficient of the time-invariant regressor ed is now
estimated. The standard errors are somewhat smaller than those for the within
estimator because some between variation is also used. The entries sigma_u,
sigma_e, and rho, and the various $R^2$ measures, are explained in the next section.


The re, mle, and pa corr(exchangeable) options of xtreg yield asymptotically
equivalent estimators that differ in typical sample sizes. Comparison for these data
is left as an exercise.
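The comparison can be set up along the following lines. This is a sketch rather than output from the text: the stored-estimate names RE_gls, RE_mle, and RE_pa are hypothetical, and the equations(1) option is used so that estimates table lines up the main regression coefficients across the three fits.

```stata
* Three asymptotically equivalent RE estimators (default standard errors)
quietly xtreg lwage exp exp2 wks ed, re
estimates store RE_gls
quietly xtreg lwage exp exp2 wks ed, mle
estimates store RE_mle
quietly xtreg lwage exp exp2 wks ed, pa corr(exchangeable)
estimates store RE_pa
* Side-by-side coefficients and standard errors
estimates table RE_gls RE_mle RE_pa, equations(1) b(%9.5f) se
```

Any differences across the three columns reflect the different variance-component estimates used by each option.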


8.7.4 Correlated RE model 


Some econometricians have suggested reasons for narrowing the distinction
between RE and FE models by relaxing the assumption that the random effect ($\alpha_i$) is
purely random and uncorrelated with exogenous variables $\mathbf{x}_{it}$. Instead, an additional
assumption makes the random effect a linear function of observable exogenous
variables plus an error term.


A leading example of this approach, due to Mundlak (1978), and often referred
to as a "Mundlak correction", assumes $E(\alpha_i|\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}) = \bar{\mathbf{x}}_{1i}'\boldsymbol{\gamma}$, where $\bar{\mathbf{x}}_{1i}$ are
the time averages of the subset of regressors that have within variation, and defines
the individual-specific effect


$$\alpha_i = \bar{\mathbf{x}}_{1i}'\boldsymbol{\gamma} + \eta_i$$


where $\eta_i$ is an independent error. Then the model (8.9) becomes


$$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \bar{\mathbf{x}}_{1i}'\boldsymbol{\gamma} + (\eta_i + \varepsilon_{it})$$


This model, called a correlated RE model, is interpreted as an RE model in which the
RE assumptions hold conditionally on both $\mathbf{x}_{it}$ and $\bar{\mathbf{x}}_{1i}$. The addition of the extra
controls could make the RE assumption more acceptable.


This approach is especially useful for those nonlinear panel models for which 
there is no standard FE estimator in short panels because of the incidental parameters 
problem; see section 22.2.1. 


In the case of linear regression, RE estimation (and OLS estimation) with the 
Mundlak correction actually yields the same estimates as FE estimation. We have 


. * Mundlak correction: RE with individual-specific means added as regressors
. sort id

. foreach x of varlist exp exp2 wks ed {
  2.     by id: egen mean`x' = mean(`x')
  3. }

. xtreg lwage exp exp2 wks ed meanexp meanexp2 meanwks meaned, re vce(cluster id)
note: meaned omitted because of collinearity.

Random-effects GLS regression                   Number of obs     =      4,165
Group variable: id                              Number of groups  =        595

R-squared:                                      Obs per group:
     Within  = 0.6566                                         min =          7
     Between = 0.3264                                         avg =        7.0
     Overall = 0.4160                                         max =          7

                                                Wald chi2(7)      =    3237.73
corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000

                                   (Std. err. adjusted for 595 clusters in id)

             |               Robust
       lwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
         exp |   .1137879   .0040308    28.23   0.000     .1058876    .1216881
        exp2 |  -.0004244   .0000823    -5.16   0.000    -.0005856   -.0002632
         wks |   .0008359   .0008701     0.96   0.337    -.0008695    .0025412
          ed |   .0737838   .0052096    14.16   0.000     .0635732    .0839943
     meanexp |  -.0756349   .0075336   -10.04   0.000    -.0904005   -.0608693
    meanexp2 |  -.0002069   .0001695    -1.22   0.222    -.0005392    .0001254
     meanwks |   .0122544   .0043588     2.81   0.005     .0037113    .0207975
      meaned |          0  (omitted)
       _cons |   4.683039   .2443095    19.17   0.000     4.204201    5.161877

     sigma_u |  .31951859
     sigma_e |  .15220316
         rho |  .81505521   (fraction of variance due to u_i)


The coefficients of the individual-varying and time-varying regressors exp, exp2, 
and wks are identical to the FE estimates given in section 8.5, and the associated 
standard errors are very close. As explained earlier, the coefficient of meaned, the 
mean of the time-invariant regressor ed, is not identified, and this variable could 
have been omitted. 


Wooldridge (2021) extends the Mundlak regression to additionally include as
regressors time-period specific averages $\bar{\mathbf{x}}_{2t}$ and shows equivalence to the two-way
FE estimator that includes both individual and time-specific fixed effects.


8.8 Comparison of estimators 


Output from xtreg includes estimates of the standard deviation of the error
components and $R^2$ measures that measure within, between, and overall fit.
Prediction is possible using the predict postestimation command. We
present these estimates before turning to comparison of OLS, between, RE,
and within estimators.


8.8.1 Estimates of variance components 


Output from the fe, re, and mle options of xtreg includes estimates of the
standard deviations of the error components. The combined error in the
individual-effects model that we label $\alpha_i + \varepsilon_{it}$ is referred to as $u_i + e_{it}$ in
the Stata documentation and output. Thus, Stata output sigma_u gives the
standard deviation of the individual effect $\alpha_i$, and sigma_e gives the
standard deviation of the idiosyncratic error $\varepsilon_{it}$.


For the RE model estimates given in the previous section, the estimated
standard deviation of $\alpha_i$ is twice that of $\varepsilon_{it}$. So the individual-specific
component of the error (the random effect) is much more important than the
idiosyncratic error.


The output labeled rho equals the intraclass correlation of the error $\rho_u$
defined in (8.4). For the RE model, for example, the estimate of 0.815 is very
high. This is expected because, from section 8.3.9, the average
autocorrelation of the OLS residuals was computed to be around 0.80.


The theta option, available for the re option in the case of balanced
data, reports the estimate $\hat{\theta}_i = \hat{\theta}$. Because $\hat{\theta} = 0.823$ here, the RE estimates
will be much closer to the within estimates than to the OLS estimates. More
generally, in the unbalanced case, the matrix e(theta) saves the minimum,
5th percentile, median, 95th percentile, and maximum of $\hat{\theta}_1, \ldots, \hat{\theta}_N$.
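The reported theta and rho can be recovered from the variance-component estimates by using the formulas for $\theta_i$ in section 8.7.1 and for the error correlation in (8.10). The sketch below plugs in sigma_u = .31951859 and sigma_e = .15220316 from the RE output with T = 7; the scalar names are hypothetical.

```stata
* Variance-component estimates copied from the xtreg, re output
scalar sig_u = .31951859
scalar sig_e = .15220316
scalar Tper  = 7
* theta = 1 - sqrt(sigma_e^2/(T*sigma_u^2 + sigma_e^2))
scalar theta_hat = 1 - sqrt(sig_e^2/(Tper*sig_u^2 + sig_e^2))
* rho = sigma_u^2/(sigma_u^2 + sigma_e^2)
scalar rho_hat = sig_u^2/(sig_u^2 + sig_e^2)
display "theta = " %6.4f theta_hat "   rho = " %6.4f rho_hat
```

The displayed values agree with the theta of 0.8228 and the rho of 0.8151 in the RE output.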


8.8.2 Within and between R2 


The table header from xtreg provides three $R^2$ measures, computed using
the interpretation of $R^2$ as the squared correlation between the actual and
fitted values of the dependent variable, where the fitted values ignore the
contribution of $\hat{\alpha}_i$.


Let $\hat{\alpha}$ and $\hat{\boldsymbol{\beta}}$ be estimates obtained by one of the xtreg options (be, fe,
or re). Let $r^2(x, y)$ denote the squared correlation between $x$ and $y$. Then,


Within $R^2$:  $r^2\{(y_{it} - \bar{y}_i), (\mathbf{x}_{it}'\hat{\boldsymbol{\beta}} - \bar{\mathbf{x}}_i'\hat{\boldsymbol{\beta}})\}$
Between $R^2$: $r^2(\bar{y}_i, \bar{\mathbf{x}}_i'\hat{\boldsymbol{\beta}})$
Overall $R^2$: $r^2(y_{it}, \mathbf{x}_{it}'\hat{\boldsymbol{\beta}})$


The three $R^2$ measures are, respectively, 0.66, 0.03, and 0.05 for the
within estimator; 0.14, 0.33, and 0.27 for the between estimator; and 0.63,
0.17, and 0.18 for the RE estimator. So the within estimator best explains the
within variation ($R_w^2 = 0.66$), and the between estimator best explains the
between variation ($R_b^2 = 0.33$). The within estimator has a low $R_o^2 = 0.05$
and a much higher $R^2 = 0.91$ in section 8.5.4 because $R_o^2$ neglects $\hat{\alpha}_i$.
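The three definitions can be checked by hand after FE estimation. The following is a sketch, assuming the chapter's balanced wage panel is in memory; all generated variable names are hypothetical. In a balanced panel, correlating the individual means over all observations gives the same result as correlating them over one observation per individual.

```stata
* Sketch: within, between, and overall R-squared after xtreg, fe
quietly xtreg lwage exp exp2 wks ed, fe
predict double xb_fe, xb
* Individual means of the outcome and of the fitted values
by id, sort: egen double ybar_i = mean(lwage)
by id: egen double xbbar_i = mean(xb_fe)
* Deviations from individual means
generate double y_dev = lwage - ybar_i
generate double xb_dev = xb_fe - xbbar_i
quietly correlate y_dev xb_dev
display "Within R2  = " %6.4f r(rho)^2
quietly correlate ybar_i xbbar_i
display "Between R2 = " %6.4f r(rho)^2
quietly correlate lwage xb_fe
display "Overall R2 = " %6.4f r(rho)^2
```

The results should be close to the 0.66, 0.03, and 0.05 reported for the within estimator.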


8.8.3 Estimator comparison 


We compare some of the panel estimators and associated standard errors,
variance components estimates, and $R^2$. Pooled OLS is the same as the xtreg
command with the corr(independent) and pa options. We have


. * Compare OLS, BE, FE, RE estimators, and methods to compute standard errors
. global xlist exp exp2 wks ed

. qui regress lwage $xlist, vce(cluster id)

. estimates store OLS_rob

. qui xtreg lwage $xlist, be

. estimates store BE

. qui xtreg lwage $xlist, fe

. estimates store FE

. qui xtreg lwage $xlist, fe vce(robust)

. estimates store FE_rob

. qui xtreg lwage $xlist, re

. estimates store RE

. qui xtreg lwage $xlist, re vce(robust)

. estimates store RE_rob

. estimates table OLS_rob BE FE FE_rob RE RE_rob,
>     b se stats(N r2 r2_o r2_b r2_w sigma_u sigma_e rho) b(%7.4f)

    Variable |  OLS_rob        BE        FE    FE_rob        RE    RE_rob
         exp |   0.0447    0.0382    0.1138    0.1138    0.0889    0.0889
             |   0.0054    0.0057    0.0025    0.0040    0.0028    0.0040
        exp2 |  -0.0007   -0.0006   -0.0004   -0.0004   -0.0008   -0.0008
             |   0.0001    0.0001    0.0001    0.0001    0.0001    0.0001
         wks |   0.0058    0.0131    0.0008    0.0008    0.0010    0.0010
             |   0.0019    0.0041    0.0006    0.0009    0.0007    0.0009
          ed |   0.0760    0.0738 (omitted) (omitted)    0.1117    0.1117
             |   0.0052    0.0049                        0.0061    0.0084
       _cons |   4.9080    4.6830    4.5964    4.5964    3.8294    3.8294
             |   0.1400    0.2101    0.0389    0.0601    0.0936    0.1334
           N |     4165      4165      4165      4165      4165      4165
          r2 |   0.2836    0.3264    0.6566    0.6566
        r2_o |             0.2723    0.0476    0.0476    0.1830    0.1830
        r2_b |             0.3264    0.0276    0.0276    0.1716    0.1716
        r2_w |             0.1357    0.6566    0.6566    0.6340    0.6340
     sigma_u |                       1.0362    1.0362    0.3195    0.3195
     sigma_e |                       0.1522    0.1522    0.1522    0.1522
         rho |                       0.9789    0.9789    0.8151    0.8151

                                                              Legend: b/se


Several features emerge. The estimated coefficients vary considerably
across estimators, especially for the time-varying regressors. This reflects
quite different results according to whether within variation or between
variation is used. The within estimator did not provide a coefficient estimate
for the time-invariant regressor ed (it is reported as omitted).
Cluster-robust standard errors for the FE and RE models exceed the default
standard errors by one-third to one-half. The various $R^2$ measures and
variance-components estimates also vary considerably across models.


8.8.4 FE versus RE 


The essential distinction in microeconometrics analysis of panel data is that 
between FE and RE models. If effects are fixed, then the pooled OLS and RE 
estimators are inconsistent, and instead the within (or FE) estimator needs to 
be used. The within estimator is otherwise less desirable because using only 
within variation leads to less efficient estimation and inability to estimate 
coefficients of time-invariant regressors. 


To understand this distinction, consider the scalar regression of $y_{it}$ on $x_{it}$.
Consistency of the pooled OLS estimator requires that $E(u_{it}|x_{it}) = 0$ in the
model $y_{it} = \alpha + \beta x_{it} + u_{it}$. If this assumption fails so that $x_{it}$ is
endogenous, IV estimation can yield consistent estimates. It can be difficult
to find an instrument $z_{it}$ for $x_{it}$ that satisfies $E(u_{it}|z_{it}) = 0$.


Panel data provide an alternative way to obtain consistent estimates.
Introduce the individual-effects model $y_{it} = \alpha_i + \beta x_{it} + \varepsilon_{it}$. Consistency in
this model requires the weaker assumption that $E(\varepsilon_{it}|\alpha_i, x_{it}) = 0$.
Essentially, the error has two components: the time-invariant component $\alpha_i$
correlated with regressors that we can eliminate through differencing and a
time-varying component that, given $\alpha_i$, is uncorrelated with regressors.


The RE model adds an additional assumption to the individual-effects
model: $\alpha_i$ is distributed independently of $x_{it}$. This is a much stronger
assumption because it implies that $E(\varepsilon_{it}|\alpha_i, x_{it}) = E(\varepsilon_{it}|x_{it})$, so
consistency requires that $E(\varepsilon_{it}|x_{it}) = 0$, as assumed by the pooled OLS
model.


For individual-effects models, the fundamental issue is whether the 
individual effect is correlated with regressors. 


8.8.5 Hausman test for FE 


Under the null hypothesis that individual effects are random, the FE and RE
estimators should be similar because both are consistent. Under the
alternative, these estimators diverge. This juxtaposition is a natural setting
for a Hausman test (see section 11.9), comparing FE and RE estimators. The
test compares the estimable coefficients of time-varying regressors or can be
applied to a key subset of these (often one key regressor).


The hausman command 


For completeness we begin with the hausman command, which implements 
the standard form of the Hausman test. This command should be used only if 
inference is based on default standard errors. 


We have already stored the within estimates as FE and the RE estimates as
RE, so we can immediately implement the test. The default version of the
hausman FE RE command leads to a variance estimate
$\{\hat{V}(\hat{\boldsymbol{\beta}}_{FE}) - \hat{V}(\hat{\boldsymbol{\beta}}_{RE})\}$ that for these data is negative definite, so estimated
standard errors of $(\hat{\beta}_{j,FE} - \hat{\beta}_{j,RE})$ cannot be obtained. This problem can
arise because different estimates of the error variance are used in forming
$\hat{V}(\hat{\boldsymbol{\beta}}_{RE})$ and $\hat{V}(\hat{\boldsymbol{\beta}}_{FE})$. Similar issues arise for a Hausman test comparing
OLS and two-stage least-squares estimates.


This problem can be avoided by using the asymptotically equivalent 
sigmamore option, which specifies that both covariance matrices are based 
on the (same) estimated disturbance variance from the efficient estimator. 
We obtain 


. * Hausman test assuming RE estimator is fully efficient under null hypothesis
. hausman FE RE, sigmamore

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |       FE           RE         Difference       Std. err.
         exp |   .1137879     .0888609        .0249269        .0012778
        exp2 |  -.0004244    -.0007726        .0003482        .0000285
         wks |   .0008359     .0009658       -.0001299        .0001108

                          b = Consistent under H0 and Ha; obtained from xtreg.
           B = Inconsistent under Ha, efficient under H0; obtained from xtreg.

Test of H0: Difference in coefficients not systematic

    chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
            = 1513.02
Prob > chi2 = 0.0000


The output from hausman provides a nice side-by-side comparison. For the
coefficient of regressor exp, for example, a test of RE against FE yields
$t = 0.0249/0.00128 = 19.5$, a highly statistically significant difference. And
the overall statistic, here $\chi^2(3)$, has $p = 0.000$. This leads to strong rejection
of the null hypothesis that RE provides consistent estimates.


Cluster—robust Hausman test 


A serious shortcoming of the standard Hausman test is that it requires the RE
estimator to be efficient. This in turn requires that the $\alpha_i$ and $\varepsilon_{it}$ be i.i.d., an
invalid assumption if cluster-robust standard errors for the RE estimator
differ substantially from default standard errors. For our data example, and
in many applications, a robust version of the Hausman test is needed. There
is no official Stata command for this. A panel bootstrap Hausman test can be
conducted, using an adaptation of the bootstrap Hausman test example in
section 12.4.6.


Simpler is to test $H_0\colon \boldsymbol{\gamma} = \mathbf{0}$ in the auxiliary OLS regression


$$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \bar{\mathbf{x}}_{1i}'\boldsymbol{\gamma} + v_{it}$$


where $\mathbf{x}_1$ denotes only time-varying regressors. A Wald test of $\boldsymbol{\gamma} = \mathbf{0}$ can be
shown to be asymptotically equivalent to the standard test when the RE
estimator is fully efficient under $H_0$. A summary of related tests for FE
versus RE is given in Baltagi (2021, 85-92).


In the more likely case that the RE estimator is not fully efficient,
Wooldridge (2010, 332-323) proposes performing the Wald test using
cluster-robust standard errors. The individual averages $\bar{\mathbf{x}}_{1i}$ have already
been created for the correlated RE model in section 8.7.4. We have


. * Cluster-robust Hausman test using method of Wooldridge (2010)
. qui regress lwage $xlist meanexp meanexp2 meanwks, vce(cluster id)

. test meanexp meanexp2 meanwks

 ( 1)  meanexp = 0
 ( 2)  meanexp2 = 0
 ( 3)  meanwks = 0

       F(  3,   594) =  597.47
            Prob > F =    0.0000


The test strongly rejects the null hypothesis, and we conclude that the RE 
model is not appropriate. 


The community-contributed command xtoverid (Schaffer and
Stillman 2006) implements the preceding test following command xtreg,
re vce(robust). We have


. * Cluster-robust Hausman test using xtoverid command
. qui xtreg lwage $xlist, re vce(robust)

. xtoverid

Test of overidentifying restrictions: fixed vs random effects
Cross-section time-series model: xtreg re  robust cluster(id)
Sargan-Hansen statistic 1792.412  Chi-sq(3)    P-value = 0.0000


This presents a $\chi^2(q)$ version of the test that equals $q$ times the preceding
F-test statistic; here $3 \times 597.47 = 1792.41$.


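8.8.6 Prediction


The predict postestimation command after xtreg provides estimated
residuals and fitted values following estimation of the individual-effects
model $y_{it} = \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}$.


The estimated individual-specific error $\hat{\alpha}_i = \bar{y}_i - \bar{\mathbf{x}}_i'\hat{\boldsymbol{\beta}}$ is obtained by
using the u option; the estimated idiosyncratic error $\hat{\varepsilon}_{it} = y_{it} - \hat{\alpha}_i - \mathbf{x}_{it}'\hat{\boldsymbol{\beta}}$ is
obtained by using the e option; and the ue option gives $\hat{\alpha}_i + \hat{\varepsilon}_{it}$.


Fitted values of the dependent variable differ according to whether the
estimated individual-specific error is used. The fitted value $\hat{y}_{it} = \hat{\alpha} + \mathbf{x}_{it}'\hat{\boldsymbol{\beta}}$,
where $\hat{\alpha} = N^{-1}\sum_i \hat{\alpha}_i$, is obtained by using the xb option. The fitted value
$\hat{y}_{it} = \hat{\alpha}_i + \mathbf{x}_{it}'\hat{\boldsymbol{\beta}}$ is obtained by using the xbu option.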


As an example, we contrast OLS and RE in-sample fitted values.


. * Prediction after OLS and RE estimation
. qui regress lwage exp exp2 wks ed, vce(cluster id)

. predict xbols, xb

. qui xtreg lwage exp exp2 wks ed, re

. predict xbre, xb

. predict xbure, xbu

. summarize lwage xbols xbre xbure

    Variable |        Obs        Mean    Std. dev.       Min        Max
       lwage |      4,165    6.676346    .4615122    4.60517      8.537
       xbols |      4,165    6.676346    .2457572   5.850037   7.200861
        xbre |      4,165    6.676346    .6205324   5.028067    8.22958
       xbure |      4,165    6.676346    .4082951    5.29993   7.968179

. correlate lwage xbols xbre xbure
(obs=4,165)

             |    lwage    xbols     xbre    xbure
       lwage |   1.0000
       xbols |   0.5325   1.0000
        xbre |   0.4278   0.8034   1.0000
       xbure |   0.9375   0.6019   0.4836   1.0000


The RE prediction $\hat{\alpha} + \mathbf{x}_{it}'\hat{\boldsymbol{\beta}}$ is not as highly correlated with lwage as is the
OLS prediction (0.43 versus 0.53), which was expected because the OLS
estimator maximizes this correlation.


When instead we use $\hat{\alpha}_i + \mathbf{x}_{it}'\hat{\boldsymbol{\beta}}$ so the fitted individual effect is
included, the correlation of the prediction with lwage increases greatly to
0.94. In a short panel, however, these predictions are not consistent, because
each individual prediction $\hat{\alpha}_i = \bar{y}_i - \bar{\mathbf{x}}_i'\hat{\boldsymbol{\beta}}$ is based on only $T$ observations
and $T \not\to \infty$.
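The relationships among the predict options can be verified directly. The sketch below assumes the chapter's wage panel is in memory with no missing values in the estimation sample; the variable names ending in _chk are hypothetical.

```stata
* Sketch: identities linking the predict options after xtreg, re
quietly xtreg lwage exp exp2 wks ed, re
predict double xb_chk, xb
predict double u_chk, u
predict double e_chk, e
predict double xbu_chk, xbu
predict double ue_chk, ue
* xbu = xb + u, ue = u + e, and lwage = xbu + e (up to rounding)
assert abs(xbu_chk - (xb_chk + u_chk)) < 1e-6
assert abs(ue_chk - (u_chk + e_chk)) < 1e-6
assert abs(lwage - (xbu_chk + e_chk)) < 1e-6
```

If the identities hold, the assert commands complete silently; any violation stops execution with an error.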


8.9 First-difference estimator 


Consistent estimation of $\boldsymbol{\beta}$ in the FE model requires eliminating the $\alpha_i$. One
way to do so is to mean-difference, yielding the within estimator. An
alternative way is to first-difference, leading to the FD estimator.


For analysis of static models, the FE estimator is traditionally favored for
two reasons. First, in a balanced panel, it is generally more efficient because
it is the efficient estimator if the $\varepsilon_{it}$ are i.i.d., whereas the FD estimator is
efficient under the less realistic assumption that $(\varepsilon_{it} - \varepsilon_{i,t-1})$ is i.i.d.
Second, with unbalanced data more observations are lost using the FD
estimator. For example, if $T = 4$ and for individual $i$ only observations 1 and
3 are available, then we cannot compute $(y_{it} - y_{i,t-1})$ for any $t$, but we can
compute $(y_{i3} - \bar{y}_i)$, where $\bar{y}_i = (y_{i1} + y_{i3})/2$.


The FD estimator has the advantage of relying on weaker exogeneity
assumptions, explained below, that become important in dynamic models,
with lagged values of $y_{it}$ as regressor, presented in the next chapter.


8.9.1 FD estimator 


The FD estimator is obtained by performing OLS on the first-differenced
variables


$$(y_{it} - y_{i,t-1}) = (\mathbf{x}_{it} - \mathbf{x}_{i,t-1})'\boldsymbol{\beta} + (\varepsilon_{it} - \varepsilon_{i,t-1}) \tag{8.12}$$


First-differencing has eliminated $\alpha_i$, so OLS estimation of this model leads to
consistent estimates of $\boldsymbol{\beta}$ in the FE model. The coefficients of time-invariant
regressors are not identified, because then $x_{it} - x_{i,t-1} = 0$, as was the case
for the within estimator.


The FD estimator is not provided as an option to xtreg. Instead, the
estimator can be computed by using regress and Stata time-series operators
to compute the first differences. We have


. sort id t

. * FD estimator with cluster-robust standard errors
. regress D.(lwage exp exp2 wks ed), vce(cluster id)
note: D.exp omitted because of collinearity.
note: D.ed omitted because of collinearity.

Linear regression                               Number of obs     =      3,570
                                                F(2, 594)         =      22.66
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0041
                                                Root MSE          =     .18156

                                   (Std. err. adjusted for 595 clusters in id)

             |               Robust
     D.lwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
         exp |
         D1. |          0  (omitted)
        exp2 |
         D1. |  -.0005321   .0000808    -6.58   0.000    -.0006908   -.0003734
         wks |
         D1. |  -.0002683   .0011783    -0.23   0.820    -.0025824    .0020459
          ed |
         D1. |          0  (omitted)
       _cons |   .1170654   .0040974    28.57   0.000     .1090182    .1251126


As expected, the coefficient for education is not identified, because ed here
is time invariant. The coefficient for wks actually changes sign compared
with the other estimators, though it is highly statistically insignificant.


The FD estimator, like the within estimator, provides consistent
estimators when the individual effects are fixed. For panels with $T = 2$, the
FD and within estimators are equivalent; otherwise, the two differ. For static
models, the FE estimator is used because it is the efficient estimator if the
idiosyncratic error $\varepsilon_{it}$ is i.i.d.
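The equivalence for $T = 2$ can be checked numerically. The following is a minimal sketch, assuming that mus208psid.dta is in the working directory and that its panel and time identifiers are id and t, as used elsewhere in this chapter. The FD regression is run without a constant because the within transformation with two periods reduces to OLS of the differenced data through the origin.

```stata
* Sketch: FD and within slope coefficients coincide when T = 2
* Assumption: mus208psid.dta is available, with identifiers id and t
use mus208psid, clear
keep if t <= 2                  // retain only the first two periods
xtset id t

* Within (FE) estimator on the two-period panel
xtreg lwage exp exp2 wks, fe vce(cluster id)

* FD estimator without a constant: slope estimates should match xtreg, fe
regress D.(lwage exp exp2 wks), noconstant vce(cluster id)
```

Only the coefficients are expected to agree exactly; the reported standard errors can differ slightly because the two commands apply different degrees-of-freedom adjustments.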


The FD estimator seemingly uses one fewer year of data compared with the
within estimator because the FD output lists 3,570 observations rather than
4,165. This, however, is misleading. Using the LSDV interpretation of the
within estimator, the within estimator essentially loses 595 observations by
estimating the $N = 595$ fixed effects $\alpha_1, \ldots, \alpha_N$.
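The two observation counts can be reconciled with a line of arithmetic. The sketch below assumes the balanced panel used in this chapter, with 595 individuals observed for 7 years; the symbols $N$ and $T$ are used here only for this calculation.

```latex
\begin{aligned}
\text{within (LSDV):}\quad & NT - N = 595 \times 7 - 595 = 4{,}165 - 595 = 3{,}570 \\
\text{FD:}\quad & N(T-1) = 595 \times 6 = 3{,}570
\end{aligned}
```

Once the fixed effects are accounted for, both estimators rest on the same number of observations.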


The xtivreg command presented in section 9.2.2 has an option for
FD estimation, as does the community-contributed xtivreg2 command. For the
current example, the command xtivreg2 lwage exp
exp2 wks ed, fd cluster(id) small yields exactly the same results.


8.9.2 Strict and weak exogeneity 


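From (8.6), the within estimator requires that $\varepsilon_{it} - \bar{\varepsilon}_i$ be uncorrelated with
$\mathbf{x}_{it} - \bar{\mathbf{x}}_i$. This is the case under the assumption of strict exogeneity or strong
exogeneity that


$$E(\varepsilon_{it} \mid \alpha_i, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}) = 0$$


From (8.12), the FD estimator requires that $\varepsilon_{it} - \varepsilon_{i,t-1}$ be uncorrelated with
$\mathbf{x}_{it} - \mathbf{x}_{i,t-1}$. This is the case under the assumption of weak or sequential
exogeneity that


$$E(\varepsilon_{it} \mid \alpha_i, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{it}) = 0$$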


This is a considerably weaker assumption because it permits future values of 
the regressors to be correlated with the error, as will be the case if the 
regressor is a lagged dependent variable. 


As long as there is no feedback from the idiosyncratic shock today to a 
covariate tomorrow, this distinction is unnecessary when fitting static 
models. It becomes important for dynamic models (see section 9.4) because 
then strict exogeneity no longer holds and we turn to the FD estimator. 


8.10 Panel-data management 


Stata xt commands require panel data to be in long form, which means that
each individual-time pair is a separate observation. Some datasets instead
store panel data in wide form, which has the advantage of using less space. 
Sometimes, the observational unit is the individual, and a single observation 
has all time periods for that individual. And sometimes the observational 
unit is a time period, and a single observation has all individuals for that 
time period. 


We illustrate how to move from wide form to long form and vice versa 
by using the reshape command. Our example is for panel data, but reshape 
can also be used in other contexts where data are grouped, such as clustered 
data grouped by village rather than panel data grouped by time. 


For data management commands for unbalanced samples, see 
section 19.11. 


8.10.1 Wide-form data 


We consider a small dataset that is originally in wide form, with each 
observation containing all years of data for an individual. The dataset is a 
subset of the data described in section 9.5.1. Each observation is a state and 
has all years of data for that state. We have 


. * Wide-form data (observation is a state) 
. qui use mus208cigarwide, clear 


. list, clean 
       state   lnp63   lnc63   lnp64   lnc64   lnp65   lnc65
  1.       1     4.5     4.5     4.6     4.6     4.5     4.6
  2.       2     4.4     4.8     4.3     4.8     4.3     4.8
  3.       3     4.5     4.6     4.5     4.6     4.5     4.6
  4.       4     4.4     5.0     4.4     4.9     4.4     4.9
  5.       5     4.5     5.1     4.5     5.0     4.5     5.0
  6.       6     4.5     5.1     4.5     5.1     4.5     5.1
  7.       7     4.3     5.5     4.3     5.5     4.3     5.5
  8.       8     4.5     4.9     4.6     4.8     4.5     4.9
  9.       9     4.5     4.7     4.5     4.7     4.6     4.6
 10.      10     4.5     4.6     4.6     4.5     4.5     4.6


The data contain a state identifier, state; three years of data on log price,
lnp63-lnp65; and three years of data on log sales, lnc63-lnc65. The data
are for 10 states.


8.10.2 Convert wide form to long form 


The data can be converted from wide form to long form by using reshape
long. The desired dataset will have an observation as a state-year pair. The
variables should be a state identifier, a year identifier, and the current
state-year observations on lnp and lnc.


The simple command reshape long can do this automatically when the
dataset still carries the settings of an earlier reshape, because it then
interprets the suffixes 63-65 as denoting the grouping that needs
to be expanded to long form. We use a more detailed version of the
command that spells out exactly what we want to do and leads to exactly the
same result as reshape long without arguments. We have


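. * Convert from wide form to long form (observation is a state-year pair)
. reshape long lnp lnc, i(state) j(year)
(j = 63 64 65)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations               10   ->   30
Number of variables                   7   ->   4
j variable (3 values)                     ->   year
xij variables:
                      lnp63 lnp64 lnp65   ->   lnp
                      lnc63 lnc64 lnc65   ->   lnc
-----------------------------------------------------------------------------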


The output indicates that we have expanded the dataset from 10 observations
(10 states) to 30 observations (30 state-year pairs). A year-identifier
variable, year, has been created. The wide-form data lnp63-lnp65 have
been collapsed to lnp in long form, and lnc63-lnc65 have been collapsed to
lnc.


We now list the first six observations of the new long-form data. 


. * Long-form data (observation is a state-year pair)
. list in 1/6, sepby(state) 


state year lnp lnc 
1. 1 63 4.5 4.5 
2. 1 64 4.6 4.6 
3. 1 65 4.5 4.6 
4. 2 63 4.4 4.8 
5. 2 64 4.3 4.8 
6. 2 65 4.3 4.8 


Any year-invariant variables will also be included in the long-form data. 
Here the state-identifier variable, state, is the only such variable. 
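The conversion can also be tried without the book dataset. The following is a minimal sketch that types in a two-state, two-year extract by hand (the rounded values are copied from the first two rows of the listing in section 8.10.1), reshapes it to long form, and then declares the panel structure so that xt commands can be used.

```stata
* Sketch: hand-entered wide-form extract, reshaped to long form
* (values are the rounded ones displayed in section 8.10.1)
clear
input state lnp63 lnc63 lnp64 lnc64
1 4.5 4.5 4.6 4.6
2 4.4 4.8 4.3 4.8
end

reshape long lnp lnc, i(state) j(year)   // 2 wide rows become 4 long rows
xtset state year                         // declare panel and time identifiers
xtdescribe                               // check that the panel is balanced
list, sepby(state)
```

Spelling out i() and j() is required here because a freshly entered dataset carries no earlier reshape settings.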


8.10.3 Convert long form to wide form 


Going the other way, data can be converted from long form to wide form by
using reshape wide. The desired dataset will have an observation as a state.
The constructed variables should be a state identifier and observations on
lnp and lnc for each of the three years 63-65.


The reshape wide command without arguments actually does this
automatically because it interprets year as the relevant time identifier and
adds suffixes 63-65 to the variables lnp and lnc, which are varying with
year. We use a more detailed version of the command that spells out exactly
what we want to do and leads to exactly the same result. We have


. * Reconvert from long form to wide form (observation is a state) 
. reshape wide lnp lnc, i(state) j(year) 
(j = 63 64 65) 


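Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations               30   ->   10
Number of variables                   4   ->   7
j variable (3 values)              year   ->   (dropped)
xij variables:
                                    lnp   ->   lnp63 lnp64 lnp65
                                    lnc   ->   lnc63 lnc64 lnc65
-----------------------------------------------------------------------------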


The output indicates that we have collapsed the dataset from 30 observations
(30 state-year pairs) to 10 observations (10 states). The year variable has
been dropped. The long-form data lnp has been expanded to lnp63-lnp65 in
wide form, and lnc has been expanded to lnc63-lnc65.


A complete listing of the wide-form dataset is 


. list, clean 


       state   lnp63   lnc63   lnp64   lnc64   lnp65   lnc65
  1.       1     4.5     4.5     4.6     4.6     4.5     4.6
  2.       2     4.4     4.8     4.3     4.8     4.3     4.8
  3.       3     4.5     4.6     4.5     4.6     4.5     4.6
  4.       4     4.4     5.0     4.4     4.9     4.4     4.9
  5.       5     4.5     5.1     4.5     5.0     4.5     5.0
  6.       6     4.5     5.1     4.5     5.1     4.5     5.1
  7.       7     4.3     5.5     4.3     5.5     4.3     5.5
  8.       8     4.5     4.9     4.6     4.8     4.5     4.9
  9.       9     4.5     4.7     4.5     4.7     4.6     4.6
 10.      10     4.5     4.6     4.6     4.5     4.5     4.6


This is exactly the same as the original mus208cigarwide.dta dataset, listed 
in section 8.10.1. 


8.10.4 An alternative wide-form data 


The wide form we considered had each state as the unit of observation. An
alternative is that each year is the observation. Then the preceding
commands are reversed so that we have i(year) j(state) rather than
i(state) j(year).


To demonstrate this case, we first need to create the data in wide form 
with year as the observational unit. We do so by converting the current data, 
in wide form with state as the observational unit, to long form with 30 
observations as presented above and then use reshape wide to create wide- 
form data with year as the observational unit. 


. * Create alternative wide-form data (observation is a year)
. qui reshape long lnp lnc, i(state) j(year)

. reshape wide lnp lnc, i(year) j(state)
(j = 1 2 3 4 5 6 7 8 9 10)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations               30   ->   3
Number of variables                   4   ->   21
j variable (10 values)            state   ->   (dropped)
xij variables:
                                    lnp   ->   lnp1 lnp2 ... lnp10
                                    lnc   ->   lnc1 lnc2 ... lnc10
-----------------------------------------------------------------------------

. list year lnp1 lnp2 lnc1 lnc2, clean

       year   lnp1   lnp2   lnc1   lnc2
  1.     63    4.5    4.4    4.5    4.8
  2.     64    4.6    4.3    4.6    4.8
  3.     65    4.5    4.3    4.6    4.8


The wide form has 3 observations (1 per year) and 21 variables (lnp and lnc
for each of 10 states plus year).


We now have data in wide form with year as the observational unit. To
use xt commands, we use reshape long to convert to long-form data with
an observation for each state-year pair. We have


. * Convert from wide form (observation is year) to long form (year-state)
. reshape long lnp lnc, i(year) j(state)
(j = 1 2 3 4 5 6 7 8 9 10)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations                3   ->   30
Number of variables                  21   ->   4
j variable (10 values)                    ->   state
xij variables:
                    lnp1 lnp2 ... lnp10   ->   lnp
                    lnc1 lnc2 ... lnc10   ->   lnc
-----------------------------------------------------------------------------

. list in 1/6, clean

       year   state   lnp   lnc
  1.     63       1   4.5   4.5
  2.     63       2   4.4   4.8
  3.     63       3   4.5   4.6
  4.     63       4   4.4   5.0
  5.     63       5   4.5   5.1
  6.     63       6   4.5   5.1


The data are now in long form, as in section 8.10.2. 


8.11 Additional resources 


FE and RE estimators appear in many econometrics texts; Wooldridge (2010)
in particular has very extensive coverage. Standard panel texts are
Baltagi (2021) and Hsiao (2014). The key Stata reference is the [XT] Stata
Longitudinal/Panel-Data Reference Manual, especially [XT] xt and
[XT] xtreg. Useful online help categories include xt and xtreg.


8.12 Exercises 


1. For the data of section 8.3, use xtsum to describe the variation in occ,
smsa, ind, ms, union, fem, and blk. Which of these variables are time
invariant? Use xttab and xttrans to provide interpretations of how
occ changes for individuals over the seven years. Provide a time-series
plot of exp for the first 10 observations, and provide interpretation.
Provide a scatterplot of lwage against ed. Is this plot showing within
variation, between variation, or both?

2. For the data of section 8.3, manually obtain the three standard
deviations of lwage given by the xtsum command. For the overall
standard deviation, use summarize. For the between standard
deviation, compute by id: egen meanwage = mean(lwage), and apply
summarize to (meanwage-grandmean) for t==1, where grandmean is
the grand mean over all observations. For the within standard
deviation, apply summarize to (lwage-meanwage). Compare your
standard deviations with those from xtsum. Does $s_O^2 \approx s_B^2 + s_W^2$?

3. For the model and data of section 8.4, compare PFGLS estimators under
the following assumptions about the error process: independent,
exchangeable, AR(2), and MA(6). Also, compare the associated
standard-error estimates obtained by using default standard errors and
by using cluster-robust standard errors. You will find it easiest if you
combine results using estimates table. What happens if you try to fit
the model with no structure placed on the error correlations?

4. For the model and data of section 8.5, obtain the within estimator by
applying regress to (8.7). Hint: For example, for variable x, type by
id: egen avex = mean(x) followed by summarize x and then generate
mdx = x - avex + r(mean). Verify that you get the same estimated
coefficients as you would with xtreg, fe.

5. For the model and data of section 8.6, compare the RE estimators that
were obtained by using xtreg with the re, mle, and pa options and
xtgee with the corr(exchangeable) option. Also, compare the
associated standard-error estimates obtained by using default standard
errors and by using cluster-robust standard errors. You will find it
easiest if you combine results using estimates table.


6. Consider the RE model output given in section 8.7. Verify that, given
the estimated values of sigma_e and sigma_u, application of the
formulas in that section leads to the estimated values of rho and
theta.


Chapter 9 
Linear panel-data models: Extensions 


9.1 Introduction 


The essential panel methods for linear models, most notably, the important 
distinction between fixed-effects (FE) and random-effects (RE) models, were 
presented in chapter 8. 


In this chapter, we present other panel methods for the linear model. For
short panels, we consider instrumental-variables (IV) estimation and
estimation of dynamic models with lagged dependent variables as
regressors. We also provide a brief discussion of estimation of long panels.
Nonlinear panel models are presented in chapter 22.


9.2 Panel instrumental-variables estimation 


IV methods have been extended from cross-sectional data (see chapter 7 for
an explanation) to panel data. Estimation still needs to eliminate the $\alpha_i$ if the
FE model is appropriate, and inference needs to control for the clustering
inherent in panel data.


In this section, we detail xtivreg, which is a panel extension of the
cross-sectional command ivregress. The subsequent two sections present
more specialized IV estimators and commands that are applicable in
situations where regressors from periods other than the current period are
used as instruments.


9.2.1 Panel IV 


If a pooled model is appropriate with $y_{it} = \alpha + \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}$ and instruments
$\mathbf{z}_{it}$ exist satisfying $E(u_{it} \mid \mathbf{z}_{it}) = 0$, then consistent estimation is possible by
two-stage least-squares (2SLS) regression of $y_{it}$ on $\mathbf{x}_{it}$ with instruments $\mathbf{z}_{it}$.
The ivregress command can be used, with subsequent statistical inference
based on cluster-robust standard errors.


More often, we use the individual-effects model 


$$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i + \varepsilon_{it}, \quad t = 1, \ldots, T, \quad i = 1, \ldots, N \tag{9.1}$$


which has two error components, $\alpha_i$ and $\varepsilon_{it}$. The FE and first-difference (FD)
estimators provide consistent estimates of the coefficients of the
time-varying regressors under a limited form of endogeneity of the regressors:
$\mathbf{x}_{it}$ may be correlated with the fixed effects $\alpha_i$ but not with $\varepsilon_{it}$.


We now consider a richer type of endogeneity, with $\mathbf{x}_{it}$ correlated with
$\varepsilon_{it}$. We need to assume the existence of instruments $\mathbf{z}_{it}$ that are correlated
with $\mathbf{x}_{it}$ and uncorrelated with $\varepsilon_{it}$. The panel IV procedure is to suitably
transform the individual-effects model to control for $\alpha_i$ and then apply IV to
the transformed model.


9.2.2 The xtivreg command 


The xtivreg command implements 2SLS regression with options
corresponding to those for the xtreg command. The syntax is similar to that
for the cross-sectional ivregress command:


xtivreg depvar [varlist1] (varlist2 = varlist_iv) [if] [in] [, options]


The four main options are fe, fd, re, and be. The fe option performs within
2SLS regression of $y_{it} - \bar{y}_i$ on an intercept and $\mathbf{x}_{it} - \bar{\mathbf{x}}_i$ with the instruments
$\mathbf{z}_{it} - \bar{\mathbf{z}}_i$. The fd option, not available in xtreg, performs FD 2SLS regression
of $y_{it} - y_{i,t-1}$ on an intercept and $\mathbf{x}_{it} - \mathbf{x}_{i,t-1}$ with the instruments
$\mathbf{z}_{it} - \mathbf{z}_{i,t-1}$. The re option performs RE 2SLS regression of $y_{it} - \widehat{\theta}_i\bar{y}_i$ on an
intercept and $\mathbf{x}_{it} - \widehat{\theta}_i\bar{\mathbf{x}}_i$ with the instruments $\mathbf{z}_{it} - \widehat{\theta}_i\bar{\mathbf{z}}_i$, and the additional
options ec2sls and nosa provide variations of this estimator. The be option
performs between 2SLS regression of $\bar{y}_i$ on $\bar{\mathbf{x}}_i$ with instruments $\bar{\mathbf{z}}_i$. Other
options include first to report first-stage regression results and regress to
ignore the instruments and instead estimate the parameters of the
transformed model by ordinary least squares (OLS). The vce() options
include heteroskedastic-robust and cluster-robust standard errors.


The community-contributed xtivreg2 command (Schaffer 2005)
provides a broader range of IV-related estimators, including 2SLS, GMM, LIML,
k-class and continuous updating, for FE and FD models. The command
includes tests of weak instruments and overidentifying restrictions.


As usual, exogenous regressors are instrumented by themselves. For
endogenous regressors, we can proceed as in the cross-sectional case,
obtaining an additional variable that does not directly determine $y_{it}$ but is
correlated with the variable being instrumented. In the simplest case, the
instrument is an external instrument, a variable that does not appear directly
as a regressor in the model. This is the same IV identification strategy as that
used with cross-sectional data.


9.2.3 Application of the xtivreg command 


Consider the section 8.3 example of regression of lwage on exp, exp2, wks, 
and ed. We assume the experience variables exp and exp2 are exogenous and 
that ed is correlated with the time-invariant component of the error but is 
uncorrelated with the time-varying component of the error. Given just these 
assumptions, we need to control for fixed effects. From section 8.6, the 
within estimator yields consistent estimates of coefficients of exp, exp2, and 
wks, whereas the coefficient of ed is not identified because it is a time- 
invariant regressor. 


Now suppose that the regressor wks is correlated with the time-varying 
component of the error. Then the within estimator becomes inconsistent, and 
we need to instrument for wks. We suppose that ms (marital status) is a 
suitable instrument. This requires an assumption that marital status does not 
directly determine the wage rate but is correlated with weeks worked. 
Because the effects here are fixed, the fe or fd option of xtivreg needs to
be used.


Formally, we have assumed that the instruments (here exp, exp2, and ms)
satisfy the strong exogeneity assumption that


$$E(\varepsilon_{it} \mid \alpha_i, \mathbf{z}_{i1}, \ldots, \mathbf{z}_{it}, \ldots, \mathbf{z}_{iT}) = 0$$


so that instruments and errors are uncorrelated in all periods. One
consequence of this strong assumption is that panel IV estimators are
consistent even if the $\varepsilon_{it}$ are serially correlated, so cluster-robust standard
errors could be used.


We use xtivreg with the fe option to eliminate the fixed effects. We
drop the unidentified time-invariant regressor ed; the same results are
obtained if it is included. We obtain


. * Panel IV example: FE with wks instrumented by external instrument ms 
. qui use mus208psid 


. xtivreg lwage exp exp2 (wks = ms), fe vce(robust) 


Fixed-effects (within) IV regression            Number of obs     =      4,165
Group variable: id                              Number of groups  =        595

R-squared:                                      Obs per group:
     Within  =      .                                         min =          7
     Between = 0.0172                                         avg =        7.0
     Overall = 0.0284                                         max =          7

                                                Wald chi2(3)      =   12295.89
corr(u_i, Xb) = -0.8499                         Prob > chi2       =     0.0000

                                   (Std. err. adjusted for 595 clusters in id)
------------------------------------------------------------------------------
             |               Robust
       lwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         wks |  -.1149742   .3707276    -0.31   0.756    -.8415868    .6116385
         exp |   .1408101   .0902921     1.56   0.119    -.0361591    .3177794
        exp2 |  -.0011207   .0022925    -0.49   0.625    -.0056139    .0033726
       _cons |    9.83932   16.75004     0.59   0.557    -22.99015    42.66879
-------------+----------------------------------------------------------------
     sigma_u |  1.0980369
     sigma_e |  .51515503
         rho |  .81959748   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Instrumented: wks
 Instruments: exp exp2 ms


The estimates imply that, surprisingly, wages decrease by 11.5% for each
additional week worked, though the coefficient is statistically insignificant.
Wages increase with experience until a peak at 64 years
[$= 0.1408/(2 \times 0.0011)$].


Comparing the IV results with those given in section 8.5.3 using xtreg,
fe, we see the coefficient of the endogenous variable wks has changed sign
and is many times larger in absolute value, whereas the coefficients of the
exogenous experience regressors are less affected. For these data, the IV
standard errors are more than 10 times larger. Because the instrument ms is
not very correlated with wks, IV regression leads to a substantial loss in
estimator efficiency.
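The weak correlation between ms and wks can be inspected directly. The following is a minimal sketch, assuming mus208psid.dta is in the working directory with identifiers id and t; it fits the within-transformed first-stage regression by hand and tests the excluded instrument. This is an informal check, not a substitute for formal weak-instrument diagnostics.

```stata
* Sketch: first-stage check of instrument strength for the FE IV model
* Assumption: mus208psid.dta is available, with identifiers id and t
use mus208psid, clear
xtset id t

* Within regression of the endogenous regressor on all instruments
xtreg wks exp exp2 ms, fe vce(cluster id)

* Test of the excluded instrument ms; a small statistic suggests a weak instrument
test ms
```

The first option of xtivreg reports first-stage results as part of the IV estimation itself.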


9.2.4 Panel IV extensions 


We used the external instrument ms as the instrument for wks. 


An alternative is to use wks from a period other than the current period as
the instrument. This has the attraction of being reasonably highly correlated
with the variable being instrumented, but it is not necessarily a valid
instrument. In the simplest panel model $y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}$, if the errors $\varepsilon_{it}$ are
independent, then any variable in any period that is not in $\mathbf{x}_{it}$ is a valid
instrument. Once we introduce an individual effect as in (9.1) and mean-
transform the model, more care is needed.


The next two sections present, respectively, the Hausman-Taylor
estimator and the Arellano-Bond estimator, which use as instruments
regressors from periods other than the current period. In some settings, a
Bartik or shift-share instrument can be used; see Goldsmith-Pinkham,
Sorkin, and Swift (2020).


9.3 Hausman-Taylor estimator


We consider the FE model. The FE and FD estimators provide consistent
estimators but not for the coefficients of time-invariant regressors because
these are then not identified. The Hausman-Taylor estimator is an IV
estimator that additionally enables the coefficients of time-invariant
regressors to be estimated. It does so by making the stronger assumption that
some specified regressors are uncorrelated with the fixed effects. Then
values of these regressors in periods other than the current period can be
used as instruments.


9.3.1 Hausman-Taylor estimator


The key step is to distinguish between regressors uncorrelated with the fixed 
effects and those potentially correlated with the fixed effects. The method 
additionally distinguishes between time-varying and time-invariant 
regressors. 


The individual-effects model is then written as 


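$$y_{it} = \mathbf{x}_{1it}'\boldsymbol{\beta}_1 + \mathbf{x}_{2it}'\boldsymbol{\beta}_2 + \mathbf{w}_{1i}'\boldsymbol{\gamma}_1 + \mathbf{w}_{2i}'\boldsymbol{\gamma}_2 + \alpha_i + \varepsilon_{it} \tag{9.2}$$


where regressors with subscript 1 are specified to be uncorrelated with $\alpha_i$
and regressors with subscript 2 are specified to be correlated with $\alpha_i$, $\mathbf{w}$
denotes time-invariant regressors, and $\mathbf{x}$ now denotes time-varying
regressors. All regressors are assumed to be uncorrelated with $\varepsilon_{it}$, whereas
xtivreg explicitly deals with such correlation.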


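The Hausman-Taylor method is based on the RE transformation that
leads to the model


$$\widetilde{y}_{it} = \widetilde{\mathbf{x}}_{1it}'\boldsymbol{\beta}_1 + \widetilde{\mathbf{x}}_{2it}'\boldsymbol{\beta}_2 + \widetilde{\mathbf{w}}_{1i}'\boldsymbol{\gamma}_1 + \widetilde{\mathbf{w}}_{2i}'\boldsymbol{\gamma}_2 + \widetilde{\alpha}_i + \widetilde{\varepsilon}_{it}$$


where, for example, $\widetilde{\mathbf{x}}_{1it} = \mathbf{x}_{1it} - \widehat{\theta}_i\bar{\mathbf{x}}_{1i}$ and the formula for $\widehat{\theta}_i$ is given in
[XT] xthtaylor.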


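The RE transformation is used because, unlike the within transform, here
$\widetilde{\mathbf{w}}_{1i} \neq \mathbf{0}$ and $\widetilde{\mathbf{w}}_{2i} \neq \mathbf{0}$, so $\boldsymbol{\gamma}_1$ and $\boldsymbol{\gamma}_2$ can be estimated. But
$\widetilde{\alpha}_i = \alpha_i(1 - \widehat{\theta}_i) \neq 0$, so the fixed effect has not been eliminated, and $\widetilde{\alpha}_i$ is
correlated with $\widetilde{\mathbf{x}}_{2it}$ and with $\widetilde{\mathbf{w}}_{2i}$. This correlation is dealt with by IV
estimation. For $\widetilde{\mathbf{x}}_{2it}$, the instrument used is $\ddot{\mathbf{x}}_{2it} = \mathbf{x}_{2it} - \bar{\mathbf{x}}_{2i}$, which can be
shown to be uncorrelated with $\widetilde{\alpha}_i$. For $\widetilde{\mathbf{w}}_{2i}$, the instrument is $\bar{\mathbf{x}}_{1i}$, so the
method requires that the number of time-varying exogenous regressors be at
least as large as the number of time-invariant endogenous regressors. The
method uses $\ddot{\mathbf{x}}_{1it} = \mathbf{x}_{1it} - \bar{\mathbf{x}}_{1i}$ as an instrument for $\widetilde{\mathbf{x}}_{1it}$ and $\mathbf{w}_{1i}$ as an instrument for $\widetilde{\mathbf{w}}_{1i}$.
Essentially, $\mathbf{x}_1$ is used as an instrument twice: as $\ddot{\mathbf{x}}_{1it}$ and as $\bar{\mathbf{x}}_{1i}$. By using
the average $\bar{\mathbf{x}}_{1i}$ in forming instruments, we are using data from other
periods to form instruments.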


9.3.2 The xthtaylor command 


The xthtaylor command performs IV estimation of the parameters of (9.2)
using the instruments $\ddot{\mathbf{x}}_{1it}$, $\ddot{\mathbf{x}}_{2it}$, $\mathbf{w}_{1i}$, and $\bar{\mathbf{x}}_{1i}$. The syntax of the command is


xthtaylor depvar indepvars [if] [in] [weight], endog(varlist) [options]


Here all the regressors are given in indepvars, and the subset of these that
are potentially correlated with $\alpha_i$ are given in endog(varlist). The
xthtaylor command provides an option to compute cluster-robust standard
errors.


The options include amacurdy, which uses a wider range of instruments.
Specifically, the Hausman-Taylor method requires that $\bar{\mathbf{x}}_{1i}$ be uncorrelated
with $\alpha_i$. If each $\mathbf{x}_{1it}$, $t = 1, \ldots, T$, is uncorrelated with $\alpha_i$, then more
instruments are available, and we can use as instruments $\ddot{\mathbf{x}}_{1it}$, $\ddot{\mathbf{x}}_{2it}$, $\mathbf{w}_{1i}$, and
$\mathbf{x}_{1i1}, \ldots, \mathbf{x}_{1iT}$.


9.3.3 Application of the xthtaylor command 


The dataset used here, attributed to Baltagi and Khanti-Akom (1990) and
Cornwell and Rupert (1988), was originally analyzed with the Hausman-Taylor
estimator. We reproduce that application here. It uses a wider set of
regressors than we have used to this point.


The goal is to obtain a consistent estimate of the coefficient of ed because
there is great interest in the impact of education on wages. Education is
clearly endogenous. It is assumed that education is correlated only with the
individual-specific component of the error $\alpha_i$. In principle, within estimation
gives a consistent estimator, but in practice, no estimator is obtained because
ed is time invariant, so its coefficient cannot be estimated.


The Hausman-Taylor estimator is used instead, assuming that only a
subset of the regressors is correlated with $\alpha_i$. Identification requires that
there be at least one time-varying regressor that is uncorrelated with the
fixed effects. Cornwell and Rupert (1988) assumed that, for the time-varying
regressors, exp, exp2, wks, ms, and union were endogenous, whereas occ,
south, smsa, and ind were exogenous. And for the time-invariant regressors,
ed is endogenous and fem and blk are exogenous. The xthtaylor command
requires distinction only between endogenous and exogenous regressors
because it can determine which regressors are time varying and which are
not.
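The order condition stated in section 9.3.1 can be checked for this specification by counting regressors. In the sketch below, $k_1$ and $g_2$ are labels introduced here for the number of time-varying exogenous regressors and the number of time-invariant endogenous regressors; they are not notation used elsewhere in this chapter.

```latex
k_1 = \#\{\texttt{occ},\ \texttt{south},\ \texttt{smsa},\ \texttt{ind}\} = 4
\;\geq\;
g_2 = \#\{\texttt{ed}\} = 1
```

The condition therefore holds, with three more time-varying exogenous regressors than the minimum required.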


We obtain 


. * Hausman-Taylor example of Baltagi and Khanti-Akom (1990)
. xthtaylor lwage occ south smsa ind exp exp2 wks ms union fem blk ed,
>     endog(exp exp2 wks ms union ed) vce(robust)

Hausman-Taylor estimation                       Number of obs     =      4,165
Group variable: id                              Number of groups  =        595

                                                Obs per group:
                                                              min =          7
                                                              avg =          7
                                                              max =          7

Random effects u_i ~ i.i.d.                     Wald chi2(12)     =   33204.76
                                                Prob > chi2       =     0.0000

                                   (Std. err. adjusted for 595 clusters in id)
------------------------------------------------------------------------------
             |               Robust
       lwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
TVexogenous  |
         occ |  -.0207047   .0190066    -1.09   0.276    -.0579569    .0165475
       south |   .0074398   .0785588     0.09   0.925    -.1465327    .1614123
        smsa |  -.0418334   .0285698    -1.46   0.143    -.0978291    .0141624
         ind |   .0136039   .0222459     0.61   0.541    -.0299973    .0572051
TVendogenous |
         exp |   .1131328   .0040568    27.89   0.000     .1051816    .1210839
        exp2 |  -.0004189   .0000823    -5.09   0.000    -.0005803   -.0002575
         wks |   .0008374   .0008666     0.97   0.334    -.0008611    .0025359
          ms |  -.0298508   .0268188    -1.11   0.266    -.0824147    .0227132
       union |   .0327714   .0250575     1.31   0.191    -.0163404    .0818832
TIexogenous  |
         fem |  -.1309236     .11753    -1.11   0.265    -.3612781    .0994309
         blk |  -.2857479   .1704617    -1.68   0.094    -.6198467    .0483509
TIendogenous |
          ed |    .137944   .0216481     6.37   0.000     .0955145    .1803734
             |
       _cons |   2.912726    .307723     9.47   0.000       2.3096    3.515852
-------------+----------------------------------------------------------------
     sigma_u |  .94180304
     sigma_e |  .15180273
         rho |  .97467788   (fraction of variance due to u_i)
------------------------------------------------------------------------------


Note: TV refers to time varying; TI refers to time invariant. 


Compared with the RE estimates given in section 8.7, the coefficient of ed 
has increased from 0.112 to 0.138, and the standard error has increased from 
0.0084 to 0.0216. 


For the regular IV estimator to be consistent, it is necessary to argue that
any instruments are uncorrelated with the error term. Similarly, for the
Hausman-Taylor estimator to be consistent, it is necessary to argue that all
regressors are uncorrelated with the idiosyncratic error $\varepsilon_{it}$ and that a
specified subset of the regressors is uncorrelated with the fixed effects $\alpha_i$.
This strong assumption can be tested by the community-contributed
command xtoverid following command xthtaylor.


9.4 Arellano-Bond estimator


With panel data, the dependent variable is observed over time, opening up the 
possibility of estimating parameters of dynamic models that specify the 
dependent variable for an individual to depend in part on its values in previous 
periods. As in the nonpanel case, however, care is needed because OLS with a 
lagged dependent variable and serially correlated error leads to inconsistent 
parameter estimates. 


We consider estimation of FE models for short panels when one or more lags
of the dependent variable are included as regressors. Then the fixed effect needs
to be eliminated by first differencing rather than mean differencing for reasons
given at the end of section 9.4.1. Consistent estimators can be obtained by IV
estimation of the parameters in the FD model, using appropriate lags of regressors
as the instruments. This estimator, called the Arellano-Bond estimator, can be
performed, with some considerable manipulation, by using the IV command
ivregress or xtivreg. But it is much easier to use the specialized commands
xtabond, xtdpdsys, and xtdpd. These commands also enable more efficient
estimation and provide appropriate model specification tests.


9.4.1 Dynamic model 


The general model considered is an autoregressive model of order $p$ in $y_{it}$ [an
AR($p$) model] with $y_{i,t-1}, \ldots, y_{i,t-p}$ as regressors, as well as the regressors $\mathbf{x}_{it}$. The
model is


$$y_{it} = \gamma_1 y_{i,t-1} + \cdots + \gamma_p y_{i,t-p} + \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i + \varepsilon_{it}, \quad t = p+1, \ldots, T \tag{9.3}$$


where $\alpha_i$ is a fixed effect. The regressors $\mathbf{x}_{it}$ are initially assumed to be
uncorrelated with $\varepsilon_{it}$, an assumption that is relaxed in section 9.4.8. The goal is to
consistently estimate $\gamma_1, \ldots, \gamma_p$ and $\boldsymbol{\beta}$ when $\alpha_i$ is a fixed effect. The estimators
are also consistent if $\alpha_i$ is a random effect.


The dynamic model (9.3) provides several quite different reasons for
correlation in $y$ over time: 1) directly through $y$ in preceding periods, called true
state dependence; 2) directly through observables $\mathbf{x}$, called observed
heterogeneity; and 3) indirectly through the time-invariant individual effect $\alpha_i$,
called unobserved heterogeneity.


These reasons have substantively different policy implications. For
illustration, consider a pure AR(1) time-series model for earnings
$y_{it} = \gamma_1 y_{i,t-1} + \alpha_i + \varepsilon_{it}$ with $\varepsilon_{it} \simeq 0$ for $t > 1$. Suppose in period 1 there is a
large positive shock, $\varepsilon_{i1}$, leading to a large value for $y_{i1}$, moving a low-paid
individual to a high-paying job. Then, if $\gamma_1 \simeq 1$, earnings will remain high in
future years (because $y_{i,t+1} \simeq y_{it} + \alpha_i$). If instead $\gamma_1 \simeq 0$, earnings will return to
$\alpha_i$ in future years (because $y_{i,t+1} \simeq \alpha_i$).
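The two polar cases can be read off a short recursion. The sketch below assumes, for simplicity, that the later shocks are exactly zero ($\varepsilon_{it} = 0$ for $t > 1$) and solves the AR(1) model forward from the period-1 value $y_{i1}$; the horizon index $s$ is introduced here only for this illustration.

```latex
\begin{aligned}
y_{i,1+s} &= \gamma_1^{\,s}\, y_{i1} + \alpha_i \left(1 + \gamma_1 + \cdots + \gamma_1^{\,s-1}\right), \qquad s = 1, 2, \ldots \\
\gamma_1 = 0 &: \quad y_{i,1+s} = \alpha_i \\
\gamma_1 = 1 &: \quad y_{i,1+s} = y_{i1} + s\,\alpha_i \\
|\gamma_1| < 1 &: \quad y_{i,1+s} \to \frac{\alpha_i}{1 - \gamma_1} \quad \text{as } s \to \infty
\end{aligned}
```

The influence of the initial shock therefore decays at rate $\gamma_1^{\,s}$: it vanishes after one period when $\gamma_1 = 0$ and never fades when $\gamma_1 = 1$.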


Note that the within estimator is inconsistent once lagged dependent variables are
introduced as regressors. This is because the within model will have the first regressor
$y_{i,t-1} - \bar{y}_i$ that is correlated with the error $\varepsilon_{it} - \bar{\varepsilon}_i$ because $y_{i,t-1}$ is correlated
with $\varepsilon_{i,t-1}$ and hence with $\bar{\varepsilon}_i$. Furthermore, IV estimation using lags is not
possible, because any lag $y_{i,s}$ will also be correlated with $\bar{\varepsilon}_i$ and hence with
$\varepsilon_{it} - \bar{\varepsilon}_i$. By contrast, although the FD estimator is also inconsistent, IV estimators
of the FD model that use appropriate lags of $y_{it}$ as instruments do lead to
consistent parameter estimates.


9.4.2 IV estimation in the FD model 


The FD model is 


$$\Delta y_{it} = \gamma_1 \Delta y_{i,t-1} + \cdots + \gamma_p \Delta y_{i,t-p} + \Delta\mathbf{x}_{it}'\boldsymbol{\beta} + \Delta\varepsilon_{it}, \quad t = p+1, \ldots, T \tag{9.4}$$


We make the crucial assumption that $\varepsilon_{it}$ are serially uncorrelated, a departure
from most analysis to this point that has permitted $\varepsilon_{it}$ to be correlated over time
for a given individual. This assumption is testable, is likely to hold if $p$ is
sufficiently large, and can be relaxed by using xtdpd, presented in section 9.4.8.


In contrast to a static model, OLS on the first-differenced data produces
inconsistent parameter estimates because the regressor $\Delta y_{i,t-1}$ is correlated with
the error $\Delta\varepsilon_{it}$, even if $\varepsilon_{it}$ are serially uncorrelated. For serially uncorrelated $\varepsilon_{it}$,
the FD model error $\Delta\varepsilon_{it} = \varepsilon_{it} - \varepsilon_{i,t-1}$ is correlated with
$\Delta y_{i,t-1} = y_{i,t-1} - y_{i,t-2}$ because $y_{i,t-1}$ depends on $\varepsilon_{i,t-1}$. At the same time,
$\Delta\varepsilon_{it}$ is uncorrelated with $\Delta y_{i,t-k}$ for $k \geq 2$, opening up the possibility of IV
estimation using lagged variables as instruments.
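The source of the correlation can be made explicit. The sketch below assumes that the $\varepsilon_{it}$ are serially uncorrelated with common variance $\sigma_\varepsilon^2$ (a symbol introduced here) and are uncorrelated with $\alpha_i$, with the regressors, and with values of $y$ from earlier periods.

```latex
\begin{aligned}
\operatorname{Cov}(\Delta y_{i,t-1}, \Delta\varepsilon_{it})
 &= \operatorname{Cov}(y_{i,t-1} - y_{i,t-2},\; \varepsilon_{it} - \varepsilon_{i,t-1}) \\
 &= -\operatorname{Cov}(y_{i,t-1}, \varepsilon_{i,t-1}) = -\sigma_\varepsilon^2 \neq 0 \\
\operatorname{Cov}(y_{i,t-2}, \Delta\varepsilon_{it}) &= 0
\end{aligned}
```

Under these assumptions the level $y_{i,t-2}$ is uncorrelated with the differenced error, while it remains correlated with $\Delta y_{i,t-1}$ through $y_{i,t-2}$ itself, which is what qualifies it as an instrument.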


Anderson and Hsiao (1981) proposed IV estimation using $y_{i,t-2}$, which is 
uncorrelated with $\Delta\varepsilon_{it}$, as an instrument for $\Delta y_{i,t-1}$. The other lagged dependent 
variables can be instruments for themselves. The regressors $\mathbf{x}_{it}$ can be used as 
instruments for themselves if they are strictly exogenous; otherwise, they can also 
be instrumented as detailed below. 
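

As a rough illustration of the Anderson and Hsiao idea, the just-identified IV 
estimator for an AR(1) model in first differences can be obtained with ivregress 
and time-series operators. This is a minimal sketch and not the book's own code: 
it assumes the dataset mus208psid.dta (used in the examples of this section) is 
in the working directory and that the panel and time identifiers are id and t. 

```stata
* Anderson-Hsiao IV for an AR(1) model in first differences (illustrative sketch).
* Assumption: mus208psid.dta is available, with panel id "id" and time "t".
use mus208psid, clear
xtset id t
* Regress D.lwage on LD.lwage, using the level L2.lwage as the single instrument.
* Clustering on id allows for the serial correlation that differencing induces.
ivregress 2sls D.lwage (LD.lwage = L2.lwage), vce(cluster id)
```

Because one instrument is used for one endogenous regressor, this model is 
just-identified, so 2SLS and optimal GMM coincide here. 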


More efficient IV estimators can be obtained by using additional lags of the 
dependent variable as instruments; see Holtz-Eakin, Newey, and Rosen (1988). 
The estimator is then called the Arellano-Bond estimator after Arellano and 
Bond (1991), who detailed implementation of the estimator and proposed tests of 
the crucial assumption that $\varepsilon_{it}$ are serially uncorrelated. Because the instrument 
set is unbalanced and can be quite complicated, Stata provides the distinct 
command xtabond. 


9.4.3 The xtabond command 


The xtabond command has the syntax 
xtabond depvar [indepvars] [if] [in] [, options] 


The number of lags in the dependent variable, $p$ in (9.4), is defined by using the 
lags(#) option with the default $p = 1$. The regressors are declared in different 
ways depending on the type of regressor. 


First, strictly exogenous regressors are uncorrelated with $\varepsilon_{it}$, require no 
special treatment, are used as instruments for themselves, and are entered as 
indepvars. 


Second, predetermined regressors or weakly exogenous regressors are 
correlated with past errors but are uncorrelated with future errors: $E(x_{it}\varepsilon_{is}) \neq 0$ 
for $s < t$, and $E(x_{it}\varepsilon_{is}) = 0$ for $s \geq t$. These regressors can be instrumented in 
the same way that $y_{i,t-1}$ is instrumented using subsequent lags of $y_{i,t-1}$. 
Specifically, $x_{it}$ is instrumented by $x_{i,t-1}, x_{i,t-2}, \ldots$. These regressors are 
entered by using the pre(varlist) option. 


Third, a regressor may be contemporaneously endogenous: $E(x_{it}\varepsilon_{is}) \neq 0$ for 
$s \leq t$, and $E(x_{it}\varepsilon_{is}) = 0$ for $s > t$. Now $E(x_{it}\varepsilon_{it}) \neq 0$, so $x_{i,t-1}$ is no longer a 
valid instrument in the FD model. The instruments for $x_{it}$ are now $x_{i,t-2}, x_{i,t-3}, 
\ldots$. These regressors are entered by using the endogenous(varlist) option. 


Finally, additional instruments can be included by using the inst(varlist) 
option. 


Potentially, many instruments are available, especially if $T$ is large. If too 
many instruments are used, then asymptotic theory provides a poor finite-sample 
approximation to the distribution of the estimator. The maxldep(#) option sets 
the maximum number of lags of the dependent variable that can be used as 
instruments. The maxlags(#) option sets the maximum number of lags of the 
predetermined and endogenous variables that can be used as instruments. 
Alternatively, the lagstruct(lags, endlags) suboption can be applied 
individually to each variable in pre(varlist) and endogenous(varlist). 


Two different IV estimators can be obtained; see section 7.3. The 2SLS 
estimator, also called the one-step estimator, is the default. Because the model is 
overidentified, more efficient estimation is possible using optimal generalized 
method of moments (GMM), also called the two-step estimator because first-step 
estimation is needed to obtain the optimal weighting matrix used at the second 
step. The optimal GMM estimator is obtained by using the twostep option. 


The vce(robust) option provides a heteroskedasticity-consistent estimate of the 
variance-covariance matrix of the estimator (VCE). If the $\varepsilon_{it}$ are serially 
correlated, the estimator is no longer consistent, so there is no cluster-robust VCE 
for this case. 


Postestimation commands for xtabond include estat abond, to test the 
critical assumption of no error correlation, and estat sargan, to perform an 
overidentifying restrictions test; see section 9.4.6. 


9.4.4 Arellano—Bond estimator: Pure time series 


For concreteness, consider an AR(2) model for lwage with no other regressors 
and seven years of data. Then we have sufficient data to obtain IV estimates in the 
model 


$$\Delta y_{it} = \alpha + \gamma_1 \Delta y_{i,t-1} + \gamma_2 \Delta y_{i,t-2} + \Delta\varepsilon_{it}, \quad t = 4, 5, 6, 7$$


At $t = 4$, there are two available instruments, $y_{i1}$ and $y_{i2}$, because these are 
uncorrelated with $\Delta\varepsilon_{i4}$. At $t = 5$, there are now three instruments, $y_{i1}$, $y_{i2}$, and 
$y_{i3}$, that are uncorrelated with $\Delta\varepsilon_{i5}$. Continuing in this manner at $t = 6$, there are 
four instruments, $y_{i1}, \ldots, y_{i4}$; and at $t = 7$, there are five instruments, $y_{i1}, \ldots, y_{i5}$. 
In all, there are 2 + 3 + 4 + 5 = 14 available instruments for the two lagged 
dependent variable regressors. Additionally, the intercept is an instrument for 
itself. Estimation can be by 2SLS or by the more efficient optimal GMM, which is 
possible because the model is overidentified. Because the instrument set is 
unbalanced, it is much easier to use xtabond than it is to manually set up the 
instruments and use ivregress. 
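

The instrument count above is simple arithmetic, and it can be reproduced with a 
short loop. The sketch below is illustrative only: the local macro names (T, p, 
cap, and so on) are hypothetical, and the cap is meant to mimic the effect of the 
maxldep() option on the count rather than to call xtabond itself. 

```stata
* Count GMM-type instruments for the lagged dependent variable (sketch).
* Assumptions: T = 7 periods, p = 2 lags of the dependent variable.
* cap limits the lags used per period; cap = . (missing) means no limit.
local T     = 7
local p     = 2
local cap   = .
local first = `p' + 2
local total = 0
forvalues t = `first'/`T' {
    * In period t the levels y_i1, ..., y_i,t-2 are available as instruments.
    local avail = `t' - 2
    * min() ignores missing values, so a missing cap leaves avail unchanged.
    local used  = min(`avail', `cap')
    local total = `total' + `used'
}
display "GMM-type instruments: `total'"
display "Including the constant: " `total' + 1
```

With no cap, the loop sums 2 + 3 + 4 + 5 and the constant brings the total to the 
15 instruments reported by xtabond. Setting cap to 3 gives the 2 + 3 + 3 + 3 = 11 
count that applies when maxldep(3) is used. 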


We apply the estimator to an AR(2) model for the wages data, initially without 
additional regressors. 


. * 2SLS or one-step GMM for a pure time-series AR(2) panel model 
. qui use mus208psid, clear 


. xtabond lwage, lags(2) vce(robust) 


Arellano-Bond dynamic panel-data estimation     Number of obs     =      2,380
Group variable: id                              Number of groups  =        595
Time variable: t
                                                Obs per group:
                                                              min =          4
                                                              avg =          4
                                                              max =          4

Number of instruments =     15                  Wald chi2(2)      =    1253.03
                                                Prob > chi2       =     0.0000
One-step results
                                     (Std. err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |               Robust
       lwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       lwage |
         L1. |   .5707517   .0333941    17.09   0.000     .5053005    .6362029
         L2. |   .2675649   .0242641    11.03   0.000     .2200082    .3151216
             |
       _cons |   1.203588    .164496     7.32   0.000     .8811814    1.525994
------------------------------------------------------------------------------
Instruments for differenced equation
        GMM-type: L(2/.).lwage
Instruments for level equation
        Standard: _cons

There are 4 × 595 = 2,380 observations because the first three years of data are 
lost to construct $\Delta y_{i,t-2}$. The results are reported for the original levels model, 
with the dependent variable $y_{it}$ and the regressors the lagged dependent variables 
$y_{i,t-1}$ and $y_{i,t-2}$, even though mechanically the FD model is fit. There are 15 
instruments, as already explained, with output L(2/.), meaning that $y_{i,t-2}, y_{i,t-3}, 
\ldots, y_{i1}$ are the instruments used for period $t$. Wages depend greatly on past 
wages, with the lag weights summing to 0.57 + 0.27 = 0.84. 


The results given are for the 2SLS or one-step estimator. The standard errors 
reported are robust standard errors that permit the underlying error $\varepsilon_{it}$ to be 
heteroskedastic but do not allow for any serial correlation in $\varepsilon_{it}$, because then the 
estimator is inconsistent. 


More efficient estimation is possible using optimal or two-step GMM because 
the model is overidentified. Standard errors reported using the standard textbook 
formulas for the two-step GMM estimator are downward biased in finite samples. 
A better estimate of the standard errors, proposed by Windmeijer (2005), can be 
obtained by using the vce(robust) option. As for the one-step estimator, these 
standard errors permit heteroskedasticity in $\varepsilon_{it}$. 


Two-step GMM estimation for our data yields 


. * Optimal or two-step GMM for a pure time-series AR(2) panel model 
. xtabond lwage, lags(2) twostep vce(robust) 


Arellano-Bond dynamic panel-data estimation     Number of obs     =      2,380
Group variable: id                              Number of groups  =        595
Time variable: t
                                                Obs per group:
                                                              min =          4
                                                              avg =          4
                                                              max =          4

Number of instruments =     15                  Wald chi2(2)      =    1974.40
                                                Prob > chi2       =     0.0000
Two-step results
                                     (Std. err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |              WC-robust
       lwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       lwage |
         L1. |   .6095931   .0330542    18.44   0.000      .544808    .6743782
         L2. |   .2708335   .0279226     9.70   0.000     .2161061    .3255608
             |
       _cons |   .9182262   .1339978     6.85   0.000     .6555952    1.180857
------------------------------------------------------------------------------
Instruments for differenced equation
        GMM-type: L(2/.).lwage
Instruments for level equation
        Standard: _cons

Here the one-step and two-step estimators have similar estimated coefficients, 
and the standard errors are also similar, so there is little efficiency gain in 
two-step estimation. 


For a large $T$, the Arellano-Bond method generates many instruments, 
leading to potential poor performance of asymptotic results. The number of 
instruments can be restricted by using the maxldep() option. For example, we 
may use only the first available lag, so that just $y_{i,t-2}$ is the instrument in period $t$. 


. * Reduce the number of instruments for a pure time-series AR(2) panel model 
. xtabond lwage, lags(2) vce(robust) maxldep(1) 


Arellano-Bond dynamic panel-data estimation     Number of obs     =      2,380
Group variable: id                              Number of groups  =        595
Time variable: t
                                                Obs per group:
                                                              min =          4
                                                              avg =          4
                                                              max =          4

Number of instruments =      5                  Wald chi2(2)      =    1372.33
                                                Prob > chi2       =     0.0000
One-step results
                                     (Std. err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |               Robust
       lwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       lwage |
         L1. |   .4863642   .1919353     2.53   0.011      .110178    .8625505
         L2. |   .3647456   .1661008     2.20   0.028      .039194    .6902973
             |
       _cons |   1.127609   .2429357     4.64   0.000     .6514633    1.603754
------------------------------------------------------------------------------
Instruments for differenced equation
        GMM-type: L(2/2).lwage
Instruments for level equation
        Standard: _cons


Here there are five instruments: $y_{i2}$ when $t = 4$, $y_{i3}$ when $t = 5$, $y_{i4}$ when $t = 6$, 
$y_{i5}$ when $t = 7$, and the intercept is an instrument for itself. 


In this example, there is considerable loss of efficiency because the standard 
errors are now about six times larger. This inefficiency disappears if we instead 
use the maxldep(2) option, yielding 8 instruments rather than the original 15. 


9.4.5 Arellano—Bond estimator: Additional regressors 


We now introduce regressors that are not lagged dependent variables. 


We fit a model for lwage similar to the model specified in section 9.3. The 
time-invariant regressors fem, blk, and ed are dropped because they are 
eliminated after first-differencing. The regressors occ, south, smsa, and ind are 
treated as strictly exogenous. The regressor wks appears both contemporaneously 
and with one lag, and it is treated as predetermined. The regressors ms and union 
are treated as endogenous. The first two lags of the dependent variable lwage are 
also regressors. 


The model omits one very important regressor, years of work experience 
(exp). For these data, it is difficult to disentangle the separate effects of previous 
periods’ wages and work experience. When both are included, the estimates 
become very imprecise. Because here we wish to emphasize the role of lagged 
wages, we exclude work experience from the model. 


We fit the model using optimal or two-step GMM and report robust standard 
errors. The strictly exogenous variables appear as regular regressors. The 
predetermined and endogenous variables are instead given as options, with 
restrictions placed on the number of available instruments that are actually used. 
The dependent variable appears with two lags, and the maxldep(3) option is 
specified so that at most three lags are used as instruments. For example, when 
$t = 7$, the instruments are $y_{i5}$, $y_{i4}$, and $y_{i3}$. The pre(wks, lag(1,2)) option is 
specified so that wks and L1.wks are regressors, and only two additional lags are 
to be used as instruments. The endogenous(ms, lag(0,2)) option is used to 
indicate that ms appears only as a contemporaneous regressor and that at most 
two additional lags are used as instruments. The artests(3) option does not 
affect the estimation but will affect the postestimation command estat abond, as 
explained in the next section. We have 


. * Optimal or two-step GMM for a dynamic panel model 
. xtabond lwage occ south smsa ind, lags(2) maxldep(3) 
> pre(wks,lag(1,2)) endogenous(ms,lag(0,2)) 
> endogenous(union,lag(0,2)) twostep vce(robust) artests(3) 

Arellano-Bond dynamic panel-data estimation     Number of obs     =      2,380
Group variable: id                              Number of groups  =        595
Time variable: t
                                                Obs per group:
                                                              min =          4
                                                              avg =          4
                                                              max =          4

Number of instruments =     40                  Wald chi2(10)     =    1287.77
                                                Prob > chi2       =     0.0000
Two-step results
                                     (Std. err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |              WC-robust
       lwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       lwage |
         L1. |    .611753   .0373491    16.38   0.000     .5385501    .6849559
         L2. |   .2409058   .0319939     7.53   0.000     .1781989    .3036127
             |
         wks |
         --. |  -.0159751   .0082523    -1.94   0.053    -.0321493     .000199
         L1. |   .0039944   .0027425     1.46   0.145    -.0013807    .0093695
             |
          ms |   .1859324    .144458     1.29   0.198       -.0972    .4690649
       union |  -.1531329   .1677842    -0.91   0.361    -.4819839    .1757181
         occ |  -.0357509   .0347705    -1.03   0.304    -.1038999     .032398
       south |  -.0250368   .2150806    -0.12   0.907     -.446587    .3965134
        smsa |  -.0848223   .0525243    -1.61   0.106     -.187768    .0181235
         ind |   .0227008   .0424207     0.54   0.593    -.0604422    .1058437
       _cons |   1.639999   .4981019     3.29   0.001     .6637377    2.616261
------------------------------------------------------------------------------
Instruments for differenced equation
        GMM-type: L(2/4).lwage L(1/2).L.wks L(2/3).ms L(2/3).union
        Standard: D.occ D.south D.smsa D.ind
Instruments for level equation
        Standard: _cons


With the inclusion of additional regressors, the coefficients of the lagged 
dependent variables have changed little, and the standard errors are about 10 to 
15% higher. The additional regressors are all statistically insignificant at 5%. By 
contrast, some are statistically significant using the within estimator for a static 
model that does not include the lagged dependent variables. 


The output explains the instruments used. For example, L(2/4).lwage means 
that $\text{lwage}_{i,t-2}$, $\text{lwage}_{i,t-3}$, and $\text{lwage}_{i,t-4}$ are used as instruments, provided they 
are available. In the initial period $t = 4$, only the first two of these are available, 
whereas in $t = 5, 6, 7$, all three are available for a total of 2 + 3 + 3 + 3 = 11 
instruments. By similar analysis, L(1/2).L.wks, L(2/3).ms, and L(2/3).union 
each provide 8 instruments, and there are 5 standard instruments. In all, there are 
11 + 8 + 8 + 8 + 5 = 40 instruments, as stated at the top of the output. 


9.4.6 Specification tests 


For consistent estimation, the xtabond estimators require that the error $\varepsilon_{it}$ be 
serially uncorrelated. This assumption is testable. 


Specifically, if $\varepsilon_{it}$ are serially uncorrelated, then $\Delta\varepsilon_{it}$ are correlated with 
$\Delta\varepsilon_{i,t-1}$ because $\operatorname{Cov}(\Delta\varepsilon_{it}, \Delta\varepsilon_{i,t-1}) = \operatorname{Cov}(\varepsilon_{it} - \varepsilon_{i,t-1}, \varepsilon_{i,t-1} - \varepsilon_{i,t-2}) 
= -\operatorname{Cov}(\varepsilon_{i,t-1}, \varepsilon_{i,t-1}) \neq 0$. But $\Delta\varepsilon_{it}$ will not be 
correlated with $\Delta\varepsilon_{i,t-k}$ for $k \geq 2$. A test of whether $\Delta\varepsilon_{it}$ are correlated with 
$\Delta\varepsilon_{i,t-k}$ for $k \geq 2$ can be calculated based on the correlation of the fitted 
residuals $\Delta\widehat{\varepsilon}_{it}$. This is performed by using the estat abond command. 


The default is to test to lag 2, but here we also test the third lag. This can be 
done in two ways. One way is to use estat abond with the artests(3) option, 
which leads to recalculation of the estimator defined in the preceding xtabond 
command. Alternatively, we can include the artests(3) option in xtabond, in 
which case we simply use estat abond and no recalculation is necessary. 


In our case, the artests(3) option was included in the preceding xtabond 
command. We obtain 


. * Test whether error is serially correlated 
. estat abond 

Arellano-Bond test for zero autocorrelation in first-differenced errors
H0: No autocorrelation

   Order         z   Prob > z
       1   -4.5244     0.0000
       2   -1.6041     0.1087
       3    .35729     0.7209


The null hypothesis that $\operatorname{Cov}(\Delta\varepsilon_{it}, \Delta\varepsilon_{i,t-k}) = 0$ for $k = 1, 2, 3$ is rejected at a 
level of 0.05 if $p < 0.05$. As explained above, if $\varepsilon_{it}$ are serially uncorrelated, we 
expect to reject at order 1 but not at higher orders. This is indeed the case. We 
reject at order 1 because $p = 0.000$. At order 2, we do not reject the hypothesis 
that $\Delta\varepsilon_{it}$ and $\Delta\varepsilon_{i,t-2}$ are uncorrelated because $p = 0.109 > 0.05$. Similarly, at 
order 3, there is no evidence of serial correlation because $p = 0.721 > 0.05$. The 
tests therefore provide no evidence of serial correlation in the original error $\varepsilon_{it}$, 
as desired. 


A second specification test is a test of overidentifying restrictions; see 
section 7.4.8. Here 40 instruments were used to estimate 11 parameters, so there 
are 29 overidentifying restrictions. The estat sargan command implements the 
test. This command is not implemented after xtabond if the vce(robust) option 
is used, because the test requires that the errors $\varepsilon_{it}$ be independent and 
identically distributed (i.i.d.) and is invalid otherwise. We therefore need to first run 
xtabond without this option. We have 


. * Test of overidentifying restrictions (first estimate with no vce(robust)) 
. qui xtabond lwage occ south smsa ind, lags(2) maxldep(3) 
> pre(wks,lag(1,2)) endogenous(ms,lag(0,2)) 
> endogenous(union,lag(0,2)) twostep artests(3) 
. estat sargan 
Sargan test of overidentifying restrictions 
        H0: Overidentifying restrictions are valid 

        chi2(29)     =  39.87571 
        Prob > chi2  =    0.0860 


The null hypothesis that the population moment conditions are correct is not 
rejected, because p = 0.086 > 0.05. 
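

The degrees of freedom and p-value of this test follow from quantities already 
reported, and they can be checked by hand. The sketch below uses only the numbers 
quoted in the text (40 instruments, 11 parameters, and the test statistic 
39.87571); the local macro names are hypothetical. 

```stata
* Check the Sargan test degrees of freedom and p-value by hand (sketch).
* Degrees of freedom = number of instruments - number of estimated parameters.
local n_inst  = 40
local n_param = 11
local df = `n_inst' - `n_param'
display "df = `df'"
* chi2tail(df, x) returns the upper-tail probability of a chi-squared(df) variate.
display "p-value = " %6.4f chi2tail(`df', 39.87571)
```

The displayed p-value should agree with the Prob > chi2 value that estat sargan 
reports above. 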


9.4.7 The xtdpdsys command 


The Arellano-Bond estimator uses an IV estimator based on the assumption that 
$E(y_{is}\Delta\varepsilon_{it}) = 0$ for $s \leq t - 2$ in (9.3), so that the lags $y_{i,t-2}, y_{i,t-3}, \ldots$ can be 
used as instruments in the first-differenced (9.4). Several articles suggest using 
additional moment conditions to obtain an estimator with improved precision and 
better finite-sample properties. In particular, Arellano and Bover (1995) and 
Blundell and Bond (1998) consider using the additional condition 
$E(\Delta y_{i,t-1}\varepsilon_{it}) = 0$ so that we also incorporate the levels (9.3) and use as an 
instrument $\Delta y_{i,t-1}$. Similar additional moment conditions can be added for 
endogenous and predetermined variables, whose FDs can be used as instruments. 


This estimator, often called the systems estimator, is performed by using the 
command xtdpdsys. It is also performed by using the community-contributed 
xtabond2 command. The syntax is exactly the same as that for xtabond. 


We refit the model of section 9.4.5 using xtdpdsys rather than xtabond. 


. * Arellano/Bover or Blundell/Bond for a dynamic panel model 
. xtdpdsys lwage occ south smsa ind, lags(2) maxldep(3) 
> pre(wks,lag(1,2)) endogenous(ms,lag(0,2)) 
> endogenous(union,lag(0,2)) twostep vce(robust) artests(3) 

System dynamic panel-data estimation            Number of obs     =      2,975
Group variable: id                              Number of groups  =        595
Time variable: t
                                                Obs per group:
                                                              min =          5
                                                              avg =          5
                                                              max =          5

Number of instruments =     60                  Wald chi2(10)     =    2270.88
                                                Prob > chi2       =     0.0000
Two-step results
------------------------------------------------------------------------------
             |              WC-robust
       lwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       lwage |
         L1. |   .6017533   .0291502    20.64   0.000     .5446199    .6588866
         L2. |   .2880537   .0285319    10.10   0.000     .2321322    .3439752
             |
         wks |
         --. |  -.0014979   .0056143    -0.27   0.790    -.0125017     .009506
         L1. |   .0006786   .0015694     0.43   0.665    -.0023973    .0037545
             |
          ms |   .0395337   .0558543     0.71   0.479    -.0699386    .1490061
       union |  -.0422409   .0719919    -0.59   0.557    -.1833423    .0988606
         occ |  -.0508803   .0331149    -1.54   0.124    -.1157843    .0140237
       south |  -.1062817    .083753    -1.27   0.204    -.2704346    .0578713
        smsa |  -.0483567   .0479016    -1.01   0.313    -.1422422    .0455288
         ind |   .0144749    .031448     0.46   0.645    -.0471621    .0761118
       _cons |   .9584113   .3632287     2.64   0.008     .2464961    1.670327
------------------------------------------------------------------------------
Instruments for differenced equation
        GMM-type: L(2/4).lwage L(1/2).L.wks L(2/3).ms L(2/3).union
        Standard: D.occ D.south D.smsa D.ind
Instruments for level equation
        GMM-type: LD.lwage LD.wks LD.ms LD.union
        Standard: _cons


There are now 60 instruments rather than 40 instruments because the lagged FDs 
in lwage, wks, ms, and union are available for each of the 5 periods $t = 3, \ldots, 7$. 
There is some change in estimated coefficients. More noticeable is a reduction in 
standard errors of 10 to 60%, reflecting greater precision because of the additional 
moment conditions. 


The procedure assumes the errors $\varepsilon_{it}$ are serially uncorrelated. This 
assumption can be tested by using the estat abond postestimation command, and 
from output not given, this test confirms that the errors are serially uncorrelated 
here. If the xtdpdsys command is run with the default standard errors, the estat 
sargan command can be used to test the overidentifying conditions. 


9.4.8 The xtdpd command 


The preceding estimators and commands require that the model errors $\varepsilon_{it}$ be 
serially uncorrelated. If this assumption is rejected (it is testable by using the 
estat abond command), then one possibility is to add more lags of the dependent 
variable as regressors in the hope that this will eliminate any serial correlation in 
the error. 


An alternative is to use the xtdpd command that allows $\varepsilon_{it}$ to follow a 
moving-average (MA) process of low order. This command also allows 
predetermined variables to have a more complicated structure. 


For xtdpd, a very different syntax is used to enter all the variables and 
instruments in the model; see [XT] xtdpd. Essentially, one specifies a variable list 
with all model regressors (lagged dependent, exogenous, predetermined, and 
endogenous), followed by options that specify instruments. For exogenous 
regressors, the div() option is used, and for other types of regressors, the 
dgmmiv() option is used with the explicit statement of the lags of each regressor 
to be used as instruments. Instruments for the levels equation, used in the 
xtdpdsys command, can also be specified with the lgmmiv() option. 


As an example, we provide without explanation an xtdpd command that 
exactly reproduces the xtdpdsys command of the previous section. We have 


. * Use of xtdpd to exactly reproduce the previous xtdpdsys command 
. xtdpd L(0/2).lwage L(0/1).wks occ south smsa ind ms union, 
> div(occ south smsa ind) dgmmiv(lwage, lagrange(2 4)) 
> dgmmiv(ms union, lagrange(2 3)) dgmmiv(L.wks, lagrange(1 2)) 
> lgmmiv(lwage wks ms union) twostep vce(robust) artests(3) 

Dynamic panel-data estimation                   Number of obs     =      2,975
Group variable: id                              Number of groups  =        595
Time variable: t
                                                Obs per group:
                                                              min =          5
                                                              avg =          5
                                                              max =          5

Number of instruments =     60                  Wald chi2(10)     =    2270.88
                                                Prob > chi2       =     0.0000
Two-step results
                                     (Std. err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |              WC-robust
       lwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       lwage |
         L1. |   .6017533   .0291502    20.64   0.000     .5446199    .6588866
         L2. |   .2880537   .0285319    10.10   0.000     .2321322    .3439752
             |
         wks |
         --. |  -.0014979   .0056143    -0.27   0.790    -.0125017     .009506
         L1. |   .0006786   .0015694     0.43   0.665    -.0023973    .0037545
             |
         occ |  -.0508803   .0331149    -1.54   0.124    -.1157843    .0140237
       south |  -.1062817    .083753    -1.27   0.204    -.2704346    .0578713
        smsa |  -.0483567   .0479016    -1.01   0.313    -.1422422    .0455288
         ind |   .0144749    .031448     0.46   0.645    -.0471621    .0761118
          ms |   .0395337   .0558543     0.71   0.479    -.0699386    .1490061
       union |  -.0422409   .0719919    -0.59   0.557    -.1833423    .0988606
       _cons |   .9584113   .3632287     2.64   0.008     .2464961    1.670327
------------------------------------------------------------------------------
Instruments for differenced equation
        GMM-type: L(2/4).lwage L(2/3).ms L(2/3).union L(1/2).L.wks
        Standard: D.occ D.south D.smsa D.ind
Instruments for level equation
        GMM-type: LD.lwage LD.wks LD.ms LD.union
        Standard: _cons


Now suppose that the error $\varepsilon_{it}$ in (9.3) is MA(1), so that $\varepsilon_{it} = \eta_{it} + \delta\eta_{i,t-1}$, 
where $\eta_{it}$ is i.i.d. Then $y_{i,t-2}$ is no longer a valid instrument, but $y_{i,t-3}$ and 
further lags are. Also, for the level equation, $\Delta y_{i,t-1}$ is no longer a valid 
instrument, but $\Delta y_{i,t-2}$ is valid. We need to change the dgmmiv() and lgmmiv() 
options for lwage. The command becomes 


. * Previous command if model error is MA(1) 
. xtdpd L(0/2).lwage L(0/1).wks occ south smsa ind ms union, 
> div(occ south smsa ind) dgmmiv(lwage, lagrange(3 4)) 
> dgmmiv(ms union, lagrange(2 3)) dgmmiv(L.wks, lagrange(1 2)) 
> lgmmiv(L.lwage wks ms union) twostep vce(robust) artests(3) 
(output omitted ) 
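

The instrument validity claims for the MA(1) case can be checked by expanding the 
differenced error. The lines below are a sketch that assumes, as in the text, 
$\varepsilon_{it} = \eta_{it} + \delta\eta_{i,t-1}$ with i.i.d. $\eta_{it}$, and additionally that $\eta_{it}$ is 
uncorrelated with $\alpha_i$ and the other regressors. The symbol $\sigma_\eta^2$ for the 
variance of $\eta_{it}$ is notation introduced here for illustration. 

```latex
\begin{aligned}
\Delta\varepsilon_{it}
  &= \eta_{it} + (\delta - 1)\,\eta_{i,t-1} - \delta\,\eta_{i,t-2}, \\
\operatorname{Cov}(y_{i,t-2}, \Delta\varepsilon_{it})
  &= -\delta\,\operatorname{Cov}(y_{i,t-2}, \eta_{i,t-2})
   = -\delta\,\sigma_\eta^2 \neq 0 \quad (\delta \neq 0), \\
\operatorname{Cov}(y_{i,t-3}, \Delta\varepsilon_{it}) &= 0 .
\end{aligned}
```

Because $y_{i,t-3}$ depends only on $\eta_{i,t-3}$ and earlier shocks, it remains a valid 
instrument, which is consistent with the lag range for lwage starting at 3 in the 
command above. 

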
9.4.9 The xtabond2 command 


For fitting FE linear dynamic panel models, possibly with serially correlated 
errors, using short panel data, the xtabond2 command (Roodman 2009) provides 
an alternative to the xtabond command. Additionally, Roodman (2009) provides 
a very detailed discussion of the Arellano—Bond method that is very useful to 
read even if one uses the xtabond command. 


The xtabond2 command and the computer output it generates have several 
attractive features. To eliminate the fixed effects, the command optionally uses 
the orthogonal deviations transformation or Helmert transformation, which is a 
more data-saving method for eliminating the fixed effects than the FD 
transformation if data are not available in all time periods. To limit the number of 
instruments, it permits both “GMM-style” and “IV-style” specification of the lags 
and provides options for limiting the number of moment conditions used. And the 
xtabond2 command provides an option to obtain Windmeijer-corrected 
cluster-robust efficient two-step standard errors (Windmeijer 2005). 
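

For reference, the forward orthogonal deviations transformation is commonly 
written as follows for a balanced panel. This is a sketch in the notation of this 
section rather than a formula taken from the text, and it assumes $T$ denotes the 
number of time periods; with gaps in the data, the average is instead taken over 
the future observations that are actually available. 

```latex
\varepsilon^{*}_{it}
  = \sqrt{\frac{T-t}{T-t+1}}
    \left( \varepsilon_{it} - \frac{1}{T-t} \sum_{s=t+1}^{T} \varepsilon_{is} \right),
  \qquad t = 1, \ldots, T-1 .
```

Each observation is expressed as a deviation from the mean of its future 
observations, so a single missing period removes less data than under 
first-differencing, where a gap eliminates two adjacent differences. 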


We illustrate the syntax and use of xtabond2 by reestimating some of the 
specifications considered in previous sections, beginning with two-step GMM 
estimation of the pure AR(2) model with intercept estimated in section 9.4.4 and 
with instruments the second and later lags of the dependent variable. 


Here we use the option h(1), which obtains 2SLS estimates at the first stage, 
and the small option, which bases inference on the $t$ and $F$ distributions rather 
than the normal and chi-squared distributions. 


. * xtabond2: Two-step GMM for a pure time-series AR(2) panel model 
. xtabond2 lwage L.lwage L2.lwage, gmmstyle(L2.lwage) h(1) small twostep robust 
Favoring space over speed. To switch, type or click on mata: mata set matafavor 
> speed, perm. 
Warning: Two-step estimated covariance matrix of moments is singular. 
  Using a generalized inverse to calculate optimal weighting matrix for two-step 
> estimation. 
  Difference-in-Sargan/Hansen statistics may be negative. 

Dynamic panel-data estimation, two-step system GMM
------------------------------------------------------------------------------
Group variable: id                              Number of obs      =      2975
Time variable : t                               Number of groups   =       595
Number of instruments = 15                      Obs per group: min =         5
F(2, 594)     =  2.00e+06                                      avg =      5.00
Prob > F      =     0.000                                      max =         5
------------------------------------------------------------------------------
             |              Corrected
       lwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       lwage |
         L1. |   .8034441   .0763937    10.52   0.000     .6534096    .9534787
         L2. |   .0938402   .0672123     1.40   0.163    -.0381624    .2258429
             |
       _cons |   .7877662   .1191901     6.61   0.000      .553681    1.021852
------------------------------------------------------------------------------
Instruments for first differences equation
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    L(1/6).L2.lwage
Instruments for levels equation
  Standard
    _cons
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    D.L2.lwage
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z =  -3.06  Pr > z =  0.002
Arellano-Bond test for AR(2) in first differences: z =   0.59  Pr > z =  0.558
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(12)   =  23.35  Prob > chi2 =  0.025
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(12)   =  16.12  Prob > chi2 =  0.186
  (Robust, but weakened by many instruments.)

Difference-in-Hansen tests of exogeneity of instrument subsets:
  GMM instruments for levels
    Hansen test excluding group:     chi2(8)    =  12.52  Prob > chi2 =  0.129
    Difference (null H = exogenous): chi2(4)    =   3.60  Prob > chi2 =  0.462


The same number of instruments is used as in section 9.4.4. The number of time 
periods used is five rather than four because the default for the xtabond2 
command is to use the systems estimator of section 9.4.7, which adds a levels 
equation. This also leads to estimates that differ from those in section 9.4.4. The 
tests of overidentifying restrictions are included with the warning that they are 
less reliable in a setting with many instruments. 


Our second reported specification adds exogenous variables (occ south smsa 
ind), endogenous variables (wks L.wks ms union), and four FE time dummies 
(tdum3-tdum6) because estimation uses five years of data and one of the time 
dummies is dropped to avoid the dummy variables trap. Instruments are 
generated using the GMM style for lagged endogenous variables as instruments, as 
well as the IV style applied to time dummies. Here the GMM-style option means 
that different lagged instruments apply to different periods, and the IV-style 
option means that the exogenous variable is a common instrument for all periods. 
We obtain 


. * xtabond2: Two-step GMM for dynamic model with exogenous & endogenous vars. 
. xtabond2 lwage L.lwage L2.lwage occ south smsa ind wks L.wks ms union 
> tdum3-tdum6, gmmstyle(L.(L.lwage wks ms union)) 
> ivstyle(t, equation(level)) twostep small 
Favoring space over speed. To switch, type or click on mata: mata set matafavor 
> speed, perm. 
Warning: Two-step estimated covariance matrix of moments is singular. 
  Using a generalized inverse to calculate optimal weighting matrix for two-step 
> estimation. 
  Difference-in-Sargan/Hansen statistics may be negative. 

Dynamic panel-data estimation, two-step system GMM
------------------------------------------------------------------------------
Group variable: id                              Number of obs      =      2975
Time variable : t                               Number of groups   =       595
Number of instruments = 73                      Obs per group: min =         5
F(14, 594)    =  27435.06                                      avg =      5.00
Prob > F      =     0.000                                      max =         5
------------------------------------------------------------------------------
       lwage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       lwage |
         L1. |    .623573   .0659086     9.46   0.000     .4941307    .7530153
         L2. |    .026329   .0461591     0.57   0.569    -.0643258    .1169838
             |
         occ |  -.3792783    .085034    -4.46   0.000    -.5462821   -.2122744
       south |  -.3445325   .1500747    -2.30   0.022     -.639274    -.049791
        smsa |    -.25237   .1264818    -2.00   0.046    -.5007758   -.0039642
         ind |   .2997457   .1158891     2.59   0.010     .0721435     .527348
             |
         wks |
         --. |  -.0012231   .0041107    -0.30   0.766    -.0092964    .0068503
         L1. |   .0009459   .0012505     0.76   0.450    -.0015101    .0034019
             |
          ms |   .0579029   .0518024     1.12   0.264    -.0438353    .1596411
       union |  -.1356566   .0666732    -2.03   0.042    -.2666004   -.0047128
       tdum3 |  -.0963978   .0229935    -4.19   0.000    -.1415563   -.0512393
       tdum4 |  -.0717887    .017493    -4.10   0.000    -.1061443   -.0374332
       tdum5 |  -.0497818   .0139664    -3.56   0.000    -.0772113   -.0223522
       tdum6 |  -.0239232   .0101656    -2.35   0.019    -.0438882   -.0039583
       _cons |   2.836121   .4131195     6.87   0.000     2.024768    3.647473
------------------------------------------------------------------------------
Warning: Uncorrected two-step standard errors are unreliable.

Instruments for first differences equation
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    L(1/6).(L2.lwage L.wks L.ms L.union)
Instruments for levels equation
  Standard
    t
    _cons
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    D.(L2.lwage L.wks L.ms L.union)
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z =  -4.07  Pr > z =  0.000
Arellano-Bond test for AR(2) in first differences: z =   0.94  Pr > z =  0.349
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(58)   =  66.44  Prob > chi2 =  0.209
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(58)   =  52.97  Prob > chi2 =  0.662
  (Robust, but weakened by many instruments.)

Difference-in-Hansen tests of exogeneity of instrument subsets:
  GMM instruments for levels
    Hansen test excluding group:     chi2(39)   =  30.70  Prob > chi2 =
    Difference (null H = exogenous): chi2(19)   =  22.26  Prob > chi2 =  0.271
  iv(t, eq(level))
    Hansen test excluding group:     chi2(57)   =  46.66  Prob > chi2 =  0.834
    Difference (null H = exogenous): chi2(1)    =   6.31  Prob > chi2 =  0.012


The number of instruments has now jumped to 73, which makes the 
overidentification tests even less reliable. The endogenous and lagged 
endogenous variables have little explanatory power in this case. 


Our third example repeats the previous calculation but uses one of the options 
to limit the number of lags used to form instruments to 2 and 3. 


. * xtabond2: Two-step GMM for dynamic model with limit on the lags generating IV 
. xtabond2 lwage L.lwage L2.lwage occ south smsa ind wks L.wks ms union 
> tdum3-tdum6, gmmstyle(L.(L.lwage wks ms union), laglimits(2 3)) twostep small 
Favoring space over speed. To switch, type or click on mata: mata set matafavor 
> speed, perm. 
Warning: Two-step estimated covariance matrix of moments is singular. 
Using a generalized inverse to calculate optimal weighting matrix for two-step 
> estimation. 
Difference-in-Sargan/Hansen statistics may be negative. 


Dynamic panel-data estimation, two-step system GMM 


Group variable: id                              Number of obs      =      2975
Time variable : t                               Number of groups   =       595
Number of instruments = 42                      Obs per group: min =         5
F(14, 594)    =  19203.93                                      avg =      5.00
Prob > F      =     0.000                                      max =         5

       lwage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
       lwage |
         L1. |   .4911215    .122592     4.01   0.000     .2503549     .731888
         L2. |   .0673887   .1155809     0.58   0.560    -.1596082    .2943856
         occ |  -.4276428   .1332561    -3.21   0.001    -.6893532   -.1659324
       south |  -.4102166   .2282793    -1.80   0.073    -.8585493    .0381162
        smsa |  -.2879704    .219564    -1.31   0.190    -.7191866    .1432458
         ind |   .1600471   .2229759     0.72   0.473    -.2778698    .5979641
         wks |
         --. |   -.000818   .0100056    -0.08   0.935    -.0204687    .0188327
         L1. |   .0106464    .005731     1.86   0.064     -.000609    .0219019
          ms |   .0832632    .110451     0.75   0.451    -.1336587    .3001851
       union |  -.1824594   .1151066    -1.59   0.113    -.4085248     .043606
       tdum3 |  -.1302655   .0426063    -3.06   0.002    -.2139427   -.0465882
       tdum4 |  -.1014275    .032352    -3.14   0.002    -.1649657   -.0378893
       tdum5 |  -.0712044   .0241933    -2.94   0.003    -.1187193   -.0236896
       tdum6 |  -.0377894   .0135913    -2.78   0.006    -.0644822   -.0110966
       _cons |   3.115346   .8056482     3.87   0.000      1.53308    4.697611


Warning: Uncorrected two-step standard errors are unreliable. 


Instruments for first differences equation
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    L(2/3).(L2.lwage L.wks L.ms L.union)
Instruments for levels equation
  Standard
    _cons
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    DL.(L2.lwage L.wks L.ms L.union)

Arellano-Bond test for AR(1) in first differences: z =  -2.98  Pr > z =  0.003
Arellano-Bond test for AR(2) in first differences: z =   0.21  Pr > z =  0.833

Sargan test of overid. restrictions: chi2(27)   =  36.10  Prob > chi2 =  0.113
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(27)   =  26.41  Prob > chi2 =  0.496
  (Robust, but weakened by many instruments.)

Difference-in-Hansen tests of exogeneity of instrument subsets:
  GMM instruments for levels
    Hansen test excluding group:     chi2(12)   =   6.53  Prob > chi2 =  0.887
    Difference (null H = exogenous): chi2(15)   =  19.88  Prob > chi2 =  0.177


As a result, the number of instruments now falls to 42. The standard errors are 
now larger in several cases, so there is some loss of efficiency. 
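
The instrument count can be reduced further in other ways. The following is a minimal sketch, not part of the analysis above: it assumes the same data are in memory and uses the collapse suboption of gmmstyle(), which creates one instrument per variable and lag distance rather than one per period, together with the robust option, which requests the finite-sample correction for two-step standard errors. No output is shown because we have not run this variant.

```stata
* Sketch only: same specification as the third example, with collapsed
* instruments and corrected two-step standard errors
* (xtabond2 is community-contributed; install it before use)
xtabond2 lwage L.lwage L2.lwage occ south smsa ind wks L.wks ms union ///
    tdum3-tdum6, gmmstyle(L.(L.lwage wks ms union), laglimits(2 3) collapse) ///
    twostep robust small
```

Comparing the reported number of instruments and the Hansen test with those above indicates how sensitive the results are to instrument proliferation.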


9.4.10 Dynamic systems with fixed effects 


Arellano and Bond (1991) considered GMM estimation of a single dynamic 
equation with individual specific fixed effects. The earlier article by Holtz-Eakin, 
Newey, and Rosen (1988) actually considered GMM estimation of systems of 
dynamic equations, or vector autoregressive models, with individual specific 
fixed effects in each equation. 


The community-contributed pvar command (Abrigo and Love 2016) 
implements the systems estimator of Holtz-Eakin, Newey, and Rosen (1988). In 
addition to providing model estimates, the pvar command produces impulse- 
response functions. 


9.5 Long panels 


The methods to this point have focused on short panels. Now we consider 
long panels with many time periods. 


Sections 9.5.1-9.5.5 focus on the case of relatively few individuals ($N$ is
small and $T \to \infty$). Examples are data on a few regions, firms, or industries
followed for many time periods. Then individual fixed effects, if desired, can
be easily handled by including dummy variables for each individual as
regressors. Instead, the focus is on more efficient GLS estimation under richer
models of the error process than those specified in the short-panel case.


Some subsequent subsections also consider the case where both $N \to \infty$
and $T \to \infty$. The section ends with brief coverage of panel data with unit
roots and cointegration.


9.5.1 Long-panel dataset 


The dataset used is a U.S. state-year panel from Baltagi, Griffin, and
Xiong (2000) on annual cigarette consumption and price for U.S. states over
30 years. The ultimate goal is to measure the responsiveness of per capita
cigarette consumption to real cigarette prices. Price varies across states, due
in large part to different levels of taxation, as well as over time.


The original data were for $N = 46$ states and $T = 30$, and it is not clear
whether we should treat $N \to \infty$, as we have done to date, or $T \to \infty$, or
both. This situation is not unusual for a panel that uses aggregated regional
data over time. To make explicit that we are considering $T \to \infty$, we use
data from only $N = 10$ states, similar to many countries where there may be
about 10 major regions (states or provinces).


mus209cigar.dta has the following data: 


. * Description of cigarette dataset 
. qui use mus209cigar, clear 


. describe 


Contains data from mus209cigar.dta
 Observations:           300                  A.C.Cameron & P.K.Trivedi
                                                (2022): Microeconometrics Using
                                                Stata, 2e
    Variables:             6                  1 Sep 2020 16:38

Variable      Storage   Display    Value
    name         type    format    label      Variable label
state           float   %9.0g                 U.S. state
year            float   %9.0g                 Year 1963 to 1992
lnp             float   %9.0g                 Log state real price of pack of
                                                cigarettes
lnpmin          float   %9.0g                 Log of min real price in adjoining
                                                states
lnc             float   %9.0g                 Log state cigarette sales in packs
                                                per capita
lny             float   %9.0g                 Log state per capita disposable
                                                income
Sorted by:


There are 300 observations, so each state-year pair is a separate observation
because 10 x 30 = 300. The quantity demanded (lnc) will depend on price
(lnp), price of a substitute (lnpmin), and income (lny).


Descriptive statistics can be obtained by using summarize: 


. * Summary of cigarette dataset
. summarize, separator(6)

    Variable |        Obs        Mean    Std. dev.       Min        Max
       state |        300         5.5     2.87708          1         10
        year |        300        77.5    8.669903         63         92
         lnp |        300    4.518424    .1406979   4.176332    4.96916
      lnpmin |        300      4.4308    .1379243     4.0428   4.831303
         lnc |        300    4.792591    .2071792   4.212128   5.690022
         lny |        300    8.731014    .6942426   7.300023    10.0385


The variables state and year have the expected ranges. The variability in
per capita cigarette sales (lnc) is actually greater than the variability in price
(lnp), with respective standard deviations of 0.21 and 0.14. All variables are
observed for all 300 observations, so the panel is indeed balanced.
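
Balance can also be checked directly rather than inferred from the observation counts. The following is a minimal sketch, assuming mus209cigar.dta is in the current directory; it declares the panel structure and tabulates the pattern of observed periods for each state.

```stata
* Sketch: confirm that the long panel is balanced
use mus209cigar, clear
xtset state year          // declare panel (state) and time (year) variables
xtdescribe                // pattern of observed periods per state
misstable summarize       // lists any variables with missing values
```

For a balanced panel, xtset reports the panel as strongly balanced, and xtdescribe shows a single participation pattern shared by all states.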


9.5.2 Pooled OLS and pooled feasible GLS 


A natural starting point is the two-way-effects model
$y_{it} = \alpha_i + \gamma_t + x_{it}'\beta + \varepsilon_{it}$. When the panel has few individuals relative to
the number of periods, the individual effects $\alpha_i$ (here state effects) can be
incorporated into $x_{it}$ as dummy-variable regressors. Then there are too many
time effects $\gamma_t$ (here year effects). Rather than trying to control for these in
ways analogous to the use of xtreg in the short-panel case, we can usually
take advantage of the natural ordering of time (as opposed to individuals)
and simply include a linear or quadratic trend in time.


We therefore focus on the pooled model

$$y_{it} = x_{it}'\beta + u_{it}$$

where the regressors $x_{it}$ include an intercept, often time and possibly time
squared, and possibly a set of individual indicator variables. We assume that
$N$ is quite small relative to $T$.
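
One way to set up the regressors just described is sketched below for the cigarette data. This is an illustration only: the variable name yearsq is our own, and the specifications reported later in this section include a linear trend without state indicators.

```stata
* Sketch: pooled model with state indicators and a quadratic time trend
use mus209cigar, clear
generate double yearsq = year^2            // hypothetical name for trend squared
regress lnc lnp lny lnpmin i.state year yearsq
testparm yearsq                            // is the quadratic term needed?
testparm i.state                           // joint test of the state effects
```

The default standard errors from regress assume i.i.d. errors, so these tests are only a rough guide until the error process is modeled.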


We consider pooled OLS and pooled feasible generalized least-squares
(PFGLS) of this model under a variety of assumptions about the error $u_{it}$. In
the short-panel case, it was possible to obtain standard errors that control for
serial correlation in the error without explicitly stating a model for serial
correlation. Instead, we could use cluster-robust standard errors, given a
small $T$ and $N \to \infty$. Now, however, $T$ is large relative to $N$, and it is
necessary to specify a model for serial correlation in the error. Also, given
that $N$ is small, it is possible to relax the assumption that $u_{it}$ is independent
over $i$.


9.5.3 The xtpcse and xtgls commands 


The xtpcse and xtgls commands are more suited than xtgee for pooled OLS
and GLS when data are from a long panel. They allow the error $u_{it}$ in the
model to be correlated over $i$, allow the use of an AR(1) model for $u_{it}$ over $t$,
and allow $u_{it}$ to be heteroskedastic. At the greatest level of generality,


$$u_{it} = \rho_i u_{i,t-1} + \varepsilon_{it} \qquad (9.5)$$


where $\varepsilon_{it}$ are serially uncorrelated but are correlated over $i$ with
$\mathrm{Cor}(\varepsilon_{it}, \varepsilon_{is}) = \sigma_{ts}$.


The xtpcse command yields (long) panel-corrected standard errors for
the pooled OLS estimator, as well as for a pooled least-squares estimator with
an AR(1) model for $u_{it}$. The syntax is


xtpcse depvar [indepvars] [if] [in] [weight] [, options]


The correlation() option determines the type of pooled estimator.
Pooled OLS is obtained by using correlation(independent). The pooled
AR(1) estimator with general $\rho_i$ is obtained by using correlation(psar1).
With a balanced panel, $y_{it} - \hat{\rho}_i y_{i,t-1}$ is regressed on $x_{it} - \hat{\rho}_i x_{i,t-1}$
for $t > 1$, whereas $\sqrt{1 - \hat{\rho}_i^2}\, y_{i1}$ is regressed on $\sqrt{1 - \hat{\rho}_i^2}\, x_{i1}$ for $t = 1$.
The pooled estimator with AR(1) error and $\rho_i = \rho$ is obtained by using
correlation(ar1). Then $\hat{\rho}$, calculated as the average of the $\hat{\rho}_i$, is used.
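
The AR(1) transformation can be carried out by hand to see what the command does. The sketch below is a simplification under stated assumptions: it uses a single common AR(1) coefficient estimated by regressing pooled OLS residuals on their first lag, which is not necessarily the estimate of the coefficient that xtpcse computes, so the results need not match xtpcse output exactly. The names uhat, one, and the _s suffix are our own.

```stata
* Sketch: manual AR(1) transformation with a common AR(1) coefficient
use mus209cigar, clear
xtset state year

* Step 1: pooled OLS residuals and a simple estimate of rho
quietly regress lnc lnp lny lnpmin year
predict double uhat, residuals
quietly regress uhat L.uhat, noconstant
scalar rho = _b[L.uhat]

* Step 2: transform every variable, including the constant
generate double one = 1
foreach v of varlist lnc lnp lny lnpmin year one {
    * quasi-difference for t > 1
    generate double `v'_s = `v' - rho*L.`v'
    * rescale the first observation of each state
    bysort state (year): replace `v'_s = sqrt(1 - rho^2)*`v' if _n == 1
}

* Step 3: OLS on the transformed data (no additional constant)
regress lnc_s lnp_s lny_s lnpmin_s year_s one_s, noconstant
```

The coefficient of one_s plays the role of the intercept. The default standard errors from this last regression are not panel corrected.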


In all cases, panel-corrected standard errors that allow heteroskedasticity
and correlation over $i$ are reported, unless the hetonly option is used, in
which case independence over $i$ is assumed, or the independent option is
used, in which case $\varepsilon_{it}$ is i.i.d.


The xtgls command goes further and obtains PFGLS estimates and
associated standard errors assuming the model for the errors is the correct
model. The estimators are more efficient asymptotically than those from
xtpcse, if the model is correctly specified. The command has the usual
syntax:


xtgls depvar [indepvars] [if] [in] [weight] [, options]


The panels() option specifies the error correlation across individuals,
where for our data an individual is a state. The panels(iid) option specifies
$u_{it}$ to be i.i.d., in which case the pooled OLS estimator is obtained. The
panels(heteroskedastic) option specifies $u_{it}$ to be independent with a
variance of $E(u_{it}^2) = \sigma_i^2$ that can be different for each individual. Because
there are many observations for each individual, $\sigma_i^2$ can be consistently
estimated. The panels(correlated) option additionally allows correlation
across individuals, with independence over time for a given individual, so
that $E(u_{it}u_{jt}) = \sigma_{ij}$. This option requires that $T > N$.


The corr() option specifies the serial correlation of errors for each
individual state. The corr(independent) option specifies $u_{it}$ to be serially
uncorrelated. The corr(ar1) option permits AR(1) autocorrelation of the
error with $u_{it} = \rho u_{i,t-1} + \varepsilon_{it}$, where $\varepsilon_{it}$ is i.i.d. The corr(psar1) option
relaxes the assumption of a common AR(1) parameter to allow
$u_{it} = \rho_i u_{i,t-1} + \varepsilon_{it}$. The rhotype() option provides various methods to
compute this AR(1) parameter. The default estimator is two-step feasible GLS,
whereas the igls option uses iterated feasible GLS. The force option enables
estimation even if observations are unequally spaced over time.


Additionally, we illustrate the community-contributed xtscc command
(Hoechle 2007). This generalizes xtpcse by applying the method of Driscoll
and Kraay (1998) to obtain Newey-West-type standard errors for pooled
OLS that allow autocorrelated errors of general form, rather than restricting
errors to be AR(1). Error correlation across panels, often called spatial
correlation, is assumed. The error is allowed to be serially correlated for $m$
lags. The default is for the program to determine $m$. Alternatively, $m$ can be
specified using the lags(m) option.


9.5.4 Application of the xtgls, xtpcse, and xtscc commands 


As an example, we begin with a PFGLS estimator that uses the most flexible
model for the error $u_{it}$, with flexible correlation across states and a distinct
AR(1) process for the error in each state. In principle, this is the best
estimator to use, but in practice, when $T$ is not much larger than $N$, there
can be finite-sample bias in the estimators and standard errors; see Beck and
Katz (1995). Then it is best, at the least, to use the more restrictive
corr(ar1) rather than corr(psar1).


We obtain 


. * Pooled GLS with error correlated across states and state-specific AR(1) 
. xtset state year 


Panel variable: state (strongly balanced) 
Time variable: year, 63 to 92 
Delta: 1 unit 


. xtgls lnc lnp lny lnpmin year, panels(correlated) corr(psar1)

Cross-sectional time-series FGLS regression

Coefficients:  generalized least squares
Panels:        heteroskedastic with cross-sectional correlation
Correlation:   panel-specific AR(1)

Estimated covariances      =        55          Number of obs     =        300
Estimated autocorrelations =        10          Number of groups  =         10
Estimated coefficients     =         5          Time periods      =         30
                                                Wald chi2(4)      =     342.15
                                                Prob > chi2       =     0.0000

         lnc | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
         lnp |  -.3260683   .0218214   -14.94   0.000    -.3688375   -.2832991
         lny |   .4646236   .0645149     7.20   0.000     .3381768    .5910704
      lnpmin |   .0174759   .0274963     0.64   0.525    -.0364159    .0713677
        year |  -.0397666   .0052431    -7.58   0.000    -.0500429   -.0294902
       _cons |   5.157994   .2753002    18.74   0.000     4.618416    5.697573


All regressors have the expected effects. The estimated price elasticity of
demand for cigarettes is -0.326; the income elasticity is 0.465; demand
declines by 4% per year (the coefficient of year is a semielasticity because
the dependent variable is in logs); and a higher minimum price in adjoining
states increases demand in the current state. There are 10 states, so there are
10 x 11/2 = 55 unique entries in the 10 x 10 contemporaneous error
covariance matrix, and 10 autocorrelation parameters $\rho_i$ are estimated.
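
The count of covariance parameters can be verified from the stored results. The sketch below assumes the data have been xtset as above; it relies on xtgls saving the estimated cross-sectional error covariance matrix in e(Sigma), and the matrix name Sigma is our own.

```stata
* Sketch: inspect the estimated error covariance matrix after xtgls
quietly xtgls lnc lnp lny lnpmin year, panels(correlated) corr(psar1)
display 10*(10 + 1)/2                        // unique entries in a 10 x 10 symmetric matrix
matrix Sigma = e(Sigma)
display rowsof(Sigma) " x " colsof(Sigma)    // dimensions of the matrix
ereturn list                                 // all results stored by xtgls
```

With $N$ states, the symmetric matrix has $N(N+1)/2$ unique entries, which grows quickly with $N$ and is one reason this estimator requires $T$ to be large relative to $N$.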


We now use xtpcse, xtgls, and community-contributed xtscc to obtain
the following pooled estimators and associated standard errors: 1) pooled
OLS with i.i.d. errors; 2) pooled OLS with standard errors assuming correlation
over states; 3) pooled OLS with standard errors assuming general serial
correlation in the error (to four lags) and correlation over states; 4) pooled
OLS that assumes an AR(1) error and then gets standard errors that
additionally permit correlation over states; 5) PFGLS with standard errors
assuming an AR(1) error; and 6) PFGLS assuming an AR(1) error and
correlation across states. In all cases of AR(1) error, we specialize to $\rho_i = \rho$.


. * Comparison of various pooled OLS and GLS estimators
. qui xtpcse lnc lnp lny lnpmin year, corr(ind) independent nmk

. estimates store OLS_iid

. qui xtpcse lnc lnp lny lnpmin year, corr(ind)

. estimates store OLS_cor

. qui xtscc lnc lnp lny lnpmin year, lag(4)

. estimates store OLS_DK

. qui xtpcse lnc lnp lny lnpmin year, corr(ar1)

. estimates store AR1_cor

. qui xtgls lnc lnp lny lnpmin year, corr(ar1) panels(iid)

. estimates store FGLSAR1

. qui xtgls lnc lnp lny lnpmin year, corr(ar1) panels(correlated)

. estimates store FGLSCAR

. estimates table OLS_iid OLS_cor OLS_DK AR1_cor FGLSAR1 FGLSCAR, b(%7.3f) se


Variable OLS_iid OLS_cor OLS_DK AR1_cor FGLSAR1 FGLSCAR 


lnp -0.583 -0.583 -0.583 -0.266 -0.264 -0.330 
0.129 0.169 0.279 0.049 0.049 0.026 

lny 0.365 0.365 0.365 0.398 0.397 0.407 
0.049 0.080 0.167 0.125 0.094 0.080 

lnpmin -0.027 -0.027 -0.027 0.069 0.070 0.036 
0.128 0.166 0.258 0.064 0.059 0.034 

year -0.033 -0.033 -0.033 -0.038 -0.038 -0.037 
0.004 0.006 0.012 0.010 0.007 0.006 

_cons 6.930 6.930 6.930 5.115 5.100 5.393 
0.353 0.330 0.527 0.544 0.414 0.361 


Legend: b/se 


For pooled OLS with i.i.d. errors, the nmk option normalizes the VCE by
$N - K$ rather than $N$, so that output is exactly the same as that from
regress with default standard errors. The same results could be obtained by
using xtgls with the corr(ind) panels(iid) nmk options. Allowing
correlation across states increases OLS standard errors by 30-50%.
Additionally, allowing for serial correlation (OLS_DK) leads to another
50-100% increase in the standard errors. The fourth and fifth estimators control
for at least an AR(1) error and yield roughly similar coefficients and standard
errors. The final column results are similar to those given at the start of this
section, where we used the more flexible corr(psar1) rather than
corr(ar1).
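
The stated equivalence with regress is easy to check. The following sketch assumes mus209cigar.dta is available; the stored-estimate names REG and PCSE_iid are our own.

```stata
* Sketch: pooled OLS via regress and via xtpcse with i.i.d. errors
use mus209cigar, clear
xtset state year
quietly regress lnc lnp lny lnpmin year
estimates store REG
quietly xtpcse lnc lnp lny lnpmin year, corr(ind) independent nmk
estimates store PCSE_iid
estimates table REG PCSE_iid, b(%7.3f) se
```

The two columns should agree with each other and with the OLS_iid column of the preceding table.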


9.5.5 FE and RE models 


As noted earlier, if there are few individuals and many time periods, 
individual-specific FE models can be fit with the least-squares dummy 
variable approach of including a set of dummy variables, here for each time 
period (rather than for each individual as in the short-panel case). 


Alternatively, one can use the xtregar command. This model is the
individual-effects model $y_{it} = \alpha_i + x_{it}'\beta + u_{it}$, with AR(1) error
$u_{it} = \rho u_{i,t-1} + \varepsilon_{it}$. This is a better model of the error than the i.i.d. error
model $u_{it} = \varepsilon_{it}$ assumed in xtreg, so xtregar potentially will lead to more
efficient parameter estimates.


The syntax of xtregar is similar to that for xtreg. The two key options
are fe and re. The fe option treats $\alpha_i$ as a fixed effect. Given an estimate of
$\rho$, we first transform to eliminate the effect of the AR(1) error, as described
after (9.5), and then transform again (mean-difference) to eliminate the
individual effect. The re option treats $\alpha_i$ as a random effect.


We compare pooled OLS estimates, RE estimates using xtreg and 
xtregar, and FE estimates using xtreg, xtregar, and xtscc. Recall that 
xtscc calculates either the OLS or regular within estimator but then estimates 
the VCE assuming quite general error correlation over time and across states. 
We have 


. * Comparison of various RE and FE estimators
. qui use mus209cigar, clear

. qui xtscc lnc lnp lny lnpmin, lag(4)

. estimates store OLS_DK

. qui xtreg lnc lnp lny lnpmin, fe

. estimates store FE_REG

. qui xtreg lnc lnp lny lnpmin, re

. estimates store RE_REG

. qui xtregar lnc lnp lny lnpmin, fe

. estimates store FE_REGAR

. qui xtregar lnc lnp lny lnpmin, re

. estimates store RE_REGAR

. qui xtscc lnc lnp lny lnpmin, fe lag(4)

. estimates store FE_DK

. estimates table OLS_DK FE_REG RE_REG FE_REGAR RE_REGAR FE_DK, b(%7.3f) se


Variable OLS_DK FE_REG RE_REG FE_RE~R RE_RE~R FE_DK
lnp -0.611 -1.136 -1.110 -0.260 -0.282 -1.136 
0.438 0.101 0.102 0.049 0.052 0.162 
lny -0.027 -0.046 -0.045 -0.066 -0.074 -0.046 
0.027 0.011 0.011 0.064 0.026 0.020 
lnpmin -0.129 0.421 0.394 -0.010 -0.004 0.421 
0.346 0.101 0.102 0.057 0.060 0.172 
_cons 8.357 8.462 8.459 6.537 6.708 8.462 
0.647 0.241 0.247 0.036 0.289 0.474 

Legend: b/se 


There are three distinctly different sets of coefficient estimates: those using
pooled OLS, those using xtreg to obtain FE and RE estimators, and those using
xtregar to obtain FE and RE estimators. The final set of estimates uses the fe
option of the community-contributed xtscc command. This produces the
standard within estimator but then finds standard errors that are robust to
both spatial (across panels) and serial autocorrelation of the error.


9.5.6 Interactive effects 


For panel data with many individuals and many time periods, Bai (2009)
proposed a model with interactive effects that is a richer model than one
with additive individual-specific effects and time-specific effects.


The interactive effects model specifies

$$y_{it} = x_{it}'\beta + \lambda_i' f_t + \varepsilon_{it}$$

where $f_t$ is a vector of unobserved common effects or common time shocks
and $\lambda_i$ is a corresponding vector of weights that vary at the individual level.
In factor-analysis terminology, $f_t$ is a factor and $\lambda_i$ are factor loadings. The
special case $\lambda_i = (\alpha_i, 1)'$ and $f_t = (1, \delta_t)'$ yields $\lambda_i' f_t = \alpha_i + \delta_t$. The
idiosyncratic error $\varepsilon_{it}$ may be serially and cross-sectionally dependent.


Stacking observations over all time periods for a given individual yields the
model $y_i = X_i\beta + F\lambda_i + \varepsilon_i$. Then $\beta$, $F$, and $\Lambda$ minimize
$\sum_{i=1}^{N} (y_i - X_i\beta - F\lambda_i)'(y_i - X_i\beta - F\lambda_i)$, where to ensure parameter
identification, we normalize $F'F/T = I$ and $\Lambda'\Lambda$ diagonal. Asymptotic theory
requires $N \to \infty$ and $T \to \infty$, with inference varying according to whether
$T/N \to 0$ or $N/T \to 0$ or $T/N \to c$ for some constant $c > 0$.


The interactive-effects model can be fit using the community-contributed
command regife (Gomez 2017).


9.5.7 Separate regressions 


The pooled regression specifies the same regression model for all individuals
in all years. Instead, we could have a separate regression model for each
individual unit:

$$y_{it} = x_{it}'\beta_i + u_{it}$$

This model has $NK$ parameters, so inference is easiest for a long panel with
a small $N$.


For example, suppose for the cigarette example we want to fit separate 
regressions for each state. Separate OLS regressions for each state can be 
obtained by using the statsby prefix with the by (state) option. We have 


. * Run separate regressions for each state
. statsby, by(state) clear: regress lnc lnp lny lnpmin year
(running regress on estimation sample)

      Command: regress lnc lnp lny lnpmin year
           By: state

Statsby groups


This leads to a dataset with 10 observations on state and the 5 regression 
coefficients. We have 

. * Report regression coefficients for each state 

. format _b* %9.2f 


. list, clean 


state _b_lnp _b_lny _b_lnp~n _b_year _b_cons
1. 1 -0.36 1.10 0.24 -0.08 2.10 
2. 2 0.12 0.60 -0.45 -0.05 5.14 
3. 3 -0.20 0.76 0.12 -0.05 2.72 
4. 4 -0.52 -0.14 -0.21 -0.00 9.56 
5. 5 -0.55 0.71 0.30 -0.07 4.76 
6. 6 -0.11 0.21 -0.14 -0.02 6.20 
7. 7 -0.43 -0.07 0.18 -0.03 9.14 
8. 8 -0.26 0.89 0.08 -0.07 3.67 
9. 9 -0.03 0.55 -0.36 -0.04 4.69 
10. 10 -1.41 1.12 1.14 -0.08 2.70 


In all states except one, sales decline as price rises, and in most states, 
sales increase with income. 


One can also test for poolability, meaning to test whether the parameters 
are the same across states. In this example, there are 5 x 10 = 50 parameters 
in the unrestricted model and 5 in the restricted pooled model, so there 
would be 45 restrictions to test. 
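
A simple version of this test can be sketched with factor-variable interactions. This is an illustration under a strong assumption, namely i.i.d. errors, which the preceding sections suggest is unlikely to hold for these data; the original dataset must be reloaded because statsby with the clear option replaced it.

```stata
* Sketch: F test of poolability (equal intercepts and slopes across states)
use mus209cigar, clear
* Unrestricted model: every coefficient differs by state
regress lnc i.state##c.(lnp lny lnpmin year)
* Joint test of 9 intercept shifts and 9 x 4 slope shifts
testparm i.state i.state#c.lnp i.state#c.lny i.state#c.lnpmin i.state#c.year
```

The test has 9 + 36 = 45 numerator degrees of freedom, matching the number of restrictions counted above.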


9.5.8 Heterogeneous panels 


Pesaran and Smith (1995) proposed aggregating the separate regressions to
perform inference on $E(\beta_i)$ by using the mean group (MG) estimator


$$\hat{\beta}_{MG} = (1/N) \sum_{i=1}^{N} \hat{\beta}_i$$
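
For the cigarette data, the MG estimate can be sketched directly from the state-by-state regressions of section 9.5.7. The sketch assumes the statsby variable names shown there; the standard error reported by mean is the sample standard deviation of the state-specific estimates divided by $\sqrt{N}$, which coincides with the usual nonparametric MG standard error.

```stata
* Sketch: mean group estimates as averages of state-specific OLS coefficients
use mus209cigar, clear
statsby, by(state) clear: regress lnc lnp lny lnpmin year
mean _b_lnp _b_lny _b_lnpmin _b_year _b_cons
```

With only $N = 10$ states, these averages are imprecise, so the output is best read as illustrative.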


Consistency of the MG estimator requires that regressors be uncorrelated
with the error. Pesaran (2006) relaxed this assumption by introducing
interactive effects. The model is the same as the interactive effects model of
Bai (2009), except that the parameters $\beta_i$ vary across individuals. Pesaran
shows that, for large $N$ and $T$, $\beta_i$ can be consistently estimated by adding as
regressors the period-specific cross-sectional averages $\bar{y}_t$ and $\bar{x}_t$ of $y_{it}$ and
$x_{it}$ and obtaining for each individual OLS estimates of the model


$$y_{it} = x_{it}'\beta_i + \bar{y}_t \gamma_i + \bar{x}_t'\pi_i + u_{it}$$


The common correlated estimator (CCE) of $E(\beta_i)$ is then


$$\hat{\beta}_{CCE} = (1/N) \sum_{i=1}^{N} \hat{\beta}_i$$


The CCE estimator treats the factor variables $f_t$ as nuisance parameters,
but in some applications estimates of $f_t$ are of intrinsic interest. The
augmented MG estimator of Eberhardt and Teal (2010) extends the CCE
estimator by providing estimates of $f_t$.


Pesaran and Smith (1995) proposed a pooled mean group estimator that 
extends the CCE estimator to dynamic models with regressors that potentially 
follow a unit-root process. 


The community-contributed xtpmg command (Blackburne and 
Frank 2007) implements the MG and pooled mean group estimators. The 
community-contributed xtmg command (Eberhardt 2012) implements the 
MG, CCE, and augmented MG estimators. 


9.5.9 Unit roots and cointegration 


For time series that are nonstationary because of unit roots, estimators no
longer have an asymptotic normal distribution as $T \to \infty$; see, for example,
Greene (2018, chap. 21). Similar nonstandard distributions arise for panel
data with unit roots when asymptotic theory relies on $T \to \infty$; see
Baltagi (2021, chap. 12).


If $N$ is small, say, $N < 10$, then seemingly unrelated equations methods
can be used. When $N$ is large, the panel aspect becomes more important.
Complications include the need to control for cross-sectional unobserved
heterogeneity when $N$ is large, asymptotic theory that can vary with exactly
how $N$ and $T$ both go to infinity, and the possibility of cross-sectional
dependence. At the same time, statistics that have nonnormal distributions
for a single time series can be averaged over cross-sections to obtain
statistics with a normal distribution.


Unit-root tests can have low power. Panel data may increase the power
because of now having time series for several cross-sections. The unit-root
tests can also be of interest per se, such as testing purchasing power parity,
as well as being relevant to consequent considerations of cointegration. A
dynamic model with cross-sectional heterogeneity is


$$y_{it} = \rho_i y_{i,t-1} + \phi_{i1} \Delta y_{i,t-1} + \cdots + \phi_{ip_i} \Delta y_{i,t-p_i} + z_{it}'\gamma_i + u_{it}$$


where lagged changes are introduced so that $u_{it}$ is i.i.d. Examples of $z_{it}$
include individual effects, with $z_{it} = (1)$; individual effects and individual
time trends, with $z_{it} = (1\ t)'$; and $\gamma_i = \gamma$ in the case of homogeneity. A unit-
root test is a test of $H_0: \rho_1 = \cdots = \rho_N = 1$. Levin, Lin, and Chu (2002)
proposed a test against the alternative of homogeneity,
$H_a: \rho_1 = \cdots = \rho_N = \rho < 1$, that is based on pooled OLS estimation using
specific first-step pooled residuals, where in both steps homogeneity ($\rho_i = \rho$
and $\phi_{ik} = \phi_k$) is imposed.


A number of different unit-root tests with panel data have been proposed, 
including tests that relax the assumption of independence across individuals 
to allow correlation. The xtunitroot command provides six different panel 
unit-root tests. 
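
As an illustration of the syntax, the sketch below applies three of these tests to log cigarette sales in the long panel of section 9.5.1. The choice of one lag and of a time trend is our own assumption rather than a recommendation, and with only $N = 10$ states, the large-$N$ approximations behind some of the tests may be poor.

```stata
* Sketch: panel unit-root tests for lnc (lag length and trend are assumptions)
use mus209cigar, clear
xtset state year
xtunitroot llc lnc, trend lags(1)               // Levin-Lin-Chu: common rho under the alternative
xtunitroot ips lnc, trend lags(1)               // Im-Pesaran-Shin: rho_i may differ across states
xtunitroot fisher lnc, dfuller trend lags(1)    // Fisher-type: combines state-level ADF tests
```

The tests differ in the alternative hypothesis, so rejection by one and not another is informative about heterogeneity in the autoregressive parameters.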


As in the case of a single time series, cointegration tests are used to
ensure that statistical relationships between trending variables are not
spurious. A quite general cointegrated panel model is


$$y_{it} = x_{it}'\beta_i + z_{it}'\gamma_i + u_{it}$$
$$x_{it} = x_{i,t-1} + \varepsilon_{it}$$


where $z_{it}$ is deterministic and can include individual effects and time
trends and $x_{it}$ are (co)integrated regressors.


Most tests of cointegration are based on the OLS residuals $\hat{u}_{it}$, but the
unit-root tests cannot be directly applied if $\mathrm{Cov}(u_{it}, \varepsilon_{it}) \neq 0$, as is likely.
The xtcointtest command provides three different panel cointegration
tests.
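
The syntax of the three tests is sketched below for the cigarette data. Whether lnc, lnp, and lny are in fact integrated of order one is an assumption here that should be checked first with panel unit-root tests; the inclusion of a trend in the second command is likewise our own choice.

```stata
* Sketch: panel cointegration tests (null hypothesis: no cointegration)
use mus209cigar, clear
xtset state year
xtcointtest kao lnc lnp lny                 // common cointegrating vector
xtcointtest pedroni lnc lnp lny, trend      // panel-specific vectors, with trends
xtcointtest westerlund lnc lnp lny          // variance-ratio test
```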


Single-equation estimators have been proposed that generalize to panels
fully modified OLS and dynamic OLS, and Johansen's system approach has
also been generalized to panels.


For further details, see the books by Baltagi (2021) and Pesaran (2015). 


9.6 Additional resources 


The major reference is [XT] Stata Longitudinal/Panel-Data Reference
Manual, especially [XT] xtivreg, [XT] xthtaylor, and [XT] xtabond. The
community-contributed xtivreg2 command (Schaffer 2005) provides a
broad range of IV-related estimators and associated tests. For estimation
with long panels, a useful Stata community-contributed command is xtscc,
as well as several others mentioned in section 9.5. Fernandez-Val and
Weidner (2018) survey finite-$T$ and finite-$N$ bias for fixed-effects


cross-sectional dependence. Chiang, Hansen, and Sasaki (2021) propose an 
extension to two-way cluster—robust standard errors that is also robust 
against common time effects. 


Many of the topics in this chapter appear in more specialized books on 
Lee (2002), and Pesaran (2015). Cameron and Trivedi (2005) present many 
of the methods in this chapter. 


9.7 Exercises 


1. For the model and data of section 9.2, obtain the panel IV estimator in
the FE model by applying the ivregress command to the mean-
differenced model with a mean-differenced instrument. Hint: For
example, for variable x, type by id: egen avex = mean(x) followed by
summarize x and then generate mdx = x - avex + r(mean). Verify that
you get the same estimated coefficients as you would with xtivreg,
fe.

2. For the model and data of section 9.4, use the xtdpdsys command 
given in section 9.4.6, and then perform specification tests using the 
estat abond and estat sargan commands. Use xtdpd at the end of 
section 9.4.8, and compare the results with those from xtdpdsys. Is 
this what you expect, given the results from the preceding specification 
tests? 

3. Consider the model and data of section 9.4, except consider the case of
just one lagged dependent variable. Throughout, estimate the
parameters of the models with the noconstant option. Consider
estimation of the dynamic model $y_{it} = \alpha_i + \gamma_1 y_{i,t-1} + \varepsilon_{it}$, when $T = 7$,
where $\varepsilon_{it}$ are serially uncorrelated. Explain why OLS estimation of the
transformed model $\Delta y_{it} = \gamma_1 \Delta y_{i,t-1} + \Delta \varepsilon_{it}$, $t = 2, \ldots, 7$, leads to
inconsistent estimation of $\gamma_1$. Propose an instrumental-variable
estimator of the preceding model where there is just one instrument.
Implement this just-identified IV estimator using the data on lwage and
the ivregress command. Obtain cluster-robust standard errors.
Compare with OLS estimates of the differenced model.

4. Continue with the model of the previous question. Consider the
Arellano-Bond estimator. For each time period, state what instruments
are used by the estat abond command. Perform the Arellano-Bond
estimator using the data on lwage. Obtain the one-step estimator with
robust standard errors. Obtain the two-step estimator with robust
standard errors. Compare the estimates and their standard errors. Is
there an efficiency gain compared with your answer in the previous
question? Use the estat abond command to test whether the errors $\varepsilon_{it}$
are serially uncorrelated. Use the estat sargan command to test
whether the model is correctly specified.


Chapter 10 
Introduction to nonlinear regression 


10.1 Introduction 


This chapter provides a brief introduction to nonlinear regression methods 
that are presented in much more detail in subsequent chapters. 


We focus on the most commonly used nonlinear-in-parameters models 
in microeconometrics, the probit and logit models for binary outcomes that 
take only two values. Examples include whether a commuter uses a car or 
uses other means of transport, whether a person is employed, and whether a 
person visits a doctor. 


Unlike the case for linear models, the marginal effect (ME) of a change 
in a regressor is no longer simply the associated slope parameter. This can 
make direct interpretation of parameter estimates difficult, so interpretation 
of results relies on estimated MEs. And estimation uses iterative numerical 
methods because there is no closed-form solution for parameter estimates. 


For standard nonlinear models, simply changing the command from
regress y x to probit y x, for example, leads to nonlinear estimation and
regression output that looks essentially the same as the output from
regress. The p-values and confidence intervals given are only
approximate, however, unless the sample size is very large. MEs can be
computed using the margins postestimation command.
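
The workflow just described can be sketched with a dataset that ships with Stata. The auto dataset and the choice of foreign as the binary outcome are our own assumptions for illustration and are unrelated to the doctor-visit example used in the rest of this chapter.

```stata
* Sketch: from linear regression to probit, then marginal effects
sysuse auto, clear
regress foreign mpg weight      // linear model for a binary outcome
probit foreign mpg weight       // same specification, nonlinear model
margins, dydx(*)                // average marginal effects of mpg and weight
```

The probit coefficients are not directly comparable with the regress coefficients, whereas the average MEs reported by margins are on the same probability scale as the linear-model slopes.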


10.2 Binary outcome models 


We consider probit regression and logit regression for whether a person 
visits a doctor. There is no need to first read the separate chapter 17 on 
binary outcome models because any necessary background is provided here. 


10.2.1 Doctor visit example 


Data on office-based physician visits by persons in the United States aged 
25—64 years come from the 2002 Medical Expenditure Panel Survey. The 
sample is the same as that used by Deb, Munkin, and Trivedi (2006). It 
excludes those receiving public insurance (Medicare and Medicaid) and is 
restricted to those working in the private sector but not self-employed. 


The dependent variable (visit) is a binary variable equal to 1 if the 
person visited a doctor at least once and equal to 0 if the person did not. The 
regressors used here are restricted to health insurance status (private), 
health status (chronic) and two socioeconomic characteristics (female and 
income) to keep Stata output short. We have 


. * Read in dataset, select one year of data, and describe key variables 
. qui use mus210mepsdocvisyoung 


. qui keep if year02 == 
. generate visit = docvis > 0 
. label variable visit "= 1 if doctor visit" 


. describe visit private chronic female income 


Variable      Storage   Display    Value
    name         type    format    label      Variable label
visit           float   %9.0g                 = 1 if doctor visit
private         byte    %8.0g                 Private insurance
chronic         byte    %8.0g                 Chronic condition
female          byte    %8.0g                 Female
income          float   %9.0g                 Income in $ / 1000


Summary statistics for these variables follow. 


. * Summary of key variables
. summarize visit private chronic female income

    Variable |        Obs        Mean    Std. dev.       Min        Max
       visit |      4,412    .6359927    .4812052          0          1
     private |      4,412    .7853581    .4106202          0          1
     chronic |      4,412    .3263826    .4689423          0          1
      female |      4,412    .4718948    .4992661          0          1
      income |      4,412    34.34018    29.03987    -49.999    280.777


From the summary statistics, 64% of the sample see a doctor at least once 
(visit=1), 79% have private health insurance, 33% have a chronic 
condition, 47% are female, and average annual income is $34,340. We use 
the full sample, including the three people who have negative income 
according to the command tabulate income (output not given). 


10.2.2 Probit and logit model definition 


For binary outcome models, the dependent variable y takes just two values, 
set to 1 and 0 for simplicity. It is then clear that the distribution of y is the 
Bernoulli, the same as that for a coin toss. If the probability that y = 1 is p, 
then the probability that y = 0 is necessarily 1 - p. 


Binary outcome regression models allow p to vary with the regressors $\mathbf{x}$. 
A linear model $p = \mathbf{x}'\boldsymbol{\beta}$ does not restrict the probability to lie between zero 
and one. Better models set $p = F(\mathbf{x}'\boldsymbol{\beta})$, where $F(\cdot)$ is a known function 
with the property that $0 < F(\cdot) < 1$. Any cumulative distribution function 
(c.d.f.) $F(z)$ for variable $z$ continuous on $(-\infty, \infty)$ has this desired 
property. The probit and logit models differ in this specified function $F(\cdot)$. 


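The probit regression model specifies that 

$$\Pr(y_i = 1|\mathbf{x}_i) = \Phi(\mathbf{x}_i'\boldsymbol{\beta})$$

where $\Phi(\cdot)$ is the standard normal c.d.f., defined as 

$$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp(-s^2/2)\, ds$$

The logit regression model specifies 

$$\Pr(y_i = 1|\mathbf{x}_i) = \Lambda(\mathbf{x}_i'\boldsymbol{\beta}) \qquad (10.1)$$

where $\Lambda(\cdot)$ is the logistic c.d.f. and $\Lambda(z) = e^z/(1 + e^z)$. 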
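
The two c.d.f.s can be checked numerically. The following is a minimal sketch that uses Stata's built-in normal() and invlogit() functions to evaluate $\Phi(z)$ and $\Lambda(z)$; the grid of index values is an arbitrary illustrative choice, not taken from the text.

```stata
* Evaluate the standard normal and logistic c.d.f.s on an illustrative grid
foreach z of numlist -3 -1 0 1 3 {
    display "z = " %5.1f `z' "   Phi(z) = " %6.4f normal(`z') ///
        "   Lambda(z) = " %6.4f invlogit(`z')
}
```

Both functions stay strictly between 0 and 1 and equal 0.5 at z = 0, but the logistic c.d.f. approaches its limits more slowly.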


As detailed in section 10.4, logit and probit models give different 
parameter estimates because of the different choice of function. But the two 
models produce very similar model fit, predicted probabilities, and MEs. 
Applied economists most often use the probit model, while in other areas of 
applied statistics such as biostatistics, the logit model is most frequently 
used. 


The probit and logit models are nonlinear functions of the regressors, so 
the ME on Pr(y = 1|x) of a change in the jth regressor is no longer simply 
$\beta_j$; it additionally depends on the value of the regressors $\mathbf{x}$. For the probit 
model, for example, the ME equals $\phi(\mathbf{x}'\boldsymbol{\beta})\beta_j$, where $\phi(\cdot)$ is the standard 
normal density. 


Consider the simple probit model with a scalar regressor x and 
Pr(y = 1|x) = $\Phi(0.5 + x)$. The first panel of figure 10.1 plots Pr(y = 1|x) 
against x. The relationship is clearly nonlinear. The second panel of 
figure 10.1 plots the ME on Pr(y = 1|x) of a change in x. The ME clearly 
varies with the value of x. Computation of MEs using the margins command 
is detailed below. 
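
A figure of this kind can be drawn directly from the two functions. The sketch below assumes a plotting range of (-3, 3) and uses hypothetical graph names; neither is given in the text. Because the coefficient of x is 1, the ME is simply the standard normal density evaluated at 0.5 + x.

```stata
* Sketch of a two-panel figure for Pr(y = 1|x) = Phi(0.5 + x)
* The range (-3, 3) and the graph names prob and me are illustrative assumptions
twoway function y = normal(0.5 + x), range(-3 3) ///
    title("Probit model with Pr(y=1) = Phi(0.5+x)") ///
    ytitle("Pr(y=1|x)") xtitle("x") name(prob, replace)
twoway function y = normalden(0.5 + x), range(-3 3) ///
    title("ME in probit model") ///
    ytitle("ME") xtitle("x") name(me, replace)
graph combine prob me
```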


Figure 10.1. Probit model and associated ME (first panel: probit model with Pr(y = 1) = $\Phi(0.5 + x)$; second panel: ME in probit model) 


The probit and logit estimators are obtained by maximum likelihood 
methods, detailed in section 17.3. Consistency requires that the function 
Pr(y = 1|x) be correctly specified. Once this function is correctly specified, 
the entire distribution is correctly specified because y takes only two values 
and Pr(y = 0|x) = 1 - Pr(y = 1|x) necessarily because probabilities sum 
to 1. 
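
For reference, the objective being maximized can be written compactly. With $F(\cdot)$ denoting either $\Phi(\cdot)$ or $\Lambda(\cdot)$, and assuming independent observations indexed $i = 1, \ldots, N$ (the symbol $N$ for the sample size is notation we assume here), the Bernoulli log likelihood is

```latex
\ln L(\boldsymbol{\beta}) = \sum_{i=1}^{N} \left[ y_i \ln F(\mathbf{x}_i'\boldsymbol{\beta})
  + (1 - y_i) \ln\{1 - F(\mathbf{x}_i'\boldsymbol{\beta})\} \right]
```

Each observation contributes $\ln F(\mathbf{x}_i'\boldsymbol{\beta})$ if $y_i = 1$ and $\ln\{1 - F(\mathbf{x}_i'\boldsymbol{\beta})\}$ if $y_i = 0$.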


10.3 Probit model 


We focus on probit regression. Qualitatively similar analysis holds for logit 
regression, presented in section 10.5. 


10.3.1 The probit command 


The probit ML estimator is obtained using the probit command. The syntax 
of the command is 


probit depvar [indepvars] [if] [in] [weight] [, options] 


This syntax is the same as that for command regress. Usually, there is no 
need to use any of the command options, aside from the vce(vcetype) 
option. 


10.3.2 Probit estimation results 


We use the probit command, with the vce(robust) option that provides 
heteroskedastic-robust standard errors; see section 10.3.3 for discussion of 
which standard errors to use. 


This yields the following results for whether a person visited the doctor. 


. * Probit regression (command probit) with robust standard errors 
. probit visit private chronic female income, vce(robust) 

Iteration 0:  log pseudolikelihood =    -2892.9
Iteration 1:  log pseudolikelihood = -2337.6553
Iteration 2:  log pseudolikelihood = -2331.8213
Iteration 3:  log pseudolikelihood = -2331.8084
Iteration 4:  log pseudolikelihood = -2331.8084

Probit regression                                       Number of obs =  4,412
                                                        Wald chi2(4)  = 910.77
                                                        Prob > chi2   = 0.0000
Log pseudolikelihood = -2331.8084                       Pseudo R2     = 0.1940

------------------------------------------------------------------------------
             |               Robust
       visit | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .7663244   .0528282    14.51   0.000     .6627832    .8698657
     chronic |   1.064481   .0511394    20.82   0.000     .9642499    1.164713
      female |   .5529806   .0434216    12.74   0.000     .4678759    .6380854
      income |   .0056173   .0008671     6.48   0.000     .0039178    .0073169
       _cons |  -.9594421   .0525925   -18.24   0.000    -1.062521   -.8563626
------------------------------------------------------------------------------


The output begins with an iteration log because the estimator is obtained 
numerically using an iterative procedure presented in section 16.2. In this 
case, four iterations are needed. Each iteration increases the log-likelihood 
function, as desired, and iterations cease when there is little change in the 
log-likelihood function. 


The remaining output from probit is remarkably similar to that for 
regress. The 4 regressors are jointly statistically significant at 5% because 
the Wald chi2(4) statistic has p = 0.00 < 0.05. This test is chi-squared 
distributed rather than F distributed because it relies on asymptotic theory. 
The pseudo-$R^2$, discussed with other model diagnostics in section 13.8, does 
not have the same interpretation as $R^2$ in the linear regression model. There 
is no analysis-of-variance table because this table is appropriate only for 
linear least squares with independent and identically distributed errors. 


The remaining output indicates that all regressors are individually 
statistically significant at level 0.05 because all p-values are less than 0.05. 
For each regressor, the output presents in turn the following: 


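Coefficients                $\widehat{\beta}_j$
Standard errors             $s_{\widehat{\beta}_j}$
z statistics                $z_j = \widehat{\beta}_j / s_{\widehat{\beta}_j}$
p-values                    $p_j = \Pr\{|Z| > |z_j|\}$, where $Z \sim N(0,1)$
95% confidence intervals    $\widehat{\beta}_j \pm 1.96 \times s_{\widehat{\beta}_j}$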


The z statistics and p-values are computed using the standard normal 
distribution, rather than the t distribution with $N - K$ degrees of freedom, 
because they are based on asymptotic normality. The p-values are for a 
two-sided test of whether $\beta_j = 0$. For a one-sided test of $H_0\colon \beta_j \le 0$ against 
$\beta_j > 0$, the p-value is half that reported in the table, provided that $z_j > 0$. 
For a one-sided test of $H_0\colon \beta_j \ge 0$ against $\beta_j < 0$, the p-value is half that 
reported in the table, provided $z_j < 0$. 
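
These quantities are easy to reproduce by hand. The sketch below starts from the coefficient and robust standard error reported for income in the probit output and recomputes the z statistic, the two-sided and one-sided p-values, and the 95% confidence interval. The scalar names are our own, and small discrepancies in the last digit can arise because the inputs are rounded.

```stata
* Reproduce the z statistic, p-values, and 95% interval for income
* from the reported coefficient and robust standard error
scalar b_inc  = .0056173
scalar se_inc = .0008671
scalar z_inc  = b_inc/se_inc
scalar p2_inc = 2*(1 - normal(abs(z_inc)))     // two-sided p-value
scalar p1_inc = p2_inc/2                       // one-sided, H0: beta <= 0 (z > 0 here)
scalar lo_inc = b_inc - invnormal(0.975)*se_inc
scalar hi_inc = b_inc + invnormal(0.975)*se_inc
display "z = " %5.2f z_inc "   two-sided p = " %6.4f p2_inc ///
    "   one-sided p = " %6.4f p1_inc
display "95% CI: [" %9.7f lo_inc ", " %9.7f hi_inc "]"
```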


A nonlinear model raises a new issue of interpretation of slope 
coefficients $\beta_j$. For example, what does the value 0.0056 for the coefficient 
of income mean? We address this important issue in detail in section 10.4. 


10.3.3 Standard error computation 


As for linear regression, there are several ways to obtain standard errors, 
depending on the types of data being analyzed. Complete details, with 
formulas, are given in section 13.4. A brief summary is given here. 


Default standard errors are based on the assumption that a model is 
correctly specified. With real-world data, however, even a good model is 
unlikely to be exactly correctly specified. And standard errors are needed not 
only for estimates of model parameters but also for subsequent statistics that 
are calculated such as MEs. 


For data that are independent across observations, it is standard to 
instead obtain heteroskedastic-robust standard errors, which for most 
commands are obtained using the vce(robust) option of a cross-sectional 
estimation command. With data that are clustered, with observations 
correlated if in the same cluster but independent if in different clusters, it is 
standard to obtain cluster-robust standard errors. These can be obtained by 
using the vce(cluster clustvar) option or, more simply, by using the 
vce(robust) option for commands such as the xt commands, which are for 
clustered data. 


The preceding probit results used the vce(robust) option. Using 
default standard errors instead, we obtain 

. * Probit regression (command probit) with default standard errors 
. probit visit private chronic female income, nolog 

Probit regression                                       Number of obs =   4,412
                                                        LR chi2(4)    = 1122.18
                                                        Prob > chi2   =  0.0000
Log likelihood = -2331.8084                             Pseudo R2     =  0.1940

------------------------------------------------------------------------------
       visit | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .7663244   .0529517    14.47   0.000      .662541    .8701079
     chronic |   1.064481   .0510181    20.86   0.000     .9644878    1.164475
      female |   .5529806   .0433727    12.75   0.000     .4679716    .6379897
      income |   .0056173   .0008175     6.87   0.000     .0040151    .0072196
       _cons |  -.9594421   .0527427   -18.19   0.000    -1.062816   -.8560683
------------------------------------------------------------------------------


The standard errors differ by at most about 6%, the largest difference being 
for income. For some commands, there can be little difference between default 
and heteroskedastic-robust standard errors, whereas for others, notably 
poisson, there can be a very substantial difference. 
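
The two sets of standard errors can also be placed side by side. The following sketch assumes the dataset from section 10.2.1 is in memory; the stored-estimate names def and rob are our own.

```stata
* Compare default and heteroskedastic-robust standard errors
* Assumes the doctor-visit data from section 10.2.1 are in memory
quietly probit visit private chronic female income
estimates store def
quietly probit visit private chronic female income, vce(robust)
estimates store rob
* The coefficients are identical across columns; only the standard errors differ
estimates table def rob, b(%9.6f) se(%9.6f) stats(N ll)
```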


For data that are clustered, such as in short panels of independent 
individuals, or cross-sectional data with natural groupings, such as villages, 
one must use the option vce(cluster clustvar) to obtain cluster-robust 
standard errors. If observations are correlated within cluster, these correct 
standard errors can be much larger than the incorrect default standard errors 
or heteroskedastic-robust standard errors. For cluster-robust standard errors, 
the number of clusters should be large; see section 3.4.6. 


Stata estimation commands with the vce(bootstrap) option provide 
standard errors using the bootstrap. The default is a paired bootstrap that 
assumes independent observations; it is discussed in more detail for the 
linear regression model in section 12.8.1. It is asymptotically 
equivalent to computing heteroskedastic-robust standard errors, provided the 
number of bootstrap replications is large. Similarly, a cluster bootstrap that 
assumes independence across clusters but not within cluster (the vce(bootstrap, 
cluster(clustvar)) option) is asymptotically equivalent to computing 
cluster-robust standard errors, provided the number of bootstrap replications 
is large. 
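
As a brief illustration, paired bootstrap standard errors for the probit model can be requested as follows. The number of replications and the seed are illustrative values of our own choosing, and the dataset from section 10.2.1 is assumed to be in memory.

```stata
* Paired bootstrap standard errors; reps() and seed() values are illustrative
probit visit private chronic female income, ///
    vce(bootstrap, reps(400) seed(10101)) nolog
```

Setting the seed makes the bootstrap standard errors reproducible across runs.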


Bootstrap methods and the jackknife method (the vce(jackknife) 
option) are detailed in chapter 12. In that chapter, we additionally consider a 
different use of the bootstrap to implement a more refined asymptotic theory 
that can lead to t statistics with better size properties and confidence 
intervals with better coverage in finite samples. 


10.3.4 Postestimation commands 


The ereturn list command provides a list of what estimation results are 
stored in e(); see section 1.6.2 for details following the regress command. 
Stored results include regression coefficients in e(b) and the estimated VCE 
in e(V). 
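
A short sketch of how these stored results might be inspected follows. It assumes the dataset from section 10.2.1 is in memory; the matrix name V is our own, and the index 4 relies on income being the fourth regressor in the command.

```stata
* Inspect results stored by probit
quietly probit visit private chronic female income, vce(robust)
ereturn list                     // names of all stored results
matrix list e(b)                 // coefficient row vector
matrix V = e(V)                  // copy of the estimated VCE
matrix list V
* The standard error of income is the square root of the fourth diagonal entry
display "se(income) = " _se[income] "   from e(V): " sqrt(V[4,4])
```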


Standard postestimation commands available after most estimation 
commands, including nonlinear model commands such as the probit 
command, have already been given in table 3.1. Many of these 
postestimation commands have already been used in preceding chapters. In 
this chapter, we use the predict and margins commands. 


To find specific postestimation commands available after command 
probit, for example, see [R] probit postestimation or use command help 
probit postestimation. This lists additional postestimation commands that 
are specific to the probit command: estat classification, estat gof, 
lroc, and lsens. 


10.3.5 Prediction 


The predict command with the pr option, the default option after the 
probit command, computes for each observation $\Phi(\mathbf{x}_i'\widehat{\boldsymbol{\beta}})$, the predicted 
probability that y = 1. 


We obtain the following predicted probabilities: 


. * Predicted probabilities from probit 
. qui probit visit private chronic female income, vce(robust) 
. predict phatprobit, pr 
. summarize visit phatprobit 

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       visit |      4,412    .6359927    .4812052          0          1
  phatprobit |      4,412    .6356915    .2316533    .168668    .992829


The predicted probabilities range from 0.169 to 0.993. The average predicted 
probability of 0.6357 is very close to the sample proportion 0.6360 of 
individuals who visited a doctor. 
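
The predicted probability can also be built up manually from the fitted index, which makes clear what predict is doing. This sketch assumes the dataset from section 10.2.1 is in memory; the variable names xbhat, phatmanual, and phatcheck are hypothetical.

```stata
* Manual computation of the predicted probability Phi(x'b)
quietly probit visit private chronic female income, vce(robust)
predict double xbhat, xb                     // fitted index x'b
generate double phatmanual = normal(xbhat)   // apply the standard normal c.d.f.
predict double phatcheck, pr                 // built-in prediction, for comparison
assert reldif(phatmanual, phatcheck) < 1e-8
summarize phatmanual phatcheck
```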


10.4 MEs and coefficient interpretation 


An ME or a partial effect measures the effect of a change in one of the 
regressors, say, $x_j$. For nonlinear models, such as probit and logit, there are 
remarkably many different methods for calculating MEs. 


For the probit model, the ME of interest is that for the conditional 
probability that the event of interest happens, and the discussion below 
focuses on MEs for Pr(y = 1|x). The methods carry over immediately to the 
more common case of computing MEs for the conditional mean because in 
binary outcome models, E(y|x) = Pr(y = 1|x). 


While we focus on nonlinear models here, the discussion overlaps 
considerably with the methods for linear regression presented in section 4.5. 


10.4.1 Calculus method and finite-difference method 


Using calculus for the probit model, we see the ME of the jth regressor is 

$$\mathrm{ME}_j = \frac{\partial \Pr(y = 1|\mathbf{x})}{\partial x_j} = \phi(\mathbf{x}'\boldsymbol{\beta})\beta_j \qquad (10.2)$$

This ME is not simply the relevant parameter $\beta_j$, and it varies with the point of 
evaluation $\mathbf{x}$. 


Calculus methods are not always appropriate. In particular, for an 
indicator variable, say, $d$, the relevant ME is the discrete change in the 
conditional mean when $d$ changes from 0 to 1. 

Let $\mathbf{x} = [\mathbf{z} \;\, d]$, where $\mathbf{z}$ denotes all regressors other than the jth, which is 
an indicator variable $d$, and let $\Pr(y = 1|\mathbf{z}, d) = \Phi(\mathbf{z}'\boldsymbol{\beta}_1 + \beta_2 d)$. Then the 
finite-difference method yields ME 

$$\mathrm{ME}_j = \Pr(y = 1|\mathbf{z}, d = 1) - \Pr(y = 1|\mathbf{z}, d = 0) = \Phi(\mathbf{z}'\boldsymbol{\beta}_1 + \beta_2) - \Phi(\mathbf{z}'\boldsymbol{\beta}_1)$$
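
To see that the two methods can give quite different answers, consider again the simple probit model Pr(y = 1|x) = $\Phi(0.5 + x)$ from section 10.2.2. The sketch below compares the calculus ME at x = 0 with the discrete change when x moves from 0 to 1; the evaluation point and the scalar names are illustrative choices of our own.

```stata
* Calculus versus finite-difference ME for Pr(y = 1|x) = Phi(0.5 + x), slope = 1
scalar me_calc = normalden(0.5 + 0)*1             // derivative evaluated at x = 0
scalar me_fd   = normal(0.5 + 1) - normal(0.5 + 0) // change as x goes from 0 to 1
display "calculus ME at x = 0:         " %6.4f me_calc
display "finite difference, 0 to 1:    " %6.4f me_fd
```

The calculus ME is roughly 0.35, while the one-unit discrete change is roughly 0.24, because the density falls as x increases over the interval.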


10.4.2 Average marginal effect, ME at mean, and ME at a representative 
value 


The ME varies with the point of evaluation x. Three common choices of 
evaluation are 1) at sample values and then average; 2) at the sample mean of 
the regressors; and 3) at representative values of the regressors. 


We use the following acronyms, where the first two follow Bartus (2005). 


AME   Average ME                                  Average of individual MEs
MEM   Marginal effect at mean                     ME at $\mathbf{x} = \bar{\mathbf{x}}$
MER   Marginal effect at a representative value   ME at $\mathbf{x} = \mathbf{x}^*$


The MEM and MER can differ substantially from the AME in nonlinear models. 


The ME averaged over individuals, the AME, is most commonly reported. 
The default is to compute a within sample average. A population AME can be 
obtained using the weight option of the margins command, provided that 
sampling weights are available. 


Sometimes, interest lies instead in the ME for the average individual; then 
the MEM is computed. And if interest lies in the ME for a particular 
representative individual, then the MER is computed. 


10.4.3 AME in treatment-effects models 


In the simplest treatment-effects examples, the treatment is a variable that 
takes one of two values according to whether an individual receives 
treatment. Interest lies in measuring the average treatment effect (ATE), the 
average difference across individuals in the outcome according to whether 
treated or not. 


In the current context of a parametric model, the treatment variable can be 
included as a binary regressor, and the AME for that regressor is the ATE. 


The treatment-effects literature presented in chapters 24 and 25 instead 
presents methods that focus on estimating the ATE, or the closely related ATE 
on the treated, under assumptions that can be weaker than those considered 
here. These assumptions do not necessarily require obtaining an unbiased or 
consistent estimate of the ME for each individual before averaging. For 
example, in a randomized experiment, we observe each individual’s outcome 
in only one of the two states of treatment or no treatment, so we cannot 
estimate the ME of treatment for a given individual. But if assignment is 
random, we can obtain the ATE in this context, as the difference in the average 
outcome of the two groups. 


10.4.4 The margins and margins, dydx commands 


The three ME measures can be computed using the margins postestimation 
command. 

The syntax of the margins command is 

margins [marginlist] [if] [in] [weight] [, response_options options] 

where marginlist is a list of factor variables or interactions that appear in the 
current estimation results, response_options specify the particular quantity to 
be computed, and options include particular values of regressors at which 
computation occurs. The margins command without any options computes 
the sample average value of the default quantity computed by the predict 
command. 


MEs are computed using the dydx() option. The argument can be a list of 
regressors, while dydx(*) computes the ME for all regressors. The default is 
to compute the AME. The option atmeans computes the MEM, and option at() 
computes the MER at specified values of the regressors. 
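
The three measures can be requested as in the following sketch, which assumes the dataset from section 10.2.1 is in memory. The representative values passed to at() are an arbitrary illustrative profile, not values used in the text.

```stata
* AME, MEM, and MER after probit with factor variables
quietly probit visit i.private i.chronic i.female income, vce(robust)
margins, dydx(*)                       // AME: average of individual MEs (default)
margins, dydx(*) atmeans               // MEM: ME at the sample means
* MER: ME for one illustrative profile (values are assumptions)
margins, dydx(*) at(private=1 chronic=0 female=1 income=30)
```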


10.4.5 Probit model application 


We compute the average of MEs across individuals (the AME) for the 
previously fit probit model. 


. * AMEs using calculus method 
. qui probit visit private chronic female income, vce(robust) 
. margins, dydx(*) 

Average marginal effects                                Number of obs = 4,412
Model VCE: Robust

Expression: Pr(visit), predict()
dy/dx wrt:  private chronic female income

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .2291266   .0146652    15.62   0.000     .2003833    .2578699
     chronic |   .3182738   .0131955    24.12   0.000      .292411    .3441366
      female |    .165338   .0122877    13.46   0.000     .1412546    .1894215
      income |   .0016796   .0002555     6.57   0.000     .0011787    .0021804
------------------------------------------------------------------------------


When averaged across individuals, the probability of visiting a doctor is 
0.229 higher, or 22.9 percentage points higher, for someone with private 
insurance. And a $1,000 increase in income, a one-unit change in variable 
income, is associated with a 0.00168 increase in the probability of visiting a 
doctor. The standard errors and associated z statistics and p-values are based 
on heteroskedastic-robust standard errors because the immediately preceding 
probit command used the vce(robust) option. 


These AMEs are computed using calculus methods. For binary regressors 
private, chronic, and female, it is more natural to use the finite-difference 
method. This can be done by fitting the model using the factor variable prefix 
i. for the discrete-valued regressors. 


For the probit model using factor variables, we obtain 


. * AMEs using finite-difference method 
. qui probit visit i.private i.chronic i.female income, vce(robust) 
. margins, dydx(*) 

Average marginal effects                                Number of obs = 4,412
Model VCE: Robust

Expression: Pr(visit), predict()
dy/dx wrt:  1.private 1.chronic 1.female income

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   1.private |   .2508057   .0175548    14.29   0.000      .216399    .2852123
   1.chronic |    .313282   .0128372    24.40   0.000     .2881215    .3384425
    1.female |   .1693009   .0130795    12.94   0.000     .1436654    .1949363
      income |   .0016796   .0002555     6.57   0.000     .0011787    .0021804
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level. 


There is some change in the estimated AME for the binary regressors, as great 
as from 0.229 to 0.251 for regressor private. There is no change for the 
continuous regressor income. 
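
The AMEs can also be computed by hand, which shows exactly what is being averaged. The sketch below assumes the dataset from section 10.2.1 is in memory and uses hypothetical variable names. It applies the calculus formula (10.2) to income and the finite-difference formula to private, in each case evaluating at every observation's own regressor values and then averaging.

```stata
* Manual AMEs after probit with factor variables
quietly probit visit i.private i.chronic i.female income, vce(robust)
predict double xbidx, xb
* Calculus method for continuous income: phi(x'b) times the income coefficient
generate double me_income = normalden(xbidx)*_b[income]
* Finite difference for private: index with private set to 1, then set to 0
generate double xb_p1 = xbidx + (1 - private)*_b[1.private]
generate double xb_p0 = xbidx - private*_b[1.private]
generate double me_private = normal(xb_p1) - normal(xb_p0)
summarize me_income me_private
```

If the sketch is correct, the two sample means should agree with the AMEs that margins reports for income and 1.private. The standard errors are not reproduced this way; margins obtains those by the delta method.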


Factor variables are especially useful for obtaining MEs in models with 
interacted regressors. For example, suppose private health insurance status is 
interacted with variable income. Then, 


. * AMEs with interacted regressors 
. probit visit i.private##c.income i.chronic i.female, vce(robust) nolog noheader 

------------------------------------------------------------------------------
             |               Robust
       visit | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   1.private |   .7278849   .0800428     9.09   0.000     .5710039    .8847659
      income |   .0037553   .0028952     1.30   0.195    -.0019192    .0094297
             |
     private#|
    c.income |
          1  |   .0020059   .0030257     0.66   0.507    -.0039244    .0079363
             |
   1.chronic |   1.065191   .0511475    20.83   0.000     .9649438    1.165438
    1.female |    .552433   .0434335    12.72   0.000     .4673049    .6375611
       _cons |  -.9263058    .073073   -12.68   0.000    -1.069526   -.7830854
------------------------------------------------------------------------------

. margins, dydx(*) 

Average marginal effects                                Number of obs = 4,412
Model VCE: Robust

Expression: Pr(visit), predict()
dy/dx wrt:  1.private income 1.chronic 1.female

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   1.private |   .2609297   .0228076    11.44   0.000     .2162276    .3056317
      income |   .0015848   .0002874     5.51   0.000     .0010214    .0021482
   1.chronic |   .3134803   .0128384    24.42   0.000     .2883176     .338643
    1.female |   .1691342   .0130861    12.92   0.000     .1434859    .1947824
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level. 


The regressors include income, private, and income $\times$ private. The margins 
command obtains the AME for income and the AME for private health 
insurance, allowing for their interaction. The AMEs are very close to those for 
the model without interactions because the interaction term had very little 
explanatory power. 


The standard errors that are reported above hold the regressors fixed. The 
option vce(unconditional) of the margins command additionally allows 
for variation due to sampling of the regressors; see section 13.7.9. In the 
preceding examples, this leads to less than a 1% increase in the standard 
errors of the AMEs. 


10.4.6 Simple interpretations of coefficients 


In a single-index model, the regressors $\mathbf{x}$ enter as a function of the linear 
combination $\mathbf{x}'\boldsymbol{\beta}$. The probit and logit models, like many other nonlinear 
models, are of single-index form. Coefficient interpretation is simplified in 
such models. 

We focus on the probit model, with function $\Phi(\mathbf{x}'\boldsymbol{\beta})$. Then from (10.2) 
the ME is $\mathrm{ME}_j = \phi(\mathbf{x}'\boldsymbol{\beta})\beta_j$, where $\phi(\cdot)$ is the standard normal density. 
Because $\phi(\cdot) > 0$ always, it follows that the sign of the ME equals the sign of 
$\beta_j$. For example, if $\beta_j > 0$, then an increase in $x_j$ is associated with an 
increase in Pr(y = 1). 


The ratio of MEs for two different regressors equals the ratio of the 
corresponding parameters because 

$$\frac{\mathrm{ME}_j}{\mathrm{ME}_k} = \frac{\phi(\mathbf{x}'\boldsymbol{\beta})\beta_j}{\phi(\mathbf{x}'\boldsymbol{\beta})\beta_k} = \frac{\beta_j}{\beta_k}$$

Therefore, if one coefficient is twice as big as another, then so too is the ME. 
This property applies to most commonly used nonlinear regression models in 
which the regressors appear as a linear combination $\mathbf{x}'\boldsymbol{\beta}$, aside from 
multinomial models. 


For example, from section 10.3.2, regressor private has coefficient 
0.766, and regressor chronic has coefficient 1.064. The effects for both 
regressors are positive because the coefficients are positive and $\phi(\mathbf{x}'\boldsymbol{\beta})$ is 
positive. And having a chronic condition is associated with a 1.4 times bigger 
change in the probability of a doctor visit than having private insurance 
(1.064/0.766 = 1.4). 


Additional interpretation of coefficients can be possible for specific 
single-index models. For example, for the probit model, 
$\mathrm{ME}_j = \phi(\mathbf{x}'\boldsymbol{\beta})\beta_j \le 0.4\beta_j$ as $\phi(z)$ takes a maximum value of $1/\sqrt{2\pi} \approx 0.399$ at 
$z = 0$. 
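
Both properties can be verified with a few lines of arithmetic. The sketch below uses the probit coefficients from section 10.3.2 and the calculus-method AMEs from section 10.4.5 for chronic and private, and then evaluates the maximum of the standard normal density.

```stata
* Ratio of coefficients versus ratio of calculus-method AMEs (chronic/private)
display "coefficient ratio: " %6.4f 1.064481/.7663244
display "AME ratio:         " %6.4f .3182738/.2291266
* Maximum of the standard normal density, attained at z = 0
display "1/sqrt(2*pi) = " %6.4f 1/sqrt(2*_pi) ///
    "   normalden(0) = " %6.4f normalden(0)
```

The two ratios agree to the reported precision. The agreement holds for the calculus-method AMEs; the finite-difference AMEs for binary regressors need not have exactly this ratio.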


10.4.7 Comparison with linear least squares 


What if we fit the model by OLS regression rather than probit regression? 


This alternative model is called the linear probability model. It specifies 
that 

$$\Pr(y = 1|\mathbf{x}) = \mathbf{x}'\boldsymbol{\beta}$$


which does not restrict the probability that y = 1 be in the (0, 1) interval. 
And it is implicitly based on an underlying normal distribution for y, whereas 
the data are clearly Bernoulli distributed. 


The regress command yields 


. * OLS regression (command regress) for linear probability model 
. regress visit private chronic female income, vce(robust) noheader 

------------------------------------------------------------------------------
             |               Robust
       visit | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .2651354   .0171919    15.42   0.000     .2314307    .2988402
     chronic |   .3059204   .0126069    24.27   0.000     .2812046    .3306361
      female |    .170008   .0131967    12.88   0.000     .1441359      .19588
      income |   .0016152    .000227     7.11   0.000     .0011701    .0020603
       _cons |   .1922274   .0154689    12.43   0.000     .1619006    .2225542
------------------------------------------------------------------------------

. predict phatols, xb 
. summarize visit phatols 

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       visit |      4,412    .6359927    .4812052          0          1
     phatols |      4,412    .6359927    .2290251   .1922274   1.216793


The slope coefficient estimates are approximately one-third the size of the 
corresponding probit estimates given in section 10.3.2. At the same time, the 
OLS slope coefficients are equal to the MEs because the model is linear. These 
MEs are quite close to the probit model AMEs given in section 10.4.5. The 
biggest difference is 0.265 versus 0.229 (using calculus methods) or 0.251 
(using the finite-difference method) for variable private. 


In this example, one observation had a predicted probability as high as 
1.217, so OLS regression did lead to predicted probabilities outside the (0, 1) 
interval. More generally, OLS estimation may be a useful first step, but aside 
from the special case of a fully saturated model, it is best to use the probit or 
logit model. 


10.5 Logit model 


The logit model defined in (10.1) specifies $\Pr(y_i = 1|\mathbf{x}_i)$ to equal $\Lambda(\mathbf{x}_i'\boldsymbol{\beta})$, 
where $\Lambda(z) = e^z/(1 + e^z)$, rather than $\Phi(\mathbf{x}_i'\boldsymbol{\beta})$. 

The functions $\Lambda(\cdot)$ and $\Phi(\cdot)$ not only are different functions but also are 
scaled quite differently. For example, while $\Phi(0) = \Lambda(0) = 0.5$ and both 
functions are symmetric about 0, $\Phi(1) \approx 0.84 \neq \Lambda(1) \approx 0.73$. In fact, it can 
be shown that $\Phi(z) \approx \Lambda(1.7z)$, so we might expect that 
$\Phi(\mathbf{x}'\boldsymbol{\beta}_{\text{probit}}) \approx \Lambda(1.7\mathbf{x}'\boldsymbol{\beta}_{\text{probit}})$, leading to logit coefficients that will be 
approximately 1.7 times the probit coefficients. 


The logit command has syntax similar to that given for the probit 
model in section 10.3.1. Logit regression in the current application yields 


. * Logit regression (command logit) 
. logit visit private chronic female income, vce(robust) 

Iteration 0:  log pseudolikelihood =    -2892.9
Iteration 1:  log pseudolikelihood = -2349.2911
Iteration 2:  log pseudolikelihood = -2332.1534
Iteration 3:  log pseudolikelihood = -2332.1197
Iteration 4:  log pseudolikelihood = -2332.1197

Logistic regression                                     Number of obs =  4,412
                                                        Wald chi2(4)  = 798.75
                                                        Prob > chi2   = 0.0000
Log pseudolikelihood = -2332.1197                       Pseudo R2     = 0.1938

------------------------------------------------------------------------------
             |               Robust
       visit | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |    1.27266   .0896929    14.19   0.000     1.096866    1.448455
     chronic |   1.832121    .092782    19.75   0.000     1.650271     2.01397
      female |   .9280678   .0737619    12.58   0.000     .7834971    1.072639
      income |   .0095378   .0015245     6.26   0.000     .0065497    .0125258
       _cons |  -1.607674   .0907181   -17.72   0.000    -1.785478    -1.42987
------------------------------------------------------------------------------


The logit slope coefficients are indeed approximately 1.7 times the probit 
slope coefficients obtained in section 10.3.2. At the same time, the t-ratios 
are within 5% of each other. 
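
The scaling argument can be checked directly. The sketch below first evaluates the two c.d.f.s at the points mentioned above, and then forms the ratios of the reported logit slope estimates to the probit slope estimates from section 10.3.2.

```stata
* Phi(1) is close to Lambda(1.7), not to Lambda(1)
display "Phi(1) = " %6.4f normal(1) "   Lambda(1) = " %6.4f invlogit(1) ///
    "   Lambda(1.7) = " %6.4f invlogit(1.7)
* Logit/probit coefficient ratios: private, chronic, female, income
display %6.2f 1.27266/.7663244 %6.2f 1.832121/1.064481 ///
    %6.2f .9280678/.5529806 %6.2f .0095378/.0056173
```

All four ratios lie between roughly 1.66 and 1.72, in line with the 1.7 rule of thumb.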


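The ME in the logit model obtained using calculus methods is 

$$\mathrm{ME}_j = [\Lambda(\mathbf{x}'\boldsymbol{\beta}) \times \{1 - \Lambda(\mathbf{x}'\boldsymbol{\beta})\}]\beta_j$$

Because $0 < \Lambda(\cdot) \times \{1 - \Lambda(\cdot)\} \le 0.25$ always, it follows that the sign of 
the ME equals the sign of $\beta_j$ and that $\mathrm{ME}_j \le 0.25 \times \beta_j$. 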


The AMEs following logit regression are 


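. * AMEs from logit regression 
. margins, dydx(*) 

Average marginal effects                                Number of obs = 4,412
Model VCE: Robust

Expression: Pr(visit), predict()
dy/dx wrt:  private chronic female income

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .2258599   .0145665    15.51   0.000     .1973102    .2544097
     chronic |   .3251478   .0141019    23.06   0.000     .2975085     .352787
      female |   .1647049   .0122415    13.45   0.000     .1407119    .1886978
      income |   .0016927   .0002659     6.37   0.000     .0011716    .0022138
------------------------------------------------------------------------------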


These AMEs are very similar to those obtained after probit regression, 
differing only in the third significant digit. And from output not given, the 
predicted probabilities from the two models are very similar and have 
correlation 0.9998. 


The logit model has the additional advantage that it implies that 

$$\frac{\Pr(y = 1|\mathbf{x})}{1 - \Pr(y = 1|\mathbf{x})} = \exp(\mathbf{x}'\boldsymbol{\beta})$$

The left-hand side is the odds ratio, the ratio of the probability of the event 
occurring to the probability of the event not occurring. A one-unit increase in 
$x_j$ is associated with multiplication of the odds ratio by $\exp(\beta_j)$. The or option 
of the logit command reports exponentiated coefficients. 
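
As a brief sketch, assuming the dataset from section 10.2.1 is in memory, the exponentiated coefficients can be displayed and checked against a manual calculation as follows.

```stata
* Logit with exponentiated coefficients reported
logit visit private chronic female income, or vce(robust) nolog
* The stored coefficients are unchanged by the or option; exponentiate by hand
display "exp(b) for private = " %6.3f exp(_b[private])
```

The or option changes only how results are displayed, so _b[private] still holds the coefficient on the original scale.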


10.6 Nonlinear least squares 


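Another common estimator of nonlinear models is the nonlinear least 
squares (NLS) estimator $\widehat{\boldsymbol{\beta}}$, which minimizes the sum of squared residuals 

$$Q(\boldsymbol{\beta}) = \sum_{i=1}^{N} \{y_i - m(\mathbf{x}_i, \boldsymbol{\beta})\}^2$$

where $m(\mathbf{x}, \boldsymbol{\beta})$ is the specified functional form for $E(y|\mathbf{x})$, the conditional 
mean of y given x. 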


If the conditional mean function is correctly specified, then the NLS 
estimator is consistent and asymptotically normally distributed. If the 
data-generating process is $y_i = m(\mathbf{x}_i, \boldsymbol{\beta}) + u_i$, where the errors are independent 
and identically distributed $(0, \sigma^2)$, then NLS has desirable efficiency 
properties, and the NLS default estimate of the VCE is correct. Otherwise, a 
robust estimate of the VCE should be used. 


10.6.1 The nl command 


Command nl implements NLS regression. The simplest form of the command 
directly defines the conditional mean rather than calling a program or 
function and has syntax 

nl (depvar=<sexp>) [if] [in] [weight] [, options] 


where <sexp> is a substitutable expression. More details are provided in 
section 13.3.6 and in [R] nl. The only relevant option for our analysis here is 
option vce() for the type of estimate of the variance matrix of the estimates. 


10.6.2 NLS for probit model 


We consider application to the probit model, in which case 
$m(\mathbf{x}_i, \boldsymbol{\beta}) = \Phi(\mathbf{x}_i'\boldsymbol{\beta})$. The normal() function computes the standard normal 
c.d.f. $\Phi(\cdot)$, and the notation xb: is used to define a linear combination of 
variables; see section 13.3.6. As discussed in section 10.3.3, the 
vce(robust) option is used, and we obtain 


. * Nonlinear least-squares regression (command nl) for probit model 
. nl (visit = normal({xb: private chronic female income}+{b0})), vce(robust) 

Iteration 0:  residual SS = 793.1719
Iteration 1:  residual SS = 784.2946
Iteration 2:  residual SS = 784.1461
Iteration 3:  residual SS = 784.1458
Iteration 4:  residual SS = 784.1458
Iteration 5:  residual SS = 784.1458
Iteration 6:  residual SS = 784.1458

Nonlinear regression                                Number of obs =      4,412
                                                    R-squared     =     0.7205
                                                    Adj R-squared =     0.7202
                                                    Root MSE      =   .4218197
                                                    Res. dev.     =   4899.035

------------------------------------------------------------------------------
             |               Robust
       visit | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
 /xb_private |   .7570137   .0536031    14.12   0.000     .6519248    .8621027
 /xb_chronic |    1.07188   .0564375    18.99   0.000     .9612344    1.182526
  /xb_female |   .5551205     .04543    12.22   0.000     .4660548    .6441861
  /xb_income |    .005946   .0009279     6.41   0.000     .0041269    .0077651
         /b0 |  -.9598078   .0545463   -17.60   0.000    -1.066746   -.8528696
------------------------------------------------------------------------------


The nl coefficient estimates are similar to those from probit (within 2% for 
all regressors except income) given in section 10.3.2. 

The nl robust standard errors are as much as 10% higher (for chronic). 
This efficiency loss is expected because, for a binary outcome, the NLS 
estimator differs from the fully efficient MLE that is obtained using the 
probit command. 


At the same time, the NLS estimator is much better than the OLS estimator 
because it uses a model that will lead to predicted probabilities lying 
between zero and one. 


The model diagnostic statistics given include $R^2$ computed as the model 
(or explained) sum of squares divided by the total sum of squares, the root 
MSE that is the estimate $s$ of the standard deviation $\sigma$ of the model error, and 
the deviance defined in section 13.8.3, which is a measure rarely used in 
econometrics. Note that $R^2 = 0.7205$ is a very different measure than the 
probit model pseudo-$R^2 = 0.1940$. 


10.7 Other nonlinear estimators 


In this section, we provide a brief summary of estimators detailed in the 
second volume. 


When the density of the dependent variable is specified, estimation is by 
maximum likelihood because the MLE is fully efficient if the density is 
correctly specified. This is the case for the probit command, for example. 
Many of the chapters in volume 2 present commands that obtain ML 
estimates for the standard nonlinear regression models, such as multinomial 
models and count models. 


When the specified density is not one already incorporated in Stata as a 
Stata command, one can still compute the MLE using the mlexp command, in 
the simplest cases, or the ml command. These commands require providing 
an algebraic expression for the log density. Chapters 13 and 16 provide 
details. 


The generalized linear model (GLM) framework is the standard nonlinear 
model framework in many areas of applied statistics, most notably 
biostatistics. GLM estimators are essentially generalizations of least squares 
regression to nonlinear regression models for which there is a natural 
starting point for modeling the conditional mean and for modeling the 
intrinsic heteroskedasticity. GLMs cover many of the standard data types, 
including normal for $y$ continuous on $(-\infty, \infty)$, gamma for $y$ 
continuous on $(0, \infty)$, binomial and Bernoulli for number of successes 
in a given number of trials, and Poisson and negative binomial for count 
data with $y = 0, 1, 2, \ldots$. GLM estimators have the important 
robustness-to-misspecification property that they are consistent provided 
only that the conditional mean function is correctly specified. 


The glm command fits models in this class, where the particular model 
being used is specified as a command option. These models include the 
probit and logit models. The glm command provides more estimation and 
postestimation options than do model-specific commands such as the 
probit and logit commands. In particular, additional model diagnostics 
can be obtained, and the glm command has an option to compute 
heteroskedasticity- and autocorrelation-consistent standard errors for 
time-series data. Section 13.3.8 provides further details. 


Generalized method of moments estimation for linear models was 
presented in chapters 7 and 9. The gmm command, which will be discussed 
in more detail in section 13.3.10, enables generalized method of moments 
estimation in nonlinear models. 


The nonlinear model used extensively in chapter 13 is one with 
exponential conditional mean, so $E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})$. 
Then the coefficients can be interpreted as semielasticities. For example, 
if $\beta_j = 0.2$, then a one-unit change in $x_j$ is associated with a 0.2 
proportionate change, or a 20% change, in $E(y|\mathbf{x})$. This model can 
be fit using the poisson command, even if $y$ is not a count. 
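
The 20% figure is the usual first-order approximation. As a minimal sketch 
of the arithmetic (the value 0.2 is the illustrative coefficient from the 
text, not an estimate, and the scalar names are hypothetical), the exact 
proportionate change implied by an exponential mean is $\exp(\beta_j) - 1$: 

```stata
* Semielasticity: approximate versus exact proportionate change in E(y|x)
* for a one-unit change in x_j when beta_j = 0.2 (illustrative value)
scalar bj = 0.2
scalar exact = exp(bj) - 1
display "approximate change: " %6.4f bj
display "exact change:       " %6.4f exact
```

The two are close for small coefficients, and the gap widens as the 
absolute value of $\beta_j$ grows. 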


10.8 Additional resources 


A complete listing of estimation commands can be obtained using help 
estimation commands. For a given estimation command such as probit, 
see the entries [R] probit and [R] probit postestimation and the 
corresponding online help. 


Graduate econometrics texts give considerable detail on estimation and 
less on prediction and computation of MEs. 


Chapter 13 provides further detail on general methods for nonlinear 
models, including discussion of model diagnostics and the nl, mlexp, ml, 
glm, and gmm commands. Later chapters focus on the leading specific 
nonlinear models. 


10.9 Exercises 


1. You fit a probit model and estimate that $\Pr(y = 1|x) = \Phi(-1 + 2x)$. 
   Give the formula for the ME of $x$ on $\Pr(y = 1|x)$ using calculus 
   methods and using the finite-difference method. Then, compute the 
   MEM using each method and the knowledge that $\bar{x} = 1$. Also, compute 
   the MER at $x = 2$ using each method. Comment on any differences. 
   Given the above information, can the AME be computed? Explain. 

2. Repeat the previous question for a logit model with 
   $\Pr(y = 1|x) = \Lambda(-1 + 2x)$. 

3. Perform probit regression similar to section 10.3.1 of visit on 
   private, chronic, and income separately for the male and female 
   samples, using factor variables for the discrete regressors. Compare 
   coefficient estimates across the two samples. Compare the AME across 
   the two samples. Which, if any, is the more meaningful comparison? 
   Perform a test at a significance level of 0.05 of whether there is a 
   difference across the two samples. Hint: Nest the two samples in a 
   larger model. 

4. Run the full-sample logit and probit regressions of visit on private, 
   chronic, female, and income. Use the predict command to generate 
   fitted probabilities for each regression. Use the twoway scatter 
   graphing command to plot a graph of the logit predictions on the probit 
   predictions. What do you find? Comment. 

5. In this question, use 1997 data, rather than 2002 data. From 
   section 10.4.7, the linear probability model has deficiencies. 
   Nonetheless, fit the model by linear regression of visit on private, 
   chronic, female, and income. Also, fit the corresponding logit model, 
   using factor variables. Compare the AMEs across the two models and 
   comment. Next, compare the MEMs across the two models and with the 
   preceding AMEs. Comment. 

6. After fitting the linear probability and logit models of the preceding 
   question, generate the predicted values using the predict 
   postestimation command. Show that the average difference between 
   the logit predictions and linear probability predictions is close to zero 
   and that their correlation with each other is quite high. Check whether 
   any of the fitted probabilities are outside the (0, 1) interval. Comment. 
   Use the twoway scatter graphing command to plot a graph of the OLS 
   predictions on the logit predictions. What do you find? Comment. 

7. Using as a template the NLS approach to fitting the probit model 
   presented in section 10.6.2, fit a logit model. This requires substituting 
   the c.d.f. logistic in place of the c.d.f. normal. Compare the 
   estimated coefficients with those from the logit command. Would you 
   expect the AME from this NLS regression to differ significantly from 
   those for probit regression? Why or why not? 

8. Suppose you are given a bivariate sample $(y_i, x_i)$ of two positive- 
   valued observations. Two functional forms are suggested for a 
   regression model: 1) $\ln y = \beta_0 + \beta_1 \ln x + \varepsilon_1$, 
   and 2) a nonlinear regression $y = \alpha_0 x^{\alpha_1} + \varepsilon_2$. 
   Suppose that neither is preferred a priori and that a goodness-of-fit 
   criterion carries high weight in the final selection. 
   Assume that the errors are draws from normal zero-mean 
   homoskedastic distributions. Use suitable values of the parameters to 
   generate two samples, based on model 1 and model 2, respectively. 
   Each model is treated as an approximation when the other model has 
   generated the data. Use OLS to fit model 1 and NLS to fit model 2. 
   Evaluate the performance of each misspecified ("wrong") model in 
   terms of goodness of fit or within-sample prediction. 


Chapter 11 
Tests of hypotheses and model specification 


11.1 Introduction 


Econometric modeling is composed of a cycle of initial model specification, 
estimation, diagnostic checks, and model respecification. The diagnostic 
checks are often based on hypothesis tests for the statistical significance of 
key variables and model-specification tests. This chapter presents additional 
details on hypothesis tests, associated confidence intervals, and model- 
specification tests that are used widely throughout the book. 


The emphasis is on Wald hypothesis tests and confidence intervals, the 
most commonly used inference methods in microeconometrics. These 
produce the standard regression output and can also be obtained by using 
the test, testnl, lincom, and nlcom commands. We also present the other 
two classical testing methods, likelihood-ratio (LR) and Lagrange multiplier 
(LM) tests. 


We then present familywise error rates (FWER) and false discovery rates 
(FDRs) when multiple tests are performed, such as testing for statistical 
significance in each of several subgroups or for the impact of a key 
regressor on each of several outcomes. Failure to adjust test size in such 
cases leads to great understatement of true test size and hence false 
discoveries of statistical significance. 


In discussing results in this book, we have in many places stated 
whether a regressor or effect is statistically significant at 5%. One should 
realize that we have done this for brevity and convenience. In a real 
empirical study, one should instead emphasize the actual size of the effect 
(the “economic” significance) with a measure of uncertainty such as a 
standard error or confidence interval. For discussion of the weaknesses of 
considering only statistical significance and p-values, see Wasserstein and 
Lazar (2016) and the many articles in Wasserstein, Schirm, and 
Lazar (2019). 


When statistical significance is reported, there is growing concern 
among statisticians that p-values reported in published studies may 
understate the true p-value. In practice, considerable pretesting may occur 


before final model specification, and standard inference methods fail to 
consider the effect of this pretesting on size and power of any ultimate test. 
While the effects of such data mining can be minimized by relying on 
economic theory and previous studies in determining model specification, 
or by using a different sample from the final estimation sample, this is not 
always done. Furthermore, p-hacking and publication bias can lead to 
disproportionately favoring statistically significant results. For example, 
meta-analyses find a bunching of p-values from published studies in the 

in leading economics journals and find this bunching is more prevalent in 
instrumental variables (IV) and difference-in-differences studies than in 
randomized control trials and regression discontinuity design studies and in 
all these cases is less prevalent than in some other social sciences. Methods 
to compute p-values that control for multiple testing are presented in 
section 11.6. 


Economic studies typically report test size with no consideration of test 
power. More recently, attention has turned to test power. One reason is that 
laboratory and field experiments have become more common, and these 
experiments need to be designed to have reasonable power, the standard 
threshold being that a test of size 0.05 should have power of at least 0.8 
against an alternative hypothesis of a desired effect size. 


We give considerable discussion of test size and power. Monte Carlo 
methods for obtaining test size and power are presented. The power 
onemean command can be adapted to regression settings to calculate test 
power for given effect size, minimum effect size for desired power, and 
minimum sample size for desired power. 
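
As a rough sketch of the three calculations just listed, consider a 
hypothetical one-sample setting with null mean 0, alternative mean 0.2, 
and standard deviation 1. These values are chosen only for illustration and 
do not come from the data used in this chapter. 

```stata
* Minimum sample size for power 0.8 in a two-sided test of size 0.05
* (hypothetical null mean 0, alternative mean 0.2, standard deviation 1)
power onemean 0 0.2, sd(1) alpha(0.05) power(0.8)

* Power for a given effect size and a given sample size
power onemean 0 0.2, sd(1) alpha(0.05) n(100)

* Minimum detectable effect size for a given sample size and power
power onemean 0, sd(1) alpha(0.05) n(100) power(0.8)
```

Each call leaves exactly one of sample size, power, and effect size 
unspecified, and power onemean solves for the omitted quantity. 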


The chapter then presents a brief general discussion of model- 
specification tests, including information matrix (IM) tests, goodness-of-fit 
tests, Hausman tests, and tests of overidentifying restrictions, that are 
applied in various chapters. Model-selection tests for nonnested nonlinear 
models are not covered, though some of the methods given in chapter 3 for 
linear models can be extended to nonlinear models, and a brief discussion 
for likelihood-based nonlinear models is given in section 13.8.2. The 
chapter concludes with a discussion of permutation and randomization tests. 


This chapter uses Poisson regression, which will be discussed in more 
detail in section 13.2, as the leading example, so that we cover both linear 
and nonlinear models. In most places, the methods for ordinary least 
squares (OLS) are the same; simply change poisson to regress. 


11.2 Critical values and p-values 


Before discussing Stata estimation and testing commands and associated 
output, we discuss how critical values and p-values are computed. 


Introductory econometrics courses often emphasize use of the $t(n)$ and 
$F(h, n)$ distributions for hypothesis testing, where $n$ is the degrees of 
freedom and $h$ is the number of restrictions. For cross-sectional analysis, 
often $n = N - K$, where $N$ is the sample size and $K$ is the number of 
regressors. For clustered data, Stata sets $n = G - 1$, where $G$ is the 
number of clusters. 


These distributions hold exactly only in the very special case of tests of 
linear restrictions for the OLS estimator in the linear regression model with 
independent normal homoskedastic errors. Instead, virtually all inference in 
microeconometrics is based on asymptotic theory. This is the case not only 
for nonlinear estimators but also for linear estimators, such as OLS and IV, 
with robust standard errors. Then test statistics are asymptotically standard 
normal, $Z$, distributed rather than $t(n)$ and chi-squared, $\chi^2(h)$, 
distributed rather than $F(h, n)$. 


11.2.1 Standard normal compared with Student’s t 


The change from $t(n)$ to standard normal distributions is relatively minor, 
unless $n$ is small, say, less than 30. The two distributions are identical 
for $n \to \infty$. The $t$ distribution has fatter tails, leading to larger 
p-values and critical values than the standard normal at conventional levels 
of significance such as 0.05. 


For clustered data with $G$ clusters, studies find that it is much better to 
use the $t(G - 1)$ distribution rather than the standard normal. When there 
are few clusters, a not unusual occurrence, the consequent p-values and 
critical values can differ substantially. 
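
To see how much the choice can matter, the following sketch compares 
two-sided 5% critical values from $t(G - 1)$ with the standard normal value. 
The cluster counts 6, 11, and 31 are hypothetical values picked for 
illustration. 

```stata
* Two-sided 5% critical values: t(G-1) versus standard normal
* (the cluster counts G are hypothetical)
foreach G of numlist 6 11 31 {
    scalar c_t = invttail(`G' - 1, 0.025)
    scalar c_z = invnormal(0.975)
    display "G = " %2.0f `G' "   t(G-1): " %6.3f c_t "   N(0,1): " %6.3f c_z
}
```

The $t(G - 1)$ critical value always exceeds the normal value, and the gap 
is largest when there are few clusters. 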


11.2.2 Chi-squared compared with F 


Many tests of joint hypotheses use the $\chi^2$ distribution. A $\chi^2(h)$ 
random variable has a mean of $h$ and a variance of $2h$, and for $h > 7$, 
the 5% critical value lies between $h$ and $2h$. 
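
A quick check of this rule of thumb, as a sketch using the built-in 
invchi2tail() function with a few illustrative values of $h$: 

```stata
* 5% critical values of chi-squared(h) compared with h and 2h
foreach h of numlist 5 7 8 10 20 {
    scalar cv = invchi2tail(`h', 0.05)
    display "h = " %2.0f `h' "   critical value = " %7.3f cv "   2h = " %2.0f 2*`h'
}
```

For $h = 5$ and $h = 7$, the critical value is slightly above $2h$; from 
$h = 8$ onward, it falls below $2h$. 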


The $\chi^2(h)$ distribution is scaled quite differently from the $F$. As 
the denominator degrees of freedom of the $F$ goes to infinity, we have 


$$F(h, n) \to \frac{\chi^2(h)}{h} \quad \text{as } n \to \infty \tag{11.1}$$ 


Thus, if asymptotic theory leads to a test statistic that is $\chi^2(h)$ 
distributed, then division of this statistic by $h$ leads to a statistic 
that is approximately $F(h, n)$ distributed if $n$ is large. In finite 
samples, the $F(h, n)$ distribution has fatter tails than $\chi^2(h)/h$, 
leading to larger p-values and critical values for the $F$ compared with 
the $\chi^2$. 


11.2.3 Plotting densities 


We compare the density of a $\chi^2(5)$ random variable with a random 
variable that is 5 times an $F(5, 30)$ random variable. From (11.1), the two 
are the same for large $n$ but will differ for $n = 30$. In practice, 
$n = 30$ is not large enough for asymptotic theory to approximate the 
finite-sample distribution well. 


One way to compare is to evaluate the formulas for the respective 
densities at a range of points, say, 0.1, 0.2, ..., 20.0, and graph the density 
values against the evaluation points. The graph twoway function command 
automates this method. This is left as an exercise. 


This approach requires providing density formulas that can be quite 
complicated and may even be unknown to the user if the density is that of a 
mixture distribution, for example. A simpler way is to make many draws 
from the respective distributions, using the methods of section 5.2, and then 
use the kdensity command to compute and graph the kernel density 
estimate. 


We take this latter approach. We begin by taking 10,000 draws from each 
distribution. We use the rchi2() function; see section 5.2. 


. * Create many draws from chi(5) and 5*F(5,30) distributions 
. set seed 10101 

. qui set obs 10000 

. generate chi5 = rchi2(5) // Result xc ~ chisquared(5) 

. generate xfn = rchi2(5)/5 // Result numerator of F(5,30) 

. generate xfd = rchi2(30)/30 // Result denominator of F(5,30) 

. generate f5_30 = xfn/xfd // Result xf ~ F(5,30) 

. generate five_x_f5_30 = 5*f5_30 

. summarize chi5 five_x_f5_30 

    Variable        Obs       Mean   Std. dev.        Min        Max 
        chi5     10,000   4.994329   3.169312     .103002   33.19207 
five_x_f5_30     10,000   5.322791   3.790381    .0525943   31.24482 


For chi5, the average of 4.99 is close to the theoretical mean of 5, and the 
sample variance of $3.1693^2 = 10.04$ is close to the theoretical variance 
of 10. For five_x_f5_30, the sample variance of $3.79^2 = 14.36$ is much 
larger than that of chi5, reflecting the previously mentioned fatter tails. 


We then plot the kernel density estimates based on these draws. To 
improve graph readability, we plot the kernel density estimates only for 
draws less than 25, using a method already explained in section 3.2.8. To 
produce smoother plots, we increase the default bandwidth to 1.0, an 
alternative being to increase the number of draws. We have 


. * Plot the densities for these two distributions using kdensity 
. label var chi5 "chi(5)" 

. label var five_x_f5_30 "5*F(5,30)" 

. kdensity chi5, bw(1.0) generate(kx1 kd1) n(500) 

. kdensity five_x_f5_30, bw(1.0) generate(kx2 kd2) n(500) 

. qui drop if (chi5 > 25 | five_x_f5_30 > 25) 

. graph twoway (line kd1 kx1) (line kd2 kx2, clstyle(p3)) if kx1 < 25, 
> scale(1.2) plotregion(style(none)) 
> title("{&chi}{sup:2}(5) and 5*F(5,30) Densities") 
> xtitle("y", size(medlarge)) xscale(titlegap(*5)) 
> ytitle("Density f(y)", size(medlarge)) yscale(titlegap(*5)) 
> legend(pos(1) ring(0) col(1)) legend(size(small)) 
> legend(label(1 "{&chi}{sup:2}(5)") label(2 "5*F(5,30)")) 


In figure 11.1, the two densities appear similar, though the density of 
$5 \times F(5, 30)$ has a longer tail than that of $\chi^2(5)$, and it is 
the tails that are used for tests at a 0.05 level and for 95% confidence 
intervals. The difference disappears as the denominator degrees of freedom 
(here 30) goes to infinity. 


Figure 11.1. $\chi^2(5)$ density compared with 5 times $F(5, 30)$ density 


11.2.4 Computing p-values and critical values 


Stata output automatically provides p-values but not critical values. The 
p-values can be obtained manually from the relevant cumulative distribution 
function (c.d.f.), whereas critical values can be obtained by using the inverse 
c.d.f. The precise Stata functions vary with the distribution. For details, see 
[FN] Statistical functions or type help density functions. 


We compute p-values for the test of a single restriction ($h = 1$). We 
suppose the test statistic is equal to 2 and use the $t(30)$ or $Z$ 
distributions. In that case, it is equivalently equal to $2^2 = 4$, and we 
use the $F(1, 30)$ or $\chi^2(1)$ distribution. We have 


. * p-values for t(30), F(1,30), Z, and chi(1) at y = 2 
. scalar y = 2 

. scalar p_t30 = 2*ttail(30,y) 

. scalar p_f1and30 = Ftail(1,30,y^2) 

. scalar p_z = 2*(1-normal(y)) 

. scalar p_chi1 = chi2tail(1,y^2) 

. display "p-values" " t(30) =" %7.4f p_t30 " F(1,30)=" %7.4f 
> p_f1and30 " z =" %7.4f p_z " chi(1)=" %7.4f p_chi1 
p-values t(30) = 0.0546 F(1,30)= 0.0546 z = 0.0455 chi(1)= 0.0455 


The general properties that $Z^2 = \chi^2(1)$ and $t(n)^2 = F(1, n)$ are 
confirmed for this example. Also, $t(n) \to Z$ and $F(1, n)/1 \to \chi^2(1)$ 
as $n \to \infty$, but there is still a difference for $n = 30$, with a 
p-value of 0.0455 compared with 0.0546. 


We next compute critical values for these distributions for a two-sided 
test of a single restriction at a level of 0.05. We have 


. * Critical values for t(30), F(1,30), Z, and chi(1) at level 0.05 
. scalar alpha = 0.05 

. scalar c_t30 = invttail(30,alpha/2) 

. scalar c_f1and30 = invFtail(1,30,alpha) 

. scalar c_z = -invnormal(alpha/2) 

. scalar c_chi1 = invchi2(1,1-alpha) 

. display "critical values" " t(30) =" %7.3f c_t30 " F(1,30)=" %7.3f 
> c_f1and30 " z =" %7.3f c_z " chi(1)=" %7.3f c_chi1 
critical values t(30) = 2.042 F(1,30)= 4.171 z = 1.960 chi(1)= 3.841 


Again, $t(30)^2 = F(1, 30)$ and $Z^2 = \chi^2(1)$, whereas 
$t(30) \simeq Z$ and $F(1, 30)/1 \simeq \chi^2(1)$. 


11.2.5 Which distributions does Stata use? 


In practice, the $t$ and $F$ distributions may continue to be used as an ad 
hoc finite-sample correction, even when only asymptotic results supporting 
the $Z$ and $\chi^2$ distributions are available. This leads to more 
conservative inference, with less likelihood of rejecting the null hypothesis 
because p-values are larger and with wider confidence intervals because 
critical values are larger. 


For estimators of linear regression models, Stata uses the $t(N - K)$ and 
$F(q, N - K)$ distributions for independent observations and, in some cases, 
the $t(G - 1)$ and $F(q, G - 1)$ distributions for clustered data with $G$ 
clusters. For estimators of nonlinear regression models, Stata uses the $z$ 
and $\chi^2(q)$ distributions for both independent observations and 
clustered observations. 


For clustered data, it is better to use the $t(G - 1)$ and $F(q, G - 1)$ 
distributions. Studies find that even this leads to tests that overreject, 
though less so than if the $z$ and $\chi^2(q)$ distributions are used. 
$F$ tests can always be implemented using the df(#) option of the test and 
testparm postestimation commands. 


11.3 Wald tests and confidence intervals 


A quite universal method for hypothesis testing and obtaining confidence intervals 
is the Wald method, based on the estimated variance-covariance matrix of the 
estimator (VCE) presented in sections 3.4 and 13.4. This method produces the test 
statistics and p-values for a test of the significance of individual coefficients, the 
confidence intervals for individual coefficients, and the tests of overall significance 
that are given in Stata regression output. 


Here we provide background on the Wald test, extension to tests of more 
complicated hypotheses that require the use of the test and testnl commands, and 
extension to confidence intervals on combinations of parameters using the lincom 
and nlcom commands. 


11.3.1 Wald test of linear hypotheses 


By a linear hypothesis, we mean one that can be expressed as a linear combination 
of parameters. Single hypothesis examples include $H_0: \beta_2 = 0$ and 
$H_0: \beta_2 - \beta_3 - 5 = 0$. A joint hypothesis example tests the two 
preceding hypotheses simultaneously. 


The Wald test method is intuitively appealing. The test is based on how well the 
corresponding parameter estimates satisfy the null hypothesis. For example, to test 
$H_0: \beta_2 - \beta_3 - 5 = 0$, we ask whether 
$\hat\beta_2 - \hat\beta_3 - 5 \simeq 0$. To implement the test, we need to know 
the distribution of $\hat\beta_2 - \hat\beta_3 - 5$. But this is easy because the 
estimators used in this book are usually asymptotically normal, and a linear 
combination of normals is normal. 


We do need to find the variance of this normal distribution. In this example, 
$\mathrm{Var}(\hat\beta_2 - \hat\beta_3 - 5) = \mathrm{Var}(\hat\beta_2) + \mathrm{Var}(\hat\beta_3) - 2\mathrm{Cov}(\hat\beta_2, \hat\beta_3)$ 
because for the random variables $X$ and $Y$, 
$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\mathrm{Cov}(X, Y)$. 
More generally, it is helpful to use matrix notation, which we now introduce. 


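Let $\beta$ denote the $K \times 1$ parameter vector, where the results also apply 
if instead we use the more general notation $\theta$, which includes $\beta$ and any 
auxiliary parameters. Then, for example, $H_0: \beta_2 = 0$ and 
$\beta_2 - \beta_3 - 5 = 0$ can be written as 


$$\begin{bmatrix} 0 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & -1 & 0 & \cdots & 0 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{bmatrix} - \begin{bmatrix} 0 \\ 5 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$ 


This linear combination can be written as $R\beta - r = 0$. 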


For a two-sided test of $h$ linear hypotheses under $H_0$, we therefore test 


$$\begin{aligned} H_0&: R\beta - r = 0 \\ H_a&: R\beta - r \neq 0 \end{aligned}$$ 


where $R$ is an $h \times K$ matrix and $r$ is an $h \times 1$ vector, 
$h \le K$. Standard examples include tests of individual exclusion restrictions, 
$\beta_j = 0$, and tests of joint statistical significance, 
$\beta_2 = 0, \ldots, \beta_K = 0$ (with $\beta_1$ as an intercept coefficient). 


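The Wald test uses the quite intuitive approach of rejecting 
$H_0: R\beta - r = 0$ if $R\hat\beta - r$ is considerably different from 0. Now, 


$$\begin{aligned} \hat\beta &\overset{a}{\sim} N\{\beta, \mathrm{Var}(\hat\beta)\} \\ \Rightarrow R\hat\beta - r &\overset{a}{\sim} N\{R\beta - r, R\,\mathrm{Var}(\hat\beta)R'\} \\ \Rightarrow R\hat\beta - r &\overset{a}{\sim} N\{0, R\,\mathrm{Var}(\hat\beta)R'\} \text{ under } H_0 \end{aligned} \tag{11.2}$$ 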


For a single hypothesis, $R\hat\beta - r$ is a scalar that is univariate normally distributed, 
so we can transform to a standard normal variate and use standard normal tables. 


More generally, there are multiple hypotheses. To avoid using the multivariate 
normal distribution, we transform to a chi-squared distribution. If the 
$h \times 1$ vector $y \sim N(\mu, \Sigma)$, then 
$(y - \mu)'\Sigma^{-1}(y - \mu) \sim \chi^2(h)$. Applying this result to (11.2), 
we obtain the Wald statistic for the test of $H_0: R\beta - r = 0$: 


$$W = (R\hat\beta - r)'\{R\hat V(\hat\beta)R'\}^{-1}(R\hat\beta - r) \overset{a}{\sim} \chi^2(h) \text{ under } H_0 \tag{11.3}$$ 


Large values of $W$ lead to rejection of $H_0$. At a level of 0.05, for example, 
we reject $H_0$ if the p-value $p = \Pr\{\chi^2(h) > W\} < 0.05$ or if $W$ exceeds 
the critical value $c = \chi^2_{0.05}(h)$, where by $\chi^2_{0.05}(h)$ we mean the 
area in the right tail is 0.05. 


In going from (11.2) to (11.3), we also replaced $\mathrm{Var}(\hat\beta)$ by an 
estimate, $\hat V(\hat\beta)$. For the test to be valid, the estimate 
$\hat V(\hat\beta)$ must be consistent for $\mathrm{Var}(\hat\beta)$; that is, we 
need to use a correct estimator of the VCE. 


An alternative test statistic is the $F$ statistic, which is the Wald statistic 
divided by the number of restrictions. Then, 


$$F = \frac{W}{h} \overset{a}{\sim} F(h, N - K) \text{ under } H_0 \tag{11.4}$$ 


where $K$ denotes the number of parameters in the regression model. Large values 
of $F$ lead to rejection of $H_0$. At a level of 0.05, for example, we reject 
$H_0$ if the p-value $p = \Pr\{F(h, N - K) > F\} < 0.05$ or if $F$ exceeds the 
critical value $c = F_{0.05}(h, N - K)$. 


11.3.2 The test and testparm commands 


The Wald test can be performed by using the test command or the testparm 
command. 


The test command has several different syntaxes. The simplest two are 


test coeflist 
test exp = exp [= ...] 


The syntax is best explained with examples. More complicated syntax enables 
testing across equations in multiequation models. A multiequation example 
following the sureg command was given in section 6.8. An example following 
nbreg is given as an end-of-chapter exercise. 


In simple cases, the coeflist can be a list of regressor names, but in more 
complex cases, it needs to be a list of coefficient names. It can be difficult to know 
the Stata convention for naming coefficients in some cases. Using the estimation 
command option coeflegend or using the output from the postestimation command 
estat vce may give the appropriate complete names. 


In such situations, it may be easier to use the testparm command, which 
requires a list of variable names, rather than coefficient names, and has syntax 


testparm varlist [, options] 


The testparm command is especially useful when the estimation command uses 
factor-variable notation. 


Usually, the $W$ statistic in (11.3), based on the chi-squared distribution, is 
used, though the $F$ statistic in (11.4) is used after fitting linear models. 
However, when cluster-robust standard errors are used and there are few clusters, 
using the chi-squared distribution, or $N(0, 1)$ for a single hypothesis, leads to 
substantial overrejection; see section 6.4.6. At a minimum, one should then use 
the test command or testparm with the df(#) option, where # equals $G - 1$ and $G$ 
is the number of clusters. This implements an $F$ test that uses the 
$F(h, G - 1)$ distribution, where $h$ is the number of hypotheses. The wild 
cluster bootstrap for few clusters is presented in section 12.6. 
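
A minimal sketch of the df(#) option follows. It uses Stata's auto dataset 
purely for illustration and treats rep78 as a hypothetical cluster identifier, 
which yields only five clusters; neither the data nor the model comes from 
this chapter. 

```stata
* Illustration only: Poisson model with cluster-robust VCE and few clusters
sysuse auto, clear
poisson mpg weight foreign, vce(cluster rep78) nolog

* Default joint test uses the chi-squared(2) reference distribution
test weight foreign

* Same hypotheses using F(2, G-1), with G taken from e(N_clust)
local dfr = e(N_clust) - 1
test weight foreign, df(`dfr')
```

The second test reports an F statistic equal to the chi-squared statistic 
divided by the number of hypotheses, with a larger p-value because the 
$F(2, 4)$ distribution has fatter tails. 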


The other options of the test command are usually not needed. They 
include mtest to test each hypothesis separately if several hypotheses are given and 
accumulate to test hypotheses jointly with previously tested hypotheses. 


11.3.3 Data example 


We illustrate the Wald test, and subsequent tests, using the dataset from the 2002 
U.S. Medical Expenditure Panel Survey first used in chapter 10. We model the 
number of office-based physician visits (docvis) by persons aged 25-64 years. The 
regressors are restricted to health insurance status (private), health status 
(chronic), and socioeconomic characteristics (female and income) to keep Stata 
output short. 


We consider a nonlinear estimator, the Poisson quasi-MLE, which will be 
discussed in more detail in section 13.2.2. The coefficients are interpreted as 
semielasticities; see section 13.3.2. We have 


. * Fit Poisson model used throughout this chapter 
. qui use mus210mepsdocvisyoung, clear 

. qui keep if year02==1 

. poisson docvis private chronic female income, vce(robust) nolog 

Poisson regression                          Number of obs =   4,412 
                                            Wald chi2(4)  =  594.72 
                                            Prob > chi2   =  0.0000 
Log pseudolikelihood = -18503.549           Pseudo R2     =  0.1930 

                           Robust 
  docvis   Coefficient   std. err.       z    P>|z|   [95% conf. interval] 

 private     .7986652    .1090014     7.33   0.000    .5850263    1.012304 
 chronic     1.091865    .0559951    19.50   0.000    .9821167    1.201614 
  female     .4925481    .0585365     8.41   0.000    .3778187    .6072774 
  income      .003557    .0010825     3.29   0.001    .0014354    .0056787 
   _cons    -.2297262    .1108732    -2.07   0.038   -.4470338   -.0124186 


Test single coefficient 


To test whether a single coefficient equals zero, we just need to specify the 
regressor name. For example, to test $H_0: \beta_{female} = 0$, we have 


. * Test a single coefficient equal 0 
. test female 


( 1) [docvis]female = 0 
chi2( 1) = 70.80 
Prob > chi2 = 0.0000 


We reject $H_0$ because $p < 0.05$ and conclude that female is statistically 
significant at the level of 0.05. The test statistic is the square of the $z$ 
statistic given in the regression output ($8.414^2 = 70.80$), and the p-values 
are the same. 


Test several hypotheses 


As an example of testing more than one hypothesis, we test 
$H_0: \beta_{female} = 0$ and $\beta_{private} + \beta_{chronic} = 1$. Then, 


. * Test two hypotheses jointly using test 
. test (female) (private + chronic = 1) 


( 1) [docvis]female = 0 
( 2) [docvis]private + [docvis]chronic = 1 
chi2( 2) = 122.29 
Prob > chi2 = 0.0000 


We reject Ho because p < 0.05. 


The mtest option additionally tests each hypothesis in isolation. We have 


. * Test each hypothesis in isolation as well as jointly 
. test (female) (private + chronic = 1), mtest 


( 1) [docvis]female = 0 
( 2) [docvis]private + [docvis]chronic = 1 


          chi2     df   p > chi2 
 (1)     70.80      1    0.0000* 
 (2)     56.53      1    0.0000* 
 All    122.29      2    0.0000 

* Unadjusted p-values 


As expected, the hypothesis test value of 70.80 for female equals that given earlier 
when the hypothesis was tested in isolation. 


The preceding test makes no adjustment to p-values to account for multiple 
testing. Options to mtest include several that implement Bonferroni’s method and 
variations. This extension is detailed in section 11.6. 
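
The statistic for hypothesis (2) can also be reproduced from the variance 
formula of section 11.3.1. The following is a sketch only: it assumes the same 
dataset and sample selection as above (including that the 2002 sample is 
selected by year02==1), and the scalar names lc_est, lc_var, and lc_W are 
hypothetical. Because the hypothesis involves a sum rather than a difference, 
the covariance term enters with a plus sign. 

```stata
* Sketch: Wald statistic for H0: b_private + b_chronic = 1 computed by hand
qui use mus210mepsdocvisyoung, clear
qui keep if year02==1
qui poisson docvis private chronic female income, vce(robust)
matrix V = e(V)

* Estimated linear combination minus its hypothesized value
scalar lc_est = _b[private] + _b[chronic] - 1

* Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y); private and chronic are the
* first and second coefficients in e(b)
scalar lc_var = V[1,1] + V[2,2] + 2*V[1,2]

* Single-restriction Wald statistic, chi-squared(1) under H0
scalar lc_W = lc_est^2/lc_var
display "W = " %7.2f lc_W "   p-value = " %6.4f chi2tail(1, lc_W)
```

If the setup matches, the displayed W should agree with the value 56.53 
reported for hypothesis (2) in the mtest output. 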


Test of overall significance 


The test command can be used to test overall significance. We have 


. * Wald test of overall significance 
. test private chronic female income 


( 1) [docvis]private = 0 
( 2) [docvis]chronic = 0 
( 3) [docvis]female = 0 
( 4) [docvis]income = 0 


chi2( 4) = 594.72 
Prob > chi2 = 0.0000 


The Wald test statistic value of 594.72 is the same as that given in the poisson 
output. 


Test calculated from retrieved coefficients and VCE 


For pedagogical purposes, we compute this overall test manually even though we 
use test in practice. The computation requires retrieving $\hat\beta$ and 
$\hat V(\hat\beta)$, defining the appropriate matrices $R$ and $r$, and calculating 
$W$ defined in (11.3). In doing so, we note that Stata stores regression 
coefficients as a row vector, so we need to transpose to get the $K \times 1$ 
column vector $\hat\beta$. Because we use Stata estimates of $\hat\beta$ and 
$\hat V(\hat\beta)$, in defining $R$ and $r$, we need to follow the Stata 
convention of placing the intercept coefficient as the last coefficient. We have 


. * Manually compute overall test of significance using the formula for W 
. qui poisson docvis private chronic female income, vce(robust) 

. matrix b = e(b)' 

. matrix V = e(V) 

. matrix R = (1,0,0,0,0 \ 0,1,0,0,0 \ 0,0,1,0,0 \ 0,0,0,1,0) 

. matrix r = (0 \ 0 \ 0 \ 0) 

. matrix W = (R*b-r)'*invsym(R*V*R')*(R*b-r) 

. scalar Wald = W[1,1] 

. scalar h = rowsof(R) 

. display "Wald test statistic: " Wald " with p-value: " chi2tail(h,Wald) 
Wald test statistic: 594.72457 with p-value: 2.15e-127 


The value of 594.72 is the same as that from the test command. 


11.3.4 One-sided Wald tests 


The preceding tests are two-sided tests, such as $\beta_j = 0$ against 
$\beta_j \neq 0$. We now consider one-sided tests of a single hypothesis, such as 
a test of whether $\beta_j > 0$. 


The first step in conducting a one-sided test is determining which side is $H_0$ 
and which side is $H_a$. The convention is that the claim made is set as the 
alternative hypothesis. For example, if the claim is made that the $j$th regressor 
has a positive marginal effect and this means that $\beta_j > 0$, then we test 
$H_0: \beta_j \le 0$ against $H_a: \beta_j > 0$. 


The second step is to obtain a test statistic. For tests on a single regressor, we 
use the $z$ statistic 


$$z = \frac{\hat\beta_j}{s_{\hat\beta_j}} \overset{a}{\sim} N(0, 1) \text{ under } H_0$$ 


where $z^2 = W$ given in (11.3). In some cases, the $t(N - K)$ distribution is 
used, in which case the $z$ statistic is called a $t$ statistic. Regression output 
gives this statistic, along with p-values for two-sided tests. For a one-sided 
test, these p-values should be halved, with the important condition that it is 
necessary to check that $\hat\beta_j$ has the correct sign. For example, if testing 
$H_0: \beta_j \le 0$ against $H_a: \beta_j > 0$, then we reject $H_0$ at the level 
of 0.05 if $\hat\beta_j > 0$ and the reported two-sided p-value is less than 
0.10. If instead $\hat\beta_j < 0$, the p-value for a one-sided test must be at 
least 0.50 because we are on the wrong side of 0, leading to certain inability to 
reject at conventional statistical significance levels. 


As an example, consider a test of the claim that doctor visits increase with
income, even after controlling for private insurance, chronic conditions, and gender.
The appropriate test of this claim is one of $H_0\colon \beta_{income} \leq 0$ against
$H_a\colon \beta_{income} > 0$. The poisson output includes
$\widehat{\beta}_{income} = 0.0036$ with $p = 0.001$ for a two-sided test. Because
$\widehat{\beta}_{income} > 0$, we simply halve the two-sided test p-value to get
$p = 0.001/2 = 0.0005 < 0.05$. So we reject $H_0\colon \beta_{income} \leq 0$ at the
0.05 level.


More generally, suppose we want to test the single hypothesis
$H_0\colon R\beta - r \leq 0$ against $H_a\colon R\beta - r > 0$, where here
$R\beta - r$ is a scalar. Then we use


$$ z = \frac{R\widehat{\beta} - r}{s_{R\widehat{\beta} - r}} \overset{a}{\sim} N(0,1) \text{ under } H_0 $$


When squared, this statistic again equals the corresponding Wald test; that is,
$z^2 = W$. The test command gives $W$, but $z$ could be either $\sqrt{W}$ or
$-\sqrt{W}$, and the sign of $z$ is needed to perform the one-sided test. To obtain
the sign, we can compute $R\widehat{\beta} - r$ by using the lincom command; see
section 11.3.10. If $R\widehat{\beta} - r$ has a sign that differs from that of
$R\beta - r$ under $H_0$, then the p-value is one half of the two-sided p-value given
by test (or by lincom); we reject $H_0$ at the $\alpha$ level if this adjusted p-value
is less than $\alpha$ and do not reject otherwise. If instead $R\widehat{\beta} - r$
has the same sign as that of $R\beta - r$ under $H_0$, then we never reject $H_0$.
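The halving rule can be cross-checked outside Stata with a few lines of Python. This is only a sketch: the function name one_sided_p and its arguments are our own illustrative choices, not Stata commands, and the inputs are the rounded values quoted in the income example above.

```python
def one_sided_p(estimate, two_sided_p, alternative="greater"):
    """Convert a two-sided p-value into a one-sided p-value.

    alternative="greater": H0: theta <= 0 against Ha: theta > 0.
    alternative="less":    H0: theta >= 0 against Ha: theta < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    if alternative == "greater":
        on_claimed_side = estimate > 0
    else:
        on_claimed_side = estimate < 0
    # Estimate on the side claimed by Ha: halve the two-sided p-value.
    # Estimate on the wrong side: the one-sided p-value is at least 0.5.
    if on_claimed_side:
        return two_sided_p / 2
    return 1 - two_sided_p / 2


# Income example: estimate 0.0036 with two-sided p = 0.001
p_income = one_sided_p(0.0036, 0.001)
# Same magnitude but wrong sign (hypothetical), for comparison
p_wrong_sign = one_sided_p(-0.0036, 0.001)
```

With the estimate on the claimed side, the one-sided p-value is 0.0005; with the sign reversed, it is 0.9995, so the null is not rejected.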


11.3.5 Wald test of nonlinear hypotheses (delta method) 


Not all hypotheses are linear combinations of parameters. A nonlinear hypothesis
example is a test of $H_0\colon \beta_2/\beta_3 = 1$ against
$H_a\colon \beta_2/\beta_3 \neq 1$. This can be expressed as a test of
$g(\beta) = 0$, where $g(\beta) = \beta_2/\beta_3 - 1$. More generally, there can be
$h$ hypotheses combined into the $h \times 1$ vector $g(\beta) = 0$, with each
separate hypothesis being a separate row in $g(\beta)$. Linear hypotheses are the
special case of $g(\beta) = R\beta - r$.


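The Wald test method is now based on the closeness of $g(\widehat{\beta})$ to 0.
Because $\widehat{\beta}$ is asymptotically normal, so too is $g(\widehat{\beta})$.
Some algebra that includes linearization of $g(\widehat{\beta})$ using a Taylor-series
expansion yields the Wald test statistic for the nonlinear hypotheses
$H_0\colon g(\beta) = 0$:


$$ W = g(\widehat{\beta})' \left\{ \widehat{R}\, \widehat{V}(\widehat{\beta})\, \widehat{R}' \right\}^{-1} g(\widehat{\beta}) \overset{a}{\sim} \chi^2(h) \text{ under } H_0, \quad \text{where } \widehat{R} = \left. \frac{\partial g(\beta)}{\partial \beta'} \right|_{\widehat{\beta}} $$


This is the same test statistic as $W$ in (11.3) upon replacement of
$R\widehat{\beta} - r$ by $g(\widehat{\beta})$ and replacement of $R$ by
$\widehat{R}$. Again, large values of $W$ lead to rejection of $H_0$, and
$p = \Pr\{\chi^2(h) > W\}$.


The test statistic is often called one based on the delta method because of the
derivative used to form $\widehat{R}$.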


11.3.6 The testnl command 


The Wald test for nonlinear hypotheses is performed using the testnl command.
The basic syntax is


    testnl exp = exp [= exp ...] [, options]


The main option is mtest to separately test each hypothesis in a joint test.


As an example, we consider a test of
$H_0\colon \beta_{female}/\beta_{private} - 1 = 0$ against
$H_a\colon \beta_{female}/\beta_{private} - 1 \neq 0$. Then,


. * Test a nonlinear hypothesis using testnl 
. testnl _b[female]/_b[private] = 1 


  (1)  _b[female]/_b[private] = 1

               chi2(1) =       13.51
           Prob > chi2 =        0.0002


We reject $H_0$ at the 0.05 level because $p < 0.05$.


The hypothesis in the preceding example can be equivalently expressed as
$\beta_{female} = \beta_{private}$. So a simpler test is


. * Wald test is not invariant 
. test female = private 


 ( 1)  - [docvis]private + [docvis]female = 0

           chi2(  1) =    6.85
         Prob > chi2 =    0.0088


Surprisingly, we get different values for the test statistic and p-value, even though
both methods are valid and are asymptotically equivalent. This illustrates a
weakness of Wald tests: in finite samples, they are not invariant to nonlinear
transformations of the null hypothesis. With one representation of the null
hypothesis, we might reject $H_0$ at the $\alpha$ level, whereas with a different
representation we might not. LR and LM tests do not have this weakness.


11.3.7 Forward and backward selection based on statistical significance 


Forward selection, a specific-to-general approach, starts with the simplest model, an 
intercept-only model, and sequentially adds the most highly statistically significant 
regressor, provided this regressor is statistically significant at the prespecified 
significance level. 


The stepwise prefix with the pe() option implements forward selection.


As an example, suppose we consider the preceding model with additional 
regressors firmsize, msa, and injury. Testing at level 0.05, we obtain 


. * Stepwise forward selection using statistical significance at 5%
. stepwise, pe(.05): poisson docvis private chronic female income
>     firmsize msa injury, vce(robust)
Wald test, begin with empty model:
p = 0.0000 <  0.0500, adding chronic
p = 0.0000 <  0.0500, adding injury
p = 0.0000 <  0.0500, adding private
p = 0.0000 <  0.0500, adding female
p = 0.0001 <  0.0500, adding income

Poisson regression                              Number of obs =  4,412
                                                Wald chi2(5)  = 857.27
                                                Prob > chi2   = 0.0000
Log pseudolikelihood = -17503.945               Pseudo R2     = 0.2366

------------------------------------------------------------------------------
             |               Robust
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     chronic |   .9663314   .0566569    17.06   0.000     .8552858    1.077377
      injury |   .7503917   .0650373    11.54   0.000     .6229209    .8778626
     private |    .791386    .106635     7.42   0.000     .5823854    1.000387
      female |    .528642   .0567367     9.32   0.000       .41744    .6398439
      income |   .0042092    .001076     3.91   0.000     .0021003    .0063182
       _cons |  -.4054209   .1079598    -3.76   0.000    -.6170182   -.1938236
------------------------------------------------------------------------------


The preferred model includes injury but not firmsize and msa. 


Backward selection, a general-to-specific approach, starts with the most general 
model and sequentially drops the least statistically significant regressor, provided 
this regressor is statistically insignificant at the prespecified significance level. 


The pr() option of the stepwise prefix implements backward selection. The
option hierarchical of the stepwise prefix implements forward or backward
selection in order of the specified regressors.


11.3.8 Pretest bias 


The inclusion of variables on the basis of statistical significance leads to so-called 
pretest bias of the OLS estimator. 


Suppose the true model is $y = \beta_1 + \beta_2 x + u$, where $\beta_2 \neq 0$, $x$
is nonstochastic, and $u \sim N(0, \sigma^2)$. Then, given a random sample of size
$N$, the OLS estimator from regression of $y$ on an intercept and $x$ is unbiased, so
$E(\widehat{\beta}_2) = \beta_2$.


Suppose instead we first test for statistical significance of $x$ and include $x$ in
the model only if $x$ is statistically significant at level 0.05, that is, if
$|t| = |\widehat{\beta}_2 / s_{\widehat{\beta}_2}| > t_{.025}(N-2)$. Then this
pretest estimator $\widetilde{\beta}_2$ equals $\widehat{\beta}_2$ with probability
less than one and equals zero with probability greater than zero. It follows that
$\widetilde{\beta}_2$ is biased for $\beta_2$, with bias toward zero.


Going the other way, if $\beta_2 = 0$, then 5% of the time $x$ will be erroneously
included as a regressor. This problem is compounded when many potential
regressors are considered, and test p-values should be appropriately adjusted; see
section 11.6 on multiple testing.
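The bias toward zero of the pretest estimator can be illustrated with a small Monte Carlo sketch. The following Python code is not part of the Stata analysis; the design (32 observations, regressor on an evenly spaced grid, $\beta_1 = 0.5$, $\beta_2 = 1$, $\sigma = 1$, 5,000 replications, and the seed) is a hypothetical choice made only for illustration, and the critical value 2.042 is the tabulated $t_{.025}(30)$.

```python
import random


def ols_slope_and_t(x, y):
    """Return the OLS slope and its t statistic from a regression of y on 1 and x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b1 = ybar - b2 * xbar
    rss = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    se_b2 = (rss / (n - 2) / sxx) ** 0.5
    return b2, b2 / se_b2


def simulate(beta2=1.0, n=32, reps=5000, seed=12345):
    """Average the OLS slope and the pretest slope over many samples."""
    rng = random.Random(seed)
    x = [i / n for i in range(1, n + 1)]   # nonstochastic regressor
    t_crit = 2.042                          # t_.025(30), valid for n = 32 only
    ols_sum = 0.0
    pretest_sum = 0.0
    for _ in range(reps):
        y = [0.5 + beta2 * xi + rng.gauss(0.0, 1.0) for xi in x]
        b2, t = ols_slope_and_t(x, y)
        ols_sum += b2
        # Pretest estimator: keep b2 only if x is significant at 5%
        pretest_sum += b2 if abs(t) > t_crit else 0.0
    return ols_sum / reps, pretest_sum / reps


ols_mean, pretest_mean = simulate()
print("mean OLS slope:    ", round(ols_mean, 3))
print("mean pretest slope:", round(pretest_mean, 3))
```

In this design, the average OLS slope should be close to the true value of 1, whereas the average pretest slope should fall well below it, because samples in which $x$ is insignificant contribute a zero.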


To avoid such complications, microeconometrics studies commonly minimize 
pretesting. Instead, models include many regressors, suggested by economic theory 
or previous studies, or both, regardless of whether they are statistically significant. 


Kozbur (2020) proposes a testing-based method for sequentially selecting 
regressors until a stopping rule is reached. Machine learning methods such as the 
lasso include as regressors those variables that best predict y, and some methods do 
so in ways that avoid the complication of pretest bias; see section 28.8. 


11.3.9 Wald confidence intervals 


Stata output provides Wald confidence intervals for individual regression
parameters $\beta_j$ of the form
$\widehat{\beta}_j \pm z_{\alpha/2} \times s_{\widehat{\beta}_j}$, where
$z_{\alpha/2}$ is a standard normal critical value. For some linear-model commands,
the critical value is from the $t$ distribution rather than the standard normal. The
default is a 95% confidence interval, which is
$\widehat{\beta}_j \pm 1.96 \times s_{\widehat{\beta}_j}$ if standard normal critical
values (with $\alpha = 0.05$) are used. This default can be changed in Stata
estimation commands by using the level() option, or it can be changed globally by
using the set level command.


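Now consider any scalar, say, $\gamma$, that is a function $g(\cdot)$ of $\beta$.
Examples include $\gamma = \beta_2$, $\gamma = \beta_2 + \beta_3$, and
$\gamma = \beta_2/\beta_3$. A Wald $100(1-\alpha)\%$ confidence interval for $\gamma$
is


$$ \widehat{\gamma} \pm z_{\alpha/2} \times s_{\widehat{\gamma}} \tag{11.6} $$


where $\widehat{\gamma} = g(\widehat{\beta})$, by $z_{\alpha/2}$ we mean that the area
in the right tail is $\alpha/2$, and $s_{\widehat{\gamma}}$ is the standard error of
$\widehat{\gamma}$. For the nonlinear estimator $\widehat{\beta}$, the critical value
$z_{\alpha/2}$ is usually used, and for the linear estimator, the critical value
$t_{\alpha/2}$ is usually used. Implementation requires computation of
$\widehat{\gamma}$ and $s_{\widehat{\gamma}}$, using (11.7) and (11.8) given below.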


11.3.10 The lincom command 


The lincom command calculates the confidence interval for a scalar linear
combination of the parameters $R\beta - r$. The syntax is


    lincom exp [, options]


The eform option reports exponentiated coefficients, standard errors, and
confidence intervals. This is explained in section 11.3.12.


The confidence interval is computed by using (11.6), with
$\widehat{\gamma} = R\widehat{\beta} - r$ and the squared standard error


$$ s^2_{\widehat{\gamma}} = R\, \widehat{V}(\widehat{\beta})\, R' \tag{11.7} $$


We consider a confidence interval for $\beta_{private} + \beta_{chronic} - 1$. We have


. * Confidence interval for linear combinations using lincom 
. qui use mus210mepsdocvisyoung, clear 


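. qui keep if year02==1
. qui poisson docvis private chronic female income if year02==1, vce(robust)
. lincom private + chronic - 1

 ( 1)  [docvis]private + [docvis]chronic = 1

------------------------------------------------------------------------------
      docvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   .8905303   .1184395     7.52   0.000     .6583932    1.122668
------------------------------------------------------------------------------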


The 95% confidence interval is [0.66, 1.12] and is based on standard normal critical
values because we used the lincom command after poisson. If instead it had
followed regress, then $t(N-K)$ critical values would have been used.


The lincom command also provides a test statistic and p-value for the two-sided
test of $H_0\colon \beta_{private} + \beta_{chronic} - 1 = 0$. Then
$z^2 = 7.52^2 \simeq (0.8905303/0.1184395)^2 = 56.53$, which equals the $W$ obtained
in section 11.3.3 in the example using test, mtest. The lincom command enables a
one-sided test because, unlike using $W$, we know the sign of $z$.


11.3.11 The nlcom command (delta method) 


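The nlcom command calculates the confidence intervals in (11.6) for a scalar
nonlinear function $g(\beta)$ of the parameters. The syntax is


    nlcom [name:] exp [, options]


The confidence interval is computed by using (11.6), with
$\widehat{\gamma} = g(\widehat{\beta})$ and the squared standard error


$$ s^2_{\widehat{\gamma}} = \left. \frac{\partial \gamma}{\partial \beta'} \right|_{\widehat{\beta}} \widehat{V}(\widehat{\beta}) \left. \frac{\partial \gamma}{\partial \beta} \right|_{\widehat{\beta}} \tag{11.8} $$


The standard error $s_{\widehat{\gamma}}$ and the resulting confidence interval are
said to be computed by the delta method because of the derivative
$\partial \gamma / \partial \beta$.


As an example, consider confidence intervals for
$\gamma = \beta_{female}/\beta_{private} - 1$. We have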


. * Confidence interval for nonlinear function of parameters using nlcom 
. nlcom _b[female] / _b[private] - 1 


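       _nl_1: _b[female] / _b[private] - 1

------------------------------------------------------------------------------
      docvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _nl_1 |   -.383286   .1042734    -3.68   0.000     -.587658   -.1789139
------------------------------------------------------------------------------


Note that $z^2 = (-3.68)^2 \simeq (-0.383286/0.1042734)^2 = 13.51$. This equals the
$W$ for the test of $H_0\colon \beta_{female}/\beta_{private} - 1 = 0$ obtained by
using the testnl command in section 11.3.6.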
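To make the delta-method calculation in (11.8) concrete, the following Python sketch computes the standard error of a ratio $\gamma = \beta_a/\beta_b - 1$ from the two coefficients and their variances and covariance. The function name and the numerical inputs in the first example are hypothetical (the relevant elements of the estimated VCE are not listed in the output above); only the final check uses the estimate and standard error reported by nlcom.

```python
def delta_method_ratio(b_num, b_den, var_num, var_den, cov):
    """Estimate and delta-method standard error of g = b_num / b_den - 1."""
    g_hat = b_num / b_den - 1.0
    # Gradient of g with respect to (b_num, b_den), evaluated at the estimates
    d_num = 1.0 / b_den
    d_den = -b_num / b_den ** 2
    # Quadratic form gradient' * V * gradient for the 2 x 2 block of the VCE
    variance = (d_num ** 2 * var_num
                + d_den ** 2 * var_den
                + 2.0 * d_num * d_den * cov)
    return g_hat, variance ** 0.5


# Hypothetical coefficients, variances, and covariance (illustration only)
g_hat, se_g = delta_method_ratio(0.49, 0.80, 0.0032, 0.0119, 0.0002)
wald = (g_hat / se_g) ** 2          # Wald statistic for H0: g = 0

# Check using the values reported by nlcom above: z squared equals W
w_reported = (-0.383286 / 0.1042734) ** 2
print(round(w_reported, 2))
```

The last line squares the ratio of the reported estimate to its reported standard error, which is the single-restriction Wald statistic.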


11.3.12 Asymmetric confidence intervals 


For several nonlinear models, such as those for binary outcomes and durations,
interest often lies in exponentiated coefficients that are given names such as hazard
ratio or odds ratio depending on the application. In these cases, we need a
confidence interval for $e^{\beta}$ rather than $\beta$. This can be done by using
either the lincom command with the eform option or the nlcom command. These methods
lead to different confidence intervals, with the former preferred.


We can directly obtain a 95% confidence interval for $\exp(\beta_{private})$, using
the lincom, eform command. We have


. * Confidence interval for exp(b) using lincom option eform 
. lincom private, eform 


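 ( 1)  [docvis]private = 0

------------------------------------------------------------------------------
      docvis |     exp(b)   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   2.222572   .2422636     7.33   0.000     1.795038    2.751935
------------------------------------------------------------------------------


This confidence interval is computed by first obtaining the usual 95% confidence
interval for $\beta_{private}$ and then exponentiating the lower and upper bounds of
the interval. We have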


. * Confidence interval for exp(b) using lincom followed by exponentiate 
. lincom private 


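 ( 1)  [docvis]private = 0

------------------------------------------------------------------------------
      docvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   .7986652   .1090014     7.33   0.000     .5850263    1.012304
------------------------------------------------------------------------------


Because $\beta_{private} \in [0.5850, 1.0123]$, it follows that
$\exp(\beta_{private}) \in [e^{0.5850}, e^{1.0123}]$, so
$\exp(\beta_{private}) \in [1.795, 2.752]$, which is the interval given by lincom,
eform.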


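If instead we use nlcom, we obtain


. * Confidence interval for exp(b) using nlcom
. nlcom exp(_b[private])

       _nl_1: exp(_b[private])

------------------------------------------------------------------------------
      docvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _nl_1 |   2.222572   .2422636     9.17   0.000     1.747744      2.6974
------------------------------------------------------------------------------


The interval is instead $\exp(\beta_{private}) \in [1.748, 2.697]$. This differs from
the [1.795, 2.752] interval obtained with lincom, and the difference between the two
methods can be much larger in other applications.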


Which interval should we use? The two are asymptotically equivalent but can 
differ considerably in small samples. The interval obtained by using nlcom is
symmetric about $\exp(\widehat{\beta}_{private})$ and could include negative values
(if $\exp(\widehat{\beta})$ is small relative to its delta-method standard error).
The interval obtained by using lincom, eform is asymmetric and,
necessarily, is always positive because of exponentiation. This is preferred. 
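Both intervals can be reproduced by hand from the coefficient and standard error reported by lincom private. The Python sketch below is a cross-check rather than a Stata feature; the variable names are ours, and small discrepancies in the last digit arise because the inputs are the rounded values shown in the output.

```python
from math import exp
from statistics import NormalDist

b_hat = 0.7986652    # coefficient of private from lincom
se_b = 0.1090014     # its robust standard error
z_crit = NormalDist().inv_cdf(0.975)   # standard normal critical value, about 1.96

# lincom, eform: exponentiate the endpoints of the interval for b
ci_eform = (exp(b_hat - z_crit * se_b), exp(b_hat + z_crit * se_b))

# nlcom: delta-method standard error of exp(b) is exp(b) * se(b),
# and the interval is symmetric about exp(b)
se_delta = exp(b_hat) * se_b
ci_delta = (exp(b_hat) - z_crit * se_delta, exp(b_hat) + z_crit * se_delta)

print("eform interval:", ci_eform)
print("delta interval:", ci_delta)
```

The first interval is positive by construction, whereas the lower bound of the second becomes negative whenever $z_{\alpha/2} \times s_{\widehat{\beta}}$ exceeds one.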


11.3.13 Confidence intervals from inverting a test statistic 


A confidence interval can be obtained by inverting a test. Specifically, to obtain a
95% confidence interval for $\beta$, we perform a two-sided test of
$\beta = \beta^*$ for a range of values of $\beta^*$. The confidence interval is then
those values of $\beta^*$ for which the test had $p > 0.05$ because the 95% confidence
interval includes those values that we do not reject at level 0.05.


To illustrate this method, we perform a series of Wald tests to obtain a 95%
Wald confidence interval for $\beta_{private}$, using a grid search over null
hypothesis values of $\beta^*$ ranging from $-2.0$ to $2.0$. We obtain


. * Confidence interval from inverting test statistic 
. qui poisson docvis private chronic female income, vce(robust) nolog 


. postfile cifromtest b2 pvalue using pvalues, replace 
(file pvalues.dta not found) 


. forvalues i = 1/1000 {
  2.   scalar b2 = (`i' - 500)/250
  3.   qui test _b[private] = b2
  4.   scalar p = r(p)
  5.   post cifromtest (b2) (p)
  6. }


. postclose cifromtest 
. use pvalues, clear 


. sum if pvalue > 0.05 


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          b2 |        107          .8     .124129       .588      1.012
      pvalue |        107    .3968495    .2846993    .050327   .9902298


The resulting 95% confidence interval for $\beta_{private}$ is [0.588, 1.012] because
tests that $\beta_{private} = \beta^*$ had $p > 0.05$ for $\beta^*$ between 0.588 and
1.012.


This confidence interval, aside from rounding to the grid spacing of 0.004, is the
same as [0.585, 1.012], the usual Wald confidence interval given earlier in the usual
output from the poisson command. There is no reason to use this method if a Wald
confidence interval is available.


The advantage of this more general approach is that it can be used in a wider 
range of settings. For example, if the only test available is an LR test, then the same 
procedure can be used to obtain an LR confidence interval. 
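The grid search can be mirrored outside Stata as a check on the logic. The Python sketch below assumes the coefficient and robust standard error of private reported earlier by lincom (0.7986652 and 0.1090014) and uses the fact that, for a single restriction, the Wald chi-squared p-value equals the two-sided standard normal p-value. The variable names are ours.

```python
from statistics import NormalDist

b_hat = 0.7986652     # estimated coefficient of private
se_b = 0.1090014      # its robust standard error
std_normal = NormalDist()

not_rejected = []
for i in range(1, 1001):
    b_star = (i - 500) / 250                 # same grid as the forvalues loop
    z = (b_hat - b_star) / se_b              # Wald z statistic for H0: b = b_star
    p = 2 * (1 - std_normal.cdf(abs(z)))     # two-sided p-value
    if p > 0.05:
        not_rejected.append(b_star)

# The confidence interval is the set of null values that are not rejected
ci = (min(not_rejected), max(not_rejected))
print(len(not_rejected), ci)
```

The retained grid points run from 0.588 to 1.012, matching the summary of pvalues.dta above; a finer grid would move the endpoints closer to the usual Wald interval.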


In more general cases, the confidence interval need not be contiguous. For
example, one can obtain confidence intervals such as
$(-\infty, 1.2] \cup [5.3, 8.7]$. Confidence intervals obtained using weak instruments
asymptotics are obtained by inversion of test statistics and thus can be composed of
disjoint intervals.


11.4 Likelihood-ratio tests 


An alternative to the Wald test is the LR test. This is applicable only to ML 
estimation, under the assumption that the density is correctly specified. 


11.4.1 LR test statistic 


Let $L(\theta) = f(y|X, \theta)$ denote the likelihood function, and consider testing
the $h$ hypotheses $H_0\colon g(\theta) = 0$. Distinguish between the usual
unrestricted maximum-likelihood estimator (MLE) $\widehat{\theta}_u$ and the
restricted MLE $\widetilde{\theta}_r$ that maximizes the log likelihood subject to the
restriction $g(\theta) = 0$.


The motivation for the LR test is that if $H_0$ is valid, then imposing the
restrictions in estimation of the parameters should make little difference to
the maximized value of the likelihood function. The LR test statistic is


$$ LR = -2 \left\{ \ln L(\widetilde{\theta}_r) - \ln L(\widehat{\theta}_u) \right\} \overset{a}{\sim} \chi^2(h) \text{ under } H_0 $$


At the 0.05 level, for example, we reject if $p = \Pr\{\chi^2(h) > LR\} < 0.05$ or,
equivalently, if $LR > \chi^2_{.05}(h)$, where by $\chi^2_{.05}(h)$ we mean that the
area in the right tail is 0.05. It is unusual to use an $F$ variant of this test.


The LR and Wald tests, under the conditions in which vce(oim) is
specified, are asymptotically equivalent under $H_0$ and local alternatives, so
there is no a priori reason to prefer one over the other.


Nonetheless, the LR test is preferred in fully parametric settings, in part 
because the LR test is invariant under nonlinear transformations, whereas the 
Wald test is not, as was demonstrated in section 11.3.6. 


Microeconometricians use Wald tests more often than LR tests because, 
wherever possible, fully parametric models are not used. For example, 
consider a linear regression with cross-sectional data. Assuming normal 
homoskedastic errors permits the use of an LR test. But the preference is to
relax this assumption, obtain a robust estimate of the VCE, and use this as the
basis for Wald tests.


The LR test requires fitting two models, whereas the Wald test requires 
fitting only the unrestricted model. And restricted ML estimation is not 
always possible. The Stata ML commands can generally be used with the
constraints() option, but this supports only linear restrictions on the
parameters.


Stata output for ML estimation commands uses LR tests in two situations: 
first, to perform tests on a key auxiliary parameter; and second, in the test for 
joint significance of regressors automatically provided as part of Stata 
output, if the default vce(oim) option is used.


We demonstrate this for negative binomial regression, a generalization of 
Poisson regression, for doctor visits by using default ML standard errors. We 
have 


. * LR tests output if estimate by ML with default estimate of VCE 
. qui use mus210mepsdocvisyoung, clear 


. qui keep if year02==1
. nbreg docvis private chronic female income, nolog

Negative binomial regression                    Number of obs =   4,412
                                                LR chi2(4)    = 1067.55
Dispersion: mean                                Prob > chi2   =  0.0000
Log likelihood = -9855.1389                     Pseudo R2     =  0.0514

------------------------------------------------------------------------------
      docvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .8876559   .0594232    14.94   0.000     .7711886    1.004123
     chronic |   1.143545   .0456778    25.04   0.000     1.054018    1.233071
      female |   .5613027   .0448022    12.53   0.000      .473492    .6491135
      income |   .0045785    .000805     5.69   0.000     .0030007    .0061563
       _cons |  -.4062135   .0611377    -6.64   0.000    -.5260411   -.2863858
-------------+----------------------------------------------------------------
    /lnalpha |   .5463093   .0289716                      .4895261    .6030925
-------------+----------------------------------------------------------------
       alpha |   1.726868     .05003                      1.631543    1.827762
------------------------------------------------------------------------------
LR test of alpha=0: chibar2(01) = 1.7e+04              Prob >= chibar2 = 0.000


Here the overall test for joint significance of the four coefficients, given
as LR chi2(4) = 1067.55, is an LR test.


The last line of output provides an LR test of $H_0\colon \alpha = 0$ against
$H_a\colon \alpha > 0$. Rejection of $H_0$ favors the more general negative binomial
model because the Poisson is the special case $\alpha = 0$. This LR test is
nonstandard because the null hypothesis is on the boundary of the parameter
space (the negative binomial model restricts $\alpha \geq 0$). In this case, the LR
statistic has a distribution that has a probability mass of 1/2 at 0 and a
half-$\chi^2(1)$ distribution above 0. This distribution is known as the chibar-0-1
distribution and is used to calculate the reported p-value of 0.000, which
strongly rejects the Poisson in favor of the negative binomial model.


More generally, an LR test of more than one hypothesis may have the null
hypothesis on the boundary of the parameter space. For example, this is the
case for tests of the joint statistical significance of variance components in
mixed models fit using the mixed and me commands. Then the distribution of
the LR test statistic is unknown. Stata computes p-values using the usual
$\chi^2(h)$ distribution. This leads to a conservative test, one less likely to reject
the null hypothesis, because the true p-value, while not precisely known, is
at least known to be smaller than that obtained using the $\chi^2(h)$ distribution.


11.4.2 The lrtest command


The lrtest command calculates an LR test of one model that is nested in
another when both are fit by using the same ML command. The syntax is


    lrtest modelspec1 [modelspec2] [, options]


where ML results from the two models have been saved previously by using
estimates store with the names modelspec1 and modelspec2. The order of
the two models does not matter. The variation lrtest modelspec1 requires
applying estimates store only to the model other than the most recently
fitted model.


We perform an LR test of $H_0\colon \beta_{private} = 0, \beta_{chronic} = 0$ by
fitting the unrestricted model with all regressors and then fitting the restricted
model with private and chronic excluded. We fit a negative binomial model
because this is a reasonable parametric model for these overdispersed count
data, whereas the Poisson was strongly rejected in the test of
$H_0\colon \alpha = 0$ in the previous section. We have


. * LR test using command lrtest 
. qui nbreg docvis private chronic female income 


. estimates store unrestrict 

. qui nbreg docvis female income 
. estimates store restrict 

. lrtest unrestrict restrict 


Likelihood-ratio test 
Assumption: restrict nested within unrestrict 


LR chi2(2) = 808.74
Prob > chi2 = 0.0000


The null hypothesis is strongly rejected because p = 0.000. We conclude 
that private and chronic should be included in the model. 
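The arithmetic behind lrtest is simple enough to sketch in Python. In the code below, the unrestricted log likelihood is the value reported by nbreg in section 11.4.1. The restricted log likelihood is not shown in the output, so the value used here is back-calculated from the reported LR statistic and should be read as an assumption. The helper chi2_tail is our own and uses the standard recurrence for the upper-tail probability of a chi-squared distribution with integer degrees of freedom.

```python
from math import erfc, exp, gamma, sqrt


def chi2_tail(x, df):
    """Pr{chi2(df) > x} for integer df >= 1 and x >= 0."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, exp(-y)           # df = 2: tail is exp(-x/2)
    else:
        a, q = 0.5, erfc(sqrt(y))     # df = 1: tail is erfc(sqrt(x/2))
    # Recurrence Q(a + 1, y) = Q(a, y) + y^a exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * exp(-y) / gamma(a + 1.0)
        a += 1.0
    return q


def lr_statistic(lnl_restricted, lnl_unrestricted):
    """LR = -2 {ln L(restricted) - ln L(unrestricted)}."""
    return -2.0 * (lnl_restricted - lnl_unrestricted)


lnl_u = -9855.1389     # reported by nbreg with all four regressors
lnl_r = -10259.5089    # assumption: implied by the reported LR chi2(2) = 808.74
lr = lr_statistic(lnl_r, lnl_u)
p_value = chi2_tail(lr, 2)
print(round(lr, 2), p_value)
```

The p-value is so small that Stata displays it as 0.0000.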


The same test can be performed with a Wald test. To be comparable with 
the LR test, we use default standard errors here, though in practice the Wald 
test is used following estimation with robust standard errors. 


. * Wald test of the same hypothesis and using default standard errors (like LR) 
. qui nbreg docvis private chronic female income 


. test chronic private 


 ( 1)  [docvis]chronic = 0
( 2) [docvis]private = 0 


chi2( 2) = 852.26 
Prob > chi2 = 0.0000 


The results differ somewhat, with test statistics of 809 and 852. The 
differences in these asymptotically equivalent tests can be considerably 
larger in other applications, especially those with few observations. 


11.4.3 Direct computation of LR tests 


The default is for the lrtest command to compute the LR test statistic only
in situations where it is clear that the LR test is appropriate. The command
will produce an error when, for example, the vce(robust) option is used or
when different estimation commands are used. The force option causes the
LR test statistic to be computed in such settings, with the onus on the user to
verify that the test is still appropriate.


As an example, we return to the LR test of Poisson against the negative
binomial model, automatically given after the nbreg command, as discussed
in section 11.4.1. To perform this test using the lrtest command, we need
the force option because two different estimation commands, poisson and
nbreg, are used. We have


. * LR test using option force 
. qui nbreg docvis private chronic female income 


. estimates store nb 

. qui poisson docvis private chronic female income 
. estimates store poiss 

. lrtest nb poiss, force 


Likelihood-ratio test 
Assumption: poiss nested within nb 


LR chi2(1) = 17296.82
Prob > chi2 = 0.0000

. display "Corrected p-value for LR-test = " r(p)/2
Corrected p-value for LR-test = 0


As expected, the LR statistic is the same as chibar2(01) = 1.7e+04, reported
in the last line of output from nbreg in section 11.4.1. The lrtest command
automatically computes p-values using $\chi^2(h)$, where $h$ is the difference in
the number of parameters in the two fitted models, here $\chi^2(1)$. As explained
in section 11.4.1, however, half-$\chi^2(1)$ should be used in this particular
example, providing a cautionary note for the use of the force option.


11.4.4 Tests at the boundary 


As noted in section 11.4.1, in some cases, the range of a parameter is 
restricted, and a test is performed of whether the parameter takes the value at 
its boundary. In particular, the parameter of interest may be a variance that is 
by definition restricted to be nonnegative, and we may want to test whether 
it equals zero. In such cases, the distribution of the LR statistic for tests of $q$
restrictions is no longer the $\chi^2(q)$ distribution.


For tests of a single restriction, the LR statistic has a half-$\chi^2(1)$
distribution, which Stata output labels chibar2(01). This distribution can be
used in the usual way for p < 0.50. 
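The chibar2(01) p-value is easy to compute directly, as the following Python sketch shows. The function names are ours, and the LR value in the example is hypothetical; the only facts used are that $\Pr\{\chi^2(1) > x\} = \mathrm{erfc}(\sqrt{x/2})$ and that the boundary-corrected p-value is half of this for a positive statistic.

```python
from math import erfc, sqrt


def chi2_1_tail(x):
    """Pr{chi2(1) > x} for x >= 0."""
    return erfc(sqrt(x / 2.0))


def chibar01_p(lr):
    """p-value under the 50:50 mixture of a point mass at 0 and chi2(1)."""
    if lr <= 0:
        return 1.0          # the statistic is at least 0 with probability one
    return 0.5 * chi2_1_tail(lr)


lr_example = 2.71           # hypothetical LR statistic for a single boundary restriction
p_naive = chi2_1_tail(lr_example)      # usual chi2(1) p-value, about 0.10
p_boundary = chibar01_p(lr_example)    # corrected p-value, about 0.05
print(round(p_naive, 4), round(p_boundary, 4))
```

A statistic that is insignificant at the 0.05 level under $\chi^2(1)$ can therefore be borderline significant once the boundary is taken into account.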


For $q > 1$, the actual distribution is quite complicated. Instead, the $\chi^2(q)$
distribution continues to be used. This provides a conservative test. For
example, if the reported $p = 0.07$, then the true $p < 0.07$. A common
example is a joint hypothesis test on the variances and covariances in a 
mixed model such as that in section 6.7.6. 


11.5 Lagrange multiplier test (or score test) 


The third major hypothesis testing method is a test method usually referred 
to as the score test by statisticians and as the LM test by econometricians. 
This test is less often used, aside from some leading model-specification 
tests in situations where the null hypothesis model is easy to fit but the 
alternative hypothesis model is not. 


11.5.1 LM tests 


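The unrestricted MLE $\widehat{\theta}_u$ sets $s(\widehat{\theta}_u) = 0$, where
$s(\theta) = \partial \ln L(\theta)/\partial \theta$ is called the score function. An
LM test, or score test, is based on closeness of $s(\widetilde{\theta}_r)$ to zero,
where evaluation is now at $\widetilde{\theta}_r$, the alternative restricted MLE
that maximizes $\ln L(\theta)$ subject to the $h$ restrictions $g(\theta) = 0$. The
motivation is that if the restrictions are supported by the data, then
$\widetilde{\theta}_r \simeq \widehat{\theta}_u$, so
$s(\widetilde{\theta}_r) \simeq s(\widehat{\theta}_u) = 0$.


Because $s(\widetilde{\theta}_r) \overset{a}{\sim} N[0, \mathrm{Var}\{s(\widetilde{\theta}_r)\}]$,
we form a quadratic form that is a chi-squared statistic, similar to the method in
section 11.3.1. This yields the LM test statistic, or score test statistic, for
$H_0\colon g(\theta) = 0$:


$$ LM = s(\widetilde{\theta}_r)' \left[ \widehat{\mathrm{Var}}\{ s(\widetilde{\theta}_r) \} \right]^{-1} s(\widetilde{\theta}_r) \overset{a}{\sim} \chi^2(h) \text{ under } H_0 $$


At the 0.05 level, for example, we reject if $p = \Pr\{\chi^2(h) > LM\} < 0.05$ or,
equivalently, if $LM > \chi^2_{.05}(h)$. It is not customary to use an $F$ variant of
this test.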


The preceding motivation explains the term "score test". The test is also
called the LM test for the following reason: Let $\ln L(\theta)$ be the log-likelihood
function in the unrestricted model. The restricted MLE $\widetilde{\theta}_r$
maximizes $\ln L(\theta)$ subject to $g(\theta) = 0$, so $\widetilde{\theta}_r$
maximizes $\ln L(\theta) - \lambda' g(\theta)$. An LM test is based on whether the
associated Lagrange multipliers (LMs) $\widetilde{\lambda}_r$ of this restricted
optimization are close to 0 because $\lambda = 0$ if the restrictions are valid. It
can be shown that $\widetilde{\lambda}_r$ is a full-rank matrix multiple of
$s(\widetilde{\theta}_r)$, so the LM and score tests are equivalent.


Under the conditions in which vce(oim) is specified, the LM test, LR test,
and Wald test are asymptotically equivalent for $H_0$ and local alternatives, so
there is no a priori reason to prefer one over the others. The attraction of the 
LM test is that, unlike Wald and LR tests, it requires fitting only the restricted 
model. This is an advantage if the restricted model is easier to fit, such as a 
homoskedastic model rather than a heteroskedastic model. Furthermore, an 
asymptotically equivalent version of the LM test can often be computed by 
the use of an auxiliary regression. On the other hand, there is generally no 
universal way to implement an LM test, unlike Wald and LR tests. If the LM 
test rejects the restrictions, we then still need to fit the unrestricted model. 


11.5.2 The estat commands 


Because LM tests are estimator specific and model specific, there is no
lmtest command. Instead, LM tests usually appear as estat postestimation
commands to test misspecifications.


A leading example is the estat hettest command to test for 
heteroskedasticity after regress. This LM test is implemented by auxiliary 
regression, which is detailed in section 3.7. The default version of the test 
requires that under the null hypothesis, the independent homoskedastic 
errors must be normally distributed, whereas the iid option relaxes the 
normality assumption to one of independent and identically distributed 
errors. 


Another example is the xttest0 command to implement an LM test for
random effects after xtreg. Yet another example is the LM test for 
overdispersion in the Poisson model, given in an end-of-chapter exercise. 


11.5.3 LM test by auxiliary regression 


For ML estimation with a correctly specified density, an asymptotically 
equivalent version of the LM statistic can always be obtained from the 
following auxiliary procedure. First, obtain the restricted MLE
$\widetilde{\theta}_r$. Second, form the scores for each observation of the
unrestricted model, $s_i(\theta) = \partial \ln f(y_i|x_i, \theta)/\partial \theta$,
and evaluate them at $\widetilde{\theta}_r$ to give $s_i(\widetilde{\theta}_r)$.
Third, compute $N$ times the uncentered $R^2$ (or, equivalently, the model sum of
squares) from the auxiliary regression of 1 on $s_i(\widetilde{\theta}_r)$.


It is easy to obtain restricted model scores evaluated at the restricted MLE 
or unrestricted model scores evaluated at the unrestricted MLE. However, this 
auxiliary regression requires computation of the unrestricted model scores 
evaluated at the restricted MLE. If the parameter restrictions are linear, then 
these scores can be obtained by using the constraint command to define 
the restrictions before estimation of the unrestricted model. 
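The mechanics of the auxiliary regression are easiest to see in a deliberately simple case with a single score. The Python sketch below is a toy illustration, not the negative binomial example: it assumes $y_i \sim N(\mu, 1)$ and tests $H_0\colon \mu = 0$, so there are no free parameters under the null and the score evaluated at the restricted value is simply $s_i = y_i$. The data values and the function name are hypothetical.

```python
def lm_from_auxiliary_regression(scores):
    """N times the uncentered R-squared from regressing 1 on a single score."""
    n = len(scores)
    sum_s = sum(scores)
    sum_s2 = sum(s * s for s in scores)
    coef = sum_s / sum_s2                  # OLS of 1 on s_i without a constant
    model_ss = coef ** 2 * sum_s2          # sum of squared fitted values
    r2_uncentered = model_ss / n           # total uncentered sum of squares is N
    return n * r2_uncentered               # equals the model sum of squares


# Hypothetical data; under H0: mu = 0 the score for observation i is y_i
y = [0.3, -1.1, 0.8, 1.9, 0.4]
lm = lm_from_auxiliary_regression(y)
print(round(lm, 4))
```

Because the dependent variable is identically 1, the uncentered total sum of squares equals $N$, which is why $N$ times the uncentered $R^2$ coincides with the model sum of squares. With several scores, the same calculation requires a multiple regression, as in the Stata example that follows.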


We illustrate this method for the LM test of
$H_0\colon \beta_{private} = 0, \beta_{chronic} = 0$ in a negative binomial model for
docvis that, when unrestricted, includes as regressors an intercept, female, income,
private, and chronic. The restricted MLE $\widetilde{\beta}_r$ is then obtained by
negative binomial regression of docvis on all these regressors, subject to the
constraint that $\beta_{private} = 0$ and $\beta_{chronic} = 0$. The two constraints
are defined by using the constraint command, and the restricted estimates of the
unrestricted model are obtained using the nbreg command with the constraints()
option. Scores can be obtained by using the predict command with the scores
option. However, these scores are derivatives of the log density with respect
to model indices (such as $x_i'\beta$) rather than with respect to each parameter.
Thus, following nbreg only two "scores" are given,
$\partial \ln f(y_i)/\partial x_i'\beta$ and $\partial \ln f(y_i)/\partial \alpha$.
These two scores are then expanded to $K + 1$ scores
$\partial \ln f(y_i)/\partial \beta_j = \{\partial \ln f(y_i)/\partial x_i'\beta\} \times x_{ij}$,
$j = 1, \ldots, K$, where $K$ is the number of regressors in the unrestricted model,
and $\partial \ln f(y_i)/\partial \alpha$, where $\alpha$ is the scalar overdispersion
parameter. Then, 1 is regressed on these $K + 1$ scores.


We have 


. * Perform LM test that b_private=0, b_chronic=0 using auxiliary regression 
. qui use mus210mepsdocvisyoung, clear 


. qui keep if year02==1
. generate one = 1
. constraint define 1 private = 0
. constraint define 2 chronic = 0
. qui nbreg docvis female income private chronic, constraints(1 2)
. predict eqscore ascore, scores
. generate s1restb = eqscore*one
. generate s2restb = eqscore*female
. generate s3restb = eqscore*income
. generate s4restb = eqscore*private
. generate s5restb = eqscore*chronic
. generate salpha = ascore*one
. qui regress one s1restb s2restb s3restb s4restb s5restb salpha, noconstant
. scalar lm = e(N)*e(r2)

. display "LM = N x uncentered Rsq = " lm " and p = " chi2tail(2,lm)
LM = N x uncentered Rsq = 424.17616 and p = 7.786e-93


The null hypothesis is strongly rejected with LM = 424. By comparison, in 
section 11.4.2, the asymptotically equivalent LR and Wald statistics for the 
same hypothesis were, respectively, 809 and 852. 


The divergence of these purportedly asymptotically equivalent tests is 
surprising given the large sample size of 4,412 observations. One 
explanation, always a possibility with real data, is that the unknown data- 
generating process (DGP) for these data is not the fitted negative binomial 
model; the asymptotic equivalence holds only under $H_0$, which includes
correct model specification. A second explanation is that this LM test has 
poor size properties even in relatively large samples. This explanation could 
be pursued by adapting the simulation exercise in section 11.7 to one for the 
LM test with data generated from a negative binomial model. 


Often, more than one auxiliary regression is available to implement a 
specific LM test. The easiest way to implement an LM test is to find a 
reference that defines the auxiliary regression for the example at hand and 
then implement the regression. For example, to test for heteroskedasticity in
the linear regression model that depends on variables $\mathbf{z}_i$, we calculate $N$
times the uncentered explained sum of squares from the regression of
squared OLS residuals $\widehat{u}_i^2$ on an intercept and $\mathbf{z}_i$; all that is needed is the
computation of $\widehat{u}_i$. In this case, estat hettest implements this anyway.


The auxiliary regression versions of the LM test are known to have poor 
size properties, though in principle these can be overcome by using the 
bootstrap with asymptotic refinement. For example, see Cribari-Neto and 
Zarkos (1999). 


11.6 Multiple testing 


Standard theory assumes that hypothesis tests are done once only and in 
isolation. In some social sciences and biological sciences, many experiments 
on a phenomenon may be run by different researchers, with only studies 
statistically significant at level 0.05 being published because of publication 
bias. Then the multiple testing methods presented below are not feasible 
because of lack of knowledge of the unpublished studies. In such settings, a 
highly cited study (Benjamin et al. 2018) has proposed that for claims of new
discoveries, the default p-value threshold for statistical significance should
be 0.005. For a standard asymptotic two-sided Wald test, this corresponds to
rejection if $|t| > 2.81$.


Many empirical studies in economics include multiple tests or multiple 
comparisons that simultaneously test several hypotheses. Examples include 
testing the statistical significance of a key regressor in several subgroups of 
the sample (subgroup analysis); testing the statistical significance of a key 
regressor in regressions on a range of outcomes (multiple outcomes); and 
testing the significance of a wide range of variables on a single outcome 
(multiple regressors), such as which specific genes are related to a particular 
form of cancer. 


Performing many such tests at standard significance levels such as five 
percent is very likely to lead to spurious statistical significance. For example, 
in 100 independent tests at level 0.05, we expect on average to obtain 5 
statistically significant results, and at least 1 statistically significant result 
with high probability, even if no effect is present. In such cases, one should 
view the entire battery of tests as a unit and appropriately adjust the 
significance levels of individual tests. 
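

The arithmetic behind this claim can be checked directly. The following lines are a minimal sketch, assuming that the 100 tests are independent and that every null hypothesis is true; the local macro names are ours and are not part of any Stata command.

```stata
* Spurious significance in m independent tests when no effect is present
local m 100
local alpha 0.05
* Expected number of false rejections: m x alpha
display "Expected false rejections        = " `m'*`alpha'
* Probability that at least one of the m tests rejects
display "Pr(at least one false rejection) = " %6.4f 1-(1-`alpha')^`m'
* Per-test level that bounds this probability by alpha (Bonferroni)
display "Bonferroni level per test        = " `alpha'/`m'
```

Under these assumptions, the probability of at least one spurious rejection is about 0.994, which is why the battery of tests needs to be treated as a unit.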


Controlling for multiple testing makes it more difficult to reject the null 
hypothesis. Furthermore, the simplest methods developed for multiple testing 
tend to be conservative, leading to low power and underrejection of the null 
hypothesis. The development of less conservative methods is a current area 
of research. 


11.6.1 Family-wise error rate 


We now consider testing $m$ null hypotheses $H_i$, $i = 1,\ldots,m$, each against
its own alternative. The FWER is the probability of incorrectly finding statistical
significance (a type I error) in at least one of the many tests conducted. The
goal is to test each of the individual hypotheses in such a way that the FWER is
controlled to be at most $\alpha$, so


$$\text{FWER} = \Pr(\text{incorrectly reject at least one hypothesis } H_i) \leq \alpha$$


This form of control, one appropriate for econometric studies, is called strong
control and controls the FWER regardless of which null
hypotheses are true and which are false. An alternative form of control, called
weak control, controls the FWER only under the additional condition that all $m$
null hypotheses are true.


Several methods have been proposed to determine the level of each
individual test that leads to a test with a desired FWER $\alpha$.


A correction that can always be used is the Bonferroni correction that
performs each individual test at level $\alpha_i = \alpha/m$.


For example, if $m = 10$ tests are conducted with FWER $= 0.05$, then
each test should be at level $\alpha_i = 0.05/10 = 0.005$. The motivation is
Bonferroni’s inequality that $\Pr(A \cup B) \leq \Pr(A) + \Pr(B)$, so if two tests are
performed, each at level $\alpha/2$, then the probability that at least one of the two
tests rejects is at most $\alpha/2 + \alpha/2 = \alpha$. The Bonferroni correction leads to
conservative tests because it provides an upper bound to the FWER.


Holm’s step-down procedure provides a less conservative refinement of
the Bonferroni correction. The $m$ individual p-values are ordered in
increasing value, say, $p_i$, $i = 1,\ldots,m$. Then, using Bonferroni’s method, we
reject the hypothesis corresponding to the lowest p-value if $p_1 < \alpha/m$. Given
rejection, there are then $(m - 1)$ remaining hypotheses to simultaneously
test, so the next test rejects if $p_2 < \alpha/(m - 1)$, and so on. Thus, we test at
increasing test levels $\alpha_i = \alpha/(m - i + 1)$, stopping at the first $i$ for
which $p_i > \alpha_i$.


Consider the following example with $\alpha = 0.05$ and $m = 8$. The second
row gives the ordered test levels $\alpha_i = 0.05/(9 - i)$, and we suppose the
ordered p-values are those given in the third row.


  $i$            1        2        3        4        5        6        7        8
  $\alpha_i$   0.00625  0.00714  0.00833  0.010    0.0125   0.01667  0.025    0.05
  $p_i$        0.002    0.004    0.008    0.012    0.015    0.016    0.034    0.058


Then we reject the three hypotheses corresponding to the three smallest
p-values because $p_1 < \alpha_1$, $p_2 < \alpha_2$, $p_3 < \alpha_3$, but $p_4 = 0.012 > \alpha_4 = 0.010$.
The individual tests use the corrected critical p-value of $\alpha_4 = 0.010$,
compared with $0.05/8 = 0.00625$ using Bonferroni’s method.
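

Holm’s step-down rule for this example can be sketched in a few lines of Stata. This is a minimal illustration, assuming the eight p-values of the example are typed in by hand; the variable names pval, ord, holmlevel, and reject are ours.

```stata
* Holm step-down applied to the eight p-values of the example
clear
input double pval
0.002
0.004
0.008
0.012
0.015
0.016
0.034
0.058
end
local alpha 0.05
sort pval
generate int ord = _n
* Test level for the i-th smallest p-value: alpha/(m-i+1)
generate double holmlevel = `alpha'/(_N - ord + 1)
* Reject until the first ordered p-value that is not below its level
generate byte reject = sum(!(pval < holmlevel)) == 0
list ord holmlevel pval reject, clean noobs
count if reject
```

The running sum stays at zero only up to the first failure, so the count gives the number of hypotheses rejected by the step-down rule, here three.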


Hochberg’s step-up method is a variant of Holm’s method that begins
testing with the highest p-value. We reject the hypothesis corresponding to
the largest p-value if $p_m < \alpha$. If we do not reject, then we test using the
second-largest p-value and reject if $p_{m-1} < \alpha/2$, and so on. Thus, we
compare p-values to decreasing test levels $\alpha_i = \alpha/(m - i + 1)$,
$i = m, m-1, \ldots$, and at the first rejection we also reject all hypotheses
with smaller p-values. This
procedure has greater power than the Holm procedure but is not valid in all
circumstances. It is valid if the tests are independent or if the test statistics
have a distribution with multivariate total positivity of order 2, where $X_1$ and $X_2$
are positive dependent if $\Pr(X_1 \cap X_2) \geq \Pr(X_1) \times \Pr(X_2)$.


Applying Hochberg’s method to the preceding example, we see the first
two tests do not reject, because $p_8 > \alpha_8$ and $p_7 > \alpha_7$, while the third test
rejects because $p_6 = 0.016 < \alpha_6 = 0.01667$. The individual tests use the
corrected critical p-value of $\alpha_6 = 0.01667$, and we reject the six hypotheses
with the lowest six p-values.
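

The step-up rule can be sketched in the same way as a check on this example. The code below is an illustration only, assuming independent tests so that Hochberg’s method is valid; the variable and macro names are ours.

```stata
* Hochberg step-up applied to the eight p-values of the example
clear
input double pval
0.002
0.004
0.008
0.012
0.015
0.016
0.034
0.058
end
local alpha 0.05
sort pval
generate int ord = _n
* Same ordered levels as Holm: alpha/(m-i+1)
generate double level = `alpha'/(_N - ord + 1)
* Step-up: find the largest ordered p-value that is below its level
generate byte pass = pval < level
quietly summarize ord if pass
local k = cond(r(N) > 0, r(max), 0)
* Reject that hypothesis and all hypotheses with smaller p-values
generate byte reject = ord <= `k'
list ord level pval reject, clean noobs
display "Number of rejections = `k'"
```

Note that the fourth and fifth ordered p-values exceed their own levels but are still rejected, because the step-up rule rejects everything below the first passing p-value found from the top.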


By making additional assumptions, one can obtain less conservative tests.
In the case of $m$ statistically independent tests, each at level $\alpha^*$, the FWER is
exactly $\alpha = 1 - (1 - \alpha^*)^m$ because the probability under the null hypothesis
of finding no statistical significance in all $m$ tests is $(1 - \alpha^*)^m$. Going the
other way, if we wish the FWER to equal $\alpha$, then each of the $m$ tests should be
at level $\alpha^* = 1 - (1 - \alpha)^{1/m}$, called the Sidak correction. For example, if
$m = 10$ tests are conducted with FWER $= 0.05$, then each test should be at
level $\alpha^* = 1 - 0.95^{1/10} = 0.00512$, compared with 0.005 for the Bonferroni
correction. The Sidak correction is exact under independence of the tests and
is conservative if tests are nonnegatively mutually correlated. But it may
overreject if tests are negatively correlated. The Sidak adjustment can be
applied to some of the other methods, such as Holm’s step-down method. The
same $\alpha^*$ values can be used to form confidence intervals. For example,
consider Sidak’s method with $m = 10$ and $\alpha = 0.05$ so $\alpha^* = 0.00512$ and
$1 - \alpha^* = 0.99488$. An adjusted confidence interval following an estimation
command can be implemented using the option level(99.49), where at most
two decimal digits can be given in the level() option.


The methods can also be used to adjust p-values. Let $p$ be the usual
p-value from regression output. Then the Bonferroni-adjusted p-value from $m$
tests is the minimum of $m \times p$ and 1, while the Sidak-adjusted p-value is
$1 - (1 - p)^m$ and is always less than 1. For example, if $m = 10$ and an
individual test has $p = 0.04$, then the Bonferroni-adjusted p-value is
$\min(1, 0.40) = 0.40$, while the Sidak-adjusted p-value is $1 - 0.96^{10} = 0.335$.
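

These per-test levels and adjusted p-values are simple to compute with display. The lines below are a minimal sketch of the arithmetic for $m = 10$; the local macro names are ours.

```stata
* Bonferroni and Sidak per-test levels and adjusted p-values for m = 10
local m 10
local alpha 0.05
local p 0.04
display "Bonferroni level per test = " %7.5f `alpha'/`m'
display "Sidak level per test      = " %7.5f 1-(1-`alpha')^(1/`m')
display "Bonferroni-adjusted p     = " %5.3f min(1, `m'*`p')
display "Sidak-adjusted p          = " %5.3f 1-(1-`p')^`m'
```

The Sidak level is slightly larger than the Bonferroni level, and the Sidak-adjusted p-value is correspondingly smaller, which is the sense in which the Sidak correction is less conservative under independence.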


The Bonferroni and Holm methods can be adjusted to control $k$-FWER, the
probability of falsely rejecting $k$ or more of the $m$ hypotheses.


The test command has an option mtest() that computes these variously
adjusted p-values, but this is limited to tests of individual regressors in a
single-equation regression or to tests across equations in jointly estimated
equations such as tests following the sureg command.


The community-contributed multproc command (Newson and the ALSPAC 
Study Team 2003) performs a wide range of multiple testing procedures that 
control the FWER, as well as the FDR. It requires as inputs the uncorrected P- 
values from individual tests. Newson and the ALSPAC Study Team (2003) 
provide a summary of the methods used. 


11.6.2 Subgroup analysis 


We continue with the Poisson regression example of doctor visits. We now 
break the sample into four 10-year age groups and seek to determine in which 
of the age subgroups private insurance is statistically significant at 5%, 
controlling for the multiple testing. 


The following code creates four age subgroups in age ranges 25-34, 35-44,
45-54, and 55-64. For illustrative purposes, the first 500 observations are
used because the p-values for income become much smaller using all observations.


. * Multiple tests - form the age subgroups 25-34, 35-44, 45-54, 55-64
. qui use mus210mepsdocvisyoung, clear

. keep if _n <= 500
(29,624 observations deleted)

. generate agegroup = round(age)-2    // Age is in tens of years

. tabulate agegroup

   agegroup |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        190       38.00       38.00
          2 |        153       30.60       68.60
          3 |        107       21.40       90.00
          4 |         50       10.00      100.00
------------+-----------------------------------
      Total |        500      100.00


The saved output from the poisson command does not include p-values,
so we manually compute these using the standard normal distribution. The
Bonferroni correction is then $\min(1, 4p)$ and the Sidak correction is
$1 - (1 - p)^4$.


. * Multiple tests for multiple subgroups using p-value corrections
. global nummodels 4

. forvalues i = 1/$nummodels {
  2.   qui poisson docvis private chronic female income
>         if agegroup==`i', vce(robust)
  3.   scalar beta = _b[private]
  4.   scalar p = 2*(1-normal(abs(_b[private]/_se[private])))
  5.   scalar pbonferroni = min(1,p*$nummodels)
  6.   scalar psidak = 1-(1-p)^$nummodels
  7.   di "Group " `i' " b =" %7.4f beta " p-values:" " Usual =" %6.3f p
>         " Bonferroni =" %6.3f pbonferroni " Sidak =" %6.3f psidak
  8. }
Group 1 b = 0.8409 p-values: Usual = 0.032 Bonferroni = 0.128 Sidak = 0.122
Group 2 b = 1.3040 p-values: Usual = 0.000 Bonferroni = 0.002 Sidak = 0.002
Group 3 b = 0.0530 p-values: Usual = 0.919 Bonferroni = 1.000 Sidak = 1.000
Group 4 b = 1.1587 p-values: Usual = 0.024 Bonferroni = 0.096 Sidak = 0.093


Using the usual unadjusted p-values, we see private insurance appears
statistically significant at level 0.05 for those in all age groups except the
45-54 bracket. Once we correct for multiple testing, however, private insurance
is statistically significant only in the 35-44 age bracket.


Holm’s method yields the same result. The ordered p-values are 0.000,
0.024, 0.032, and 0.919, while the ordered test levels $\alpha_i = 0.05/(5 - i)$ are
0.0125, 0.01667, 0.025, and 0.05. We use $\alpha = 0.01667$ because
$0.000 < 0.0125$ but $0.024 > 0.01667$, and only the 35-44 bracket has
$p < 0.01667$.


multproc community-contributed command 


The community-contributed multproc command (Newson and the ALSPAC 
Study Team 2003) uses as inputs the P-values from various tests. Here we 
apply the command to the preceding four subgroup tests, using the
method(holm) option to implement Holm’s step-down method. The various
method() options cover 12 different multiple testing methods.


The example demonstrates various options of the command, though only 
the first three options are essential. 


. * Multiple tests using community-contributed command multproc and Holm’s method
. input pvalues

       pvalues
  1. .032
  2. .000
  3. .919
  4. .024
  5. end

. multproc, puncor(0.05) pvalue(pvalues) method(holm) rank(prank)
>     gpuncor(alpha) gpcor(indivalpha) nhcred(credible) reject(reject)
Method: holm
Uncorrected overall critical P-value: .05
Number of P-values: 4
Corrected overall critical P-value: .01666667
Number of rejected P-values: 1

. list pvalues indivalpha reject credible prank alpha in 1/4, clean

       pvalues   indival~a   reject   credible   prank   alpha
  1.      .032   .01666667        0          1       3     .05
  2.         0   .01666667        1          0       1     .05
  3.      .919   .01666667        0          1       4     .05
  4.      .024   .01666667        0          1       2     .05


As in the preceding results, the corrected critical value is 0.01667 and the null
hypothesis of zero coefficient for private is rejected in only the second age
group.


mtest option of the test command


Stata’s test command with the mtest option is intended for use after a single 
regression, rather than from four separate regressions. 


So we combine the four separate regressions into a single regression with 
regressors that are fully interacted with variable agegroup. To ensure a 
different intercept in each regression, we manually include an intercept 
(variable one) and then add the noconstant option. The complicated names 
for each coefficient in this interacted regression can be obtained in several 
ways, including using command matrix list e(b). We use the option 
mtest(sidak), which gives Sidak-corrected p-values. These are valid here
because the four subgroups are independent of each other. 


. * Multiple tests for multiple subgroups using mtest option of test
. qui use mus210mepsdocvisyoung, clear

. keep if _n <= 500
(29,624 observations deleted)

. generate agegroup = round(age)-2    // Age is in tens of years

. generate one = 1

. qui poisson docvis i.agegroup#c.(one private chronic female income),
>     noconstant vce(robust)

. qui matrix list e(b)    // Yields the complicated names for the coefficients

. test ([docvis]:1b.agegroup#c.private) ([docvis]:2.agegroup#c.private=0)
>     ([docvis]:3.agegroup#c.private=0) ([docvis]:4.agegroup#c.private=0),
>     mtest(s)

 ( 1)  [docvis]1b.agegroup#c.private = 0
 ( 2)  [docvis]2.agegroup#c.private = 0
 ( 3)  [docvis]3.agegroup#c.private = 0
 ( 4)  [docvis]4.agegroup#c.private = 0

       |     chi2     df     p > chi2
-------+------------------------------
  (1)  |     4.62      1      0.1209*
  (2)  |    12.61      1      0.0015*
  (3)  |     0.01      1      1.0000*
  (4)  |     5.18      1      0.0882*
-------+------------------------------
  All  |    22.41      4      0.0002

       * Sidak-adjusted p-values


The Poisson coefficients from this combined regression, not reported, are
exactly the same as those obtained from running separate regressions for each
of the four subgroups. The robust standard errors differ slightly, because they
are computed using a finite-sample correction factor $N/(N - K)$ that equals
$500/(500 - 20)$ for the combined sample and, for example, $190/(190 - 5)$ in
the youngest age group. Thus, the Sidak-corrected p-values are slightly
different from those obtained earlier.


11.6.3 Multiple outcomes 


As a pedagogical example using the same dataset, we consider the outcomes
number of doctor visits, whether injured, and years of schooling, and we test
the statistical significance of income for each outcome. For code brevity,
all three models are fit by OLS, though in practice one would use, for
example, Poisson, logit, and OLS for these three different types of outcomes.


. * Multiple tests for multiple outcomes using p-value corrections
. global nummodels 3

. global ylist docvis injury educ

. foreach var of varlist $ylist {
  2.   qui regress `var' private chronic female income, vce(robust)
  3.   scalar p = 2*ttail(e(df_r),abs(_b[income]/_se[income]))
  4.   scalar pbonferroni = min(1,p*$nummodels)
  5.   scalar psidak = 1-(1-p)^$nummodels
  6.   di "Outcome = " "`var'" _col(19) " p-values:" " Usual =" %7.4f p
>         " Bonferroni =" %7.4f pbonferroni " Sidak =" %7.4f psidak
  7. }
Outcome = docvis   p-values: Usual = 0.2334 Bonferroni = 0.7003 Sidak = 0.5495
Outcome = injury   p-values: Usual = 0.0314 Bonferroni = 0.0943 Sidak = 0.0914
Outcome = educ     p-values: Usual = 0.0000 Bonferroni = 0.0000 Sidak = 0.0000


Again, the p-values can become much larger when we adjust test size for 
multiple testing, and income is statistically significant in only the educ 
regression if the FWER is set to 0.05. 


And again, the community-contributed multproc command could be used 
to implement various multiple testing procedures, such as Holm’s step-down 
procedure. 


11.6.4 False discovery rate 


The FWER controls the probability of erroneously finding statistical 
significance in even one of the tests conducted. This approach is therefore 
very stringent when many tests are being conducted, such as determining 
which of many genes is associated with a particular cancer. 


An alternative approach allows a fraction of the tests to erroneously find 
statistical significance, leading to an increase in test power at the expense of 
increased test size. 


For multiple tests, define the false discovery proportion (FDP) to be the 
proportion of rejected hypotheses that are falsely rejected, so 


FDP = F/R 


where $F$ = number of false rejections, $R$ = total number of rejections, and
FDP $= 0$ if $R = 0$. One possible approach controls the probability that the
FDP exceeds a prespecified fraction $\gamma$, where $\gamma = 0$ corresponds to controlling
the FWER.


Benjamini and Hochberg (1995) instead proposed controlling the FDR, 
which is defined to be the expected FDP; that is, 


FDR = E(FDP) 


Their method orders tests by p-value from smallest to largest, so
$p_1 \leq p_2 \leq \cdots \leq p_m$. We then reject the corresponding ordered hypotheses
$H_1,\ldots,H_k$, where $k$ is the largest $i$ for which $p_i \leq \text{FDR} \times i/m$ and the FDR is
the prespecified FDR for the multiple tests. If the multiple tests are
independent, then the size of the test equals FDR.


For example, suppose we perform 100 tests and the desired FDR is 10%. 
Then FDR = 0.10, and the ordered critical values are 0.001, 0.002, 0.003,.... 
If the seventh-ordered p-value is 0.0065 and the eighth-ordered p-value is 
0.0084, then we would reject the seven hypotheses with the seven lowest p- 
values. 
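

The Benjamini and Hochberg rule can be sketched in a few lines. As an illustration only, we reuse the eight ordered p-values from the Holm example of section 11.6.1 and assume a target FDR of 0.05; the variable and macro names are ours, and the tests are assumed to be independent.

```stata
* Benjamini-Hochberg step-up rule applied to eight illustrative p-values
clear
input double pval
0.002
0.004
0.008
0.012
0.015
0.016
0.034
0.058
end
local fdr 0.05
sort pval
generate int ord = _n
* Ordered critical values FDR x i/m
generate double bhlevel = `fdr'*ord/_N
* k is the largest i for which p_i <= FDR x i/m
generate byte pass = pval <= bhlevel
quietly summarize ord if pass
local k = cond(r(N) > 0, r(max), 0)
generate byte reject = ord <= `k'
list ord bhlevel pval reject, clean noobs
display "Number of rejections = `k'"
```

With these p-values, the rule rejects seven hypotheses, compared with three for Holm’s method and six for Hochberg’s method at FWER 0.05, which illustrates the gain in power from controlling the FDR rather than the FWER.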


The Benjamini-Hochberg procedure assumed that tests are statistically
independent. Benjamini and Yekutieli (2001) showed that the test remains 
valid under some forms of dependence but is then conservative. Furthermore, 
Benjamini and Yekutieli (2001) proposed a more conservative variation of 
the method that is valid for any form of dependence. 


The community-contributed add-on multproc performs a wide range of 
multiple test procedures that control for FDR, in addition to FWER, presented 
earlier. Newson and the ALSPAC Study Team (2003) provide a summary of the 
methods used. 


11.6.5 More powerful multiple tests 


In practice, hypothesis tests are not independent and are often positively 
correlated. Failure to adjust for this correlation leads to conservative tests 
with actual test size being much less than the nominal test size and hence 
leads to a substantial loss of test power. This has led to a range of refinements 
to multiple testing methods and is still an active area of research. 


Many of the improvements in power use resampling methods to account 
for dependence. Romano and Wolf (2005) provide an overview. Anderson 
et al. (2008) provide an application of a resampling-based step-down 
procedure for both FWER and FDR based on adjusted p-values. 

Farcomeni (2008) provides an extensive guide to the multiple testing 
literature. List, Shaikh, and Xu (2019) consider the special case of 
experimental data in which units are assigned to treatments and control using 
simple random sampling. They use a bootstrap-based procedure for multiple 
testing that asymptotically controls the FWER and has much more power than 
existing methods because it incorporates information about dependence 
between the multiple tests. The authors provide a community-contributed 
command mhtexp to implement this method. 


11.7 Test size and power 


We consider computation of the test size and power of a Wald test by Monte 
Carlo simulation. The goal is to determine whether tests that are intended to 
reject at, say, a 0.05 level really do reject at a 0.05 level and to determine the 
power of tests against meaningful parameter values under the alternative 
hypothesis. This extends the analysis of section 5.6, which focused on the 
use of simulation to check the properties of estimators of parameters and 
estimators of standard errors. Here we instead focus on inference. 


11.7.1 Simulation DGP: OLS with chi-squared errors 


The DGP is the same as that in section 5.6, with data generated from a linear
model with skewed errors, specifically,


$$y = \beta_1 + \beta_2 x + u; \quad u \sim \chi^2(1) - 1, \quad x \sim \chi^2(1)$$


where $\beta_1 = 1$, $\beta_2 = 2$, and the sample size $N = 150$. The $[\chi^2(1) - 1]$ errors
have a mean of 0 and a variance of 2 and are skewed.


In each simulation, both $y$ and $x$ are redrawn, corresponding to random
sampling of individuals. We investigate the size and power of $t$ tests on
$H_0$: $\beta_2 = 2$, the DGP value, after OLS regression. The tests are based on
default standard errors.


11.7.2 Test size 


In testing $H_0$, we can make the error of rejecting $H_0$ when $H_0$ is true. This is
called a type I error. The test size is the probability of making this error.
Thus,


$$\text{Size} = \Pr(\text{Reject } H_0 \mid H_0 \text{ true})$$


The reported p-value of a test is the estimated size of the test. Most
commonly, we reject $H_0$ if the reported p-value is less than 0.05.


The most serious error is one of incorrect test size, even asymptotically, 
because of, for example, the use of inconsistent estimates of standard errors 
if a Wald test is used. Even if this threshold is passed, a test is said to have 
poor finite-sample size properties or, more simply, poor finite-sample 
properties, if the reported p-value is a poor estimate of the true size. Often, 
the problem is that the reported p-value is much lower than the true size, so 
we reject $H_0$ more often than we should.


For our example with DGP value of $\beta_2 = 2$, we want to use simulations to
estimate the size of an $\alpha$-level test of $H_0$: $\beta_2 = 2$ against $H_a$: $\beta_2 \neq 2$. In
section 5.6.2, we did so when $\alpha = 0.05$ by counting the proportion of
simulations that led to rejection of $H_0$ at a level of $\alpha = 0.05$. The estimated
size was 0.050 because 50 of the 1,000 simulations led to rejection of $H_0$. In
general, we do not expect exactly 50 simulations to reject because there is
simulation randomness.


A computationally more efficient procedure is to compute the p-value for
the test of $H_0$: $\beta_2 = 2$ against $H_a$: $\beta_2 \neq 2$ in each of the 1,000 simulations
because the 1,000 p-values can then be used to estimate the test size for a
range of values of $\alpha$, as we demonstrate below. The p-values were computed
in the chi2data program defined in section 5.6.1 and returned as the scalar 
p2, but these p-values were not used in any subsequent analysis of test size. 
We now do so here. 


The simulations in section 5.6 were performed by using the simulate
command. Here we instead use postfile and a forvalues loop; the code for
the 1,000 simulations is


. * Do 1,000 simulations where each gets p-value of test of b2=2 
. set seed 10101 


. postfile sim pvalues using pvalues, replace 


. forvalues i = 1/1000 { 
2. drop _all 


3. qui set obs 150 

4. qui generate double x = rchi2(1) 

5. qui generate y = 1 + 2*x + rchi2(1)-1 

6. qui regress y x 

7. qui test x = 2

8. scalar p = r(p) // p-value for test this simulation 
9. post sim (p) 

10. } 


. postclose sim 


The simulations produce 1,000 p-values that range from 0 to 1. 


. * Summarize the p-value from each of the 1,000 tests
. use pvalues, clear

. summarize pvalues

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     pvalues |      1,000    .5027676    .2836176   .0000208   .9994959


These should actually have a uniform distribution, and giving the command 
histogram pvalues reveals that this is the case. 


Given the 1,000 values of pvalues, we can find the actual size of the test
for any choice of $\alpha$. For a test at the $\alpha = 0.05$ level, we obtain


. * Determine size of test at level 0.05
. count if pvalues < .05
  50


. display "Test size from 1000 simulations = " r(N)/1000 
Test size from 1000 simulations = .05 


The actual test size equals the nominal size of 0.05, though more generally 
we expect some discrepancy due to simulation randomness. Furthermore, it 
is exactly the same value as that obtained in section 5.6.1 because the same 
seed and sequence of commands were used there. 


The test size is not estimated exactly because of simulation error. If the
true size equals the nominal size of $\alpha$, then the proportion of times $H_0$ is
rejected in $S$ simulations is a random variable with a mean of $\alpha$ and a
standard deviation of $\sqrt{\alpha(1 - \alpha)/S} \simeq 0.007$ when $S = 1000$ and $\alpha = 0.05$.
When we use a normal approximation, the 95% simulation interval for this
simulation is [0.036, 0.064], and the value we obtained of 0.050 is within
this interval. More precisely, the cii command yields an exact binomial
confidence interval.


. * 95% simulation interval using exact binomial at level 0.05 with S=1000 
. cii proportions 1000 50 


                                                          Binomial exact
    Variable |        Obs  Proportion    Std. err.     [95% conf. interval]
-------------+---------------------------------------------------------------
             |      1,000         .05     .006892      .0373354    .0653905


With S = 1000, the 95% simulation interval is [0.037, 0.065]. With 
S = 10000 simulations, this interval narrows to [0.046, 0.054]. 
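

The narrower interval for $S = 10000$ can be checked in the same way. The lines below are a minimal sketch, assuming that exactly 5% of the 10,000 simulations reject (500 rejections); the second command gives the corresponding normal approximation.

```stata
* 95% simulation interval at level 0.05 with S=10000
* Exact binomial interval for 500 rejections in 10,000 simulations
cii proportions 10000 500
* Normal approximation: alpha +/- z x sqrt(alpha(1-alpha)/S)
display "Normal approximation: [" ///
    %5.3f 0.05-invnormal(0.975)*sqrt(0.05*0.95/10000) ", " ///
    %5.3f 0.05+invnormal(0.975)*sqrt(0.05*0.95/10000) "]"
```

Because the standard deviation falls with the square root of $S$, a tenfold increase in the number of simulations narrows the interval by a factor of roughly three.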


In general, tests rely on asymptotic theory, and we do not expect the true 
size to exactly equal the nominal size unless the sample size N is very large 
and the number of simulations $S$ is very large. In this example, with 150
observations and only 1 regressor, the asymptotic theory performs well even 
though the model error is skewed. 


11.7.3 Simulation using actual data 


In the preceding example, the dataset was one artificially constructed from a
completely specified DGP. What if one instead wants to base the simulation
on data very similar to the dataset at hand?


One approach is to fit a regression model using actual data and then use 
this model to generate in each simulation round a new vector of dependent 
variables. 


Building on the preceding example, we might with real data do OLS
regression of $y_i$ on $\mathbf{x}_i$ giving estimates $\widehat{\boldsymbol{\beta}}$ and $\widehat{\sigma}^2$. Then, in the $s$th simulation
generate $y_{is} = \mathbf{x}_i'\widehat{\boldsymbol{\beta}} + u_{is}$, $i = 1,\ldots,N$, where $u_{is}$ is a draw from a
right-skewed distribution with mean 0 and variance $\widehat{\sigma}^2$. OLS estimation then gives
the $s$th estimate $\widehat{\boldsymbol{\beta}}_s$. One could then, for example, test at 5% in each
simulation the hypothesis that $\beta_j$, the $j$th component of $\boldsymbol{\beta}$, equals the DGP
value of $\beta_j$, which is just the original OLS estimate $\widehat{\beta}_j$. The hypothesis should
be rejected in 5% of simulations. A variation could generate data with
$\beta_j = 0$ and test whether $\beta_j = 0$.
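

A minimal sketch of this first approach follows. It is an illustration under stated assumptions rather than code from the text: we use Stata’s shipped auto dataset as a stand-in for the dataset at hand, a single regressor, 500 simulations, and scaled chi-squared errors $\widehat{\sigma}\{\chi^2(1) - 1\}/\sqrt{2}$, which are right-skewed with mean 0 and variance $\widehat{\sigma}^2$. All variable, scalar, and macro names are ours.

```stata
* Size of a nominal 5% test when y is regenerated from a model fit to actual data
sysuse auto, clear                       // stand-in dataset (an assumption)
set seed 10101
quietly regress price weight
local bhat = _b[weight]                  // DGP value of the slope
scalar sigmahat = e(rmse)                // DGP error standard deviation
quietly predict double xbhat, xb         // fitted index x'b, held fixed
local S 500
local nreject 0
forvalues s = 1/`S' {
    capture drop ysim
    * New dependent variable: fitted index plus skewed mean-zero error
    quietly generate double ysim = xbhat + sigmahat*(rchi2(1)-1)/sqrt(2)
    quietly regress ysim weight
    quietly test weight = `bhat'
    local nreject = `nreject' + (r(p) < 0.05)
}
display "Estimated size of nominal 5% test = " %5.3f `nreject'/`S'
```

The regressors are held at their sample values, so only the dependent variable changes across simulations; the estimated size should be compared with 0.05 after allowing for simulation error.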


A second approach called a placebo approach does the following. Let $z_i$
denote the key regressor of interest and $\mathbf{x}_i$ denote the other control variables.
In the $s$th simulation, generate $z_{is}$, $i = 1,\ldots,N$, where $z_{is}$ has properties
similar to those of the original dataset $z_i$. Then, in each simulation, perform
OLS regression of $y_i$ on $z_{is}$ and $\mathbf{x}_i$. Because $z_{is}$ was generated independently
of the data on $y_i$ and $\mathbf{x}_i$, we expect no relationship so that tests at level 0.05
of the hypothesis that the coefficient of $z_{is}$ equals 0 should reject 5% of the
time.
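

The placebo approach can be sketched as follows. Again this is only an illustration under stated assumptions: the auto dataset stands in for the data at hand, mpg plays the role of the key regressor $z_i$, weight is the control, and the placebo regressor is drawn from a normal distribution with the same mean and standard deviation as mpg. The names zsim, zmean, and zsd are ours.

```stata
* Placebo test: size of a nominal 5% test on an independently generated regressor
sysuse auto, clear                       // stand-in dataset (an assumption)
set seed 10101
quietly summarize mpg
scalar zmean = r(mean)
scalar zsd = r(sd)
local S 500
local nreject 0
forvalues s = 1/`S' {
    capture drop zsim
    * Placebo regressor with the same mean and standard deviation as mpg
    quietly generate double zsim = rnormal(zmean, zsd)
    quietly regress price zsim weight
    quietly test zsim
    local nreject = `nreject' + (r(p) < 0.05)
}
display "Placebo rejection rate at 5% = " %5.3f `nreject'/`S'
```

A normal draw matches only the first two moments of the original regressor; an alternative that preserves the full marginal distribution is to randomly permute the observed values of $z_i$ across observations.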


For simplicity, we considered OLS estimation with independent data, but 
the methods can clearly be adapted to other estimators and to correlated data 
such as clustered data. 


11.7.4 Test power 


A second error in testing, called a type II error, is to fail to reject $H_0$ when
we should reject $H_0$. The power of a test is one minus the probability of
making this error. Thus,


$$\text{Power} = \Pr(\text{Reject } H_0 \mid H_0 \text{ false})$$


Ideally, test size is minimized and test power is maximized, but there is a 
tradeoff with smaller size leading to lower power. The standard procedure is 
to set the size at a level such as 0.05 and then use the test procedure that is 
known from theory or simulations to have the highest power. 


The power of a test is not reported because it needs to be evaluated at a
specific $H_a$ value, and the alternative hypothesis $H_a$ defines a range of
values for $\beta$ rather than one single value.


We compute the power of our test of $H_0$: $\beta_2 = \beta_2^{H_0}$ against $H_a$: $\beta_2 = \beta_2^{H_a}$,
where $\beta_2^{H_a}$ takes on a range of values. We do so by first writing a program
that determines the power for a given value $\beta_2^{H_a}$ and then calling this
program many times to evaluate at the many values of $\beta_2^{H_a}$.


The program is essentially the same as that used to determine test size,
except that the command generating y becomes generate y = 1 + `b2Ha'*x +
rchi2(1)-1. We allow more flexibility by allowing the user to pass the
number of simulations, sample size, $H_0$ value of $\beta_2$, $H_a$ value of $\beta_2$, and
nominal test size ($\alpha$) as the arguments, respectively, numsims, numobs, b2H0,
b2Ha, and nominalsize. The r-class program returns the computed power of
the test as the scalar r(power). We have


* Program to compute power of test given specified H0 and Ha values of b2
program mypower, rclass
    version 17
    args numsims numobs b2H0 b2Ha nominalsize
    // Setup before simulation loops
    drop _all
    set seed 10101
    postfile sim pvalues using power, replace
    // Simulation loop
    forvalues i = 1/`numsims' {
        drop _all
        quietly set obs `numobs'
        quietly generate double x = rchi2(1)
        quietly generate y = 1 + `b2Ha'*x + rchi2(1)-1
        quietly regress y x
        quietly test x = `b2H0'
        scalar p = r(p)
        post sim (p)
    }
    postclose sim
    use power, clear
    // Determine the size or power
    quietly count if pvalues < `nominalsize'
    return scalar power=r(N)/`numsims'
end


This program can also be used to find the size of the test of $H_0$: $\beta_2 = 2$
by setting $\beta_2^{H_a} = 2$. The following command obtains the size using 1,000
simulations and a sample size of 150 for a test of the nominal size 0.05.


. * Size = power of test of b2H0=2 when b2Ha=2, S=1000, N=150, alpha=0.05
. mypower 1000 150 2.00 2.00 0.05 
(file power.dta not found) 


. display r(power) " is the test size" 
.05 is the test size 


The program mypower uses exactly the same coding as that given earlier for
size computation; we have the same number of simulations and same sample
size, and we get the same size result of 0.05.


To find the test power, we set $\beta_2 = \beta_2^{H_a}$, where $\beta_2^{H_a}$ differs from the null
hypothesis value. Here we set $\beta_2^{H_a} = 2.2$, which is approximately 2.4
standard errors away from the $H_0$ value of 2.0 because, from section 5.6.1,
the standard error of the slope coefficient is 0.084. We obtain


. * Power of test of b2H0=2 when b2Ha=2.2, S=1000, N=150, alpha=0.05
. mypower 1000 150 2.00 2.20 0.05 


. display r(power) " is the test power" 
.663 is the test power 


Ideally, the probability of rejecting $H_0$: $\beta_2 = 2.0$ when $\beta_2 = 2.2$ is 1. In
fact, it is only 0.663.


We next evaluate the power for a range of values of $\beta_2^{H_a}$, here from 1.60
to 2.40 in increments of 0.025. We use the postfile command, which was
presented in section 5.3.4:


. * Power of test of H0:b2=2 against Ha:b2=1.6,1.625, ..., 2.4 
. postfile simofsims b2Ha power using simresults, replace 
(file simresults.dta not found) 


. forvalues i = 0/33 { 
  2.     drop _all 
  3.     scalar b2Ha = 1.6 + 0.025*`i' 
  4.     mypower 1000 150 2.00 b2Ha 0.05 
  5.     post simofsims (b2Ha) (r(power)) 
  6. } 

. postclose simofsims 


. use simresults, clear 


. summarize 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
        b2Ha |         34      2.0125    .2489562        1.6      2.425 
       power |         34    .6109412    .3480345        .05       .995 


The simplest way to see the relationship between power and $\beta_2^{H_a}$ is to plot 
the power curve. 


. * Plot the power curve 
. twoway (connected power b2Ha), scale(1.2) plotregion(style(none)) 

Figure 11.2. Power curve for the test of $H_0$: $\beta_2 = 2$ against 
$H_a$: $\beta_2 \neq 2$ when $\beta_2$ takes on the values $\beta_2^{H_a} = 1.6, \ldots, 2.4$ 
under $H_a$ and $N = 150$ and $S = 1000$ 


As you can see in figure 11.2, power is minimized at $\beta_2^{H_a} = \beta_2^{H_0} = 2$, and 
then the power equals the test size of 0.05, as desired. As $|\beta_2^{H_a} - \beta_2^{H_0}|$ increases, the 
power goes to 1, but power does not exceed 0.9 until $|\beta_2^{H_a} - \beta_2^{H_0}| > 0.3$. 


The power curve can be made smoother by increasing the number of 
simulations or by smoothing the curve by, for example, using predictions 
from a regression of power on a quartic in b2Ha. Alternatively, by 
appropriately writing the code to compute power, one can obtain graphs and 
tables using options of Stata's power command; see [PSS-2] power 
usermethod. 


11.7.5 Asymptotic test power 


The asymptotic power of the Wald test can be obtained without resorting to 
simulation. We do this now for the square of the $t$ test. 


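We consider $W = \{(\hat\beta_2 - 2)/s_{\hat\beta_2}\}^2$. Then $W \overset{a}{\sim} \chi^2(1)$ under 
$H_0$: $\beta_2 = 2$. It can be shown that under $H_a$: $\beta_2 = \beta_2^{H_a}$, the test statistic 
$W \overset{a}{\sim}$ noncentral $\chi^2(1; \lambda)$, where the noncentrality parameter 
$\lambda = (\beta_2^{H_a} - 2)^2/\sigma^2_{\hat\beta_2}$. [If $\mathbf{y} \sim N(\boldsymbol\delta, \mathbf{I})$, then $(\mathbf{y} - \boldsymbol\delta)'(\mathbf{y} - \boldsymbol\delta) \sim \chi^2(h)$ and 
$\mathbf{y}'\mathbf{y} \sim$ noncentral $\chi^2(h; \boldsymbol\delta'\boldsymbol\delta)$, where $h = \dim(\mathbf{y})$.] 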


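We again consider the power of a test of $\beta_2 = 2$ against $\beta_2^{H_a} = 2.2$. 
Then $\lambda = (\beta_2^{H_a} - 2)^2/\sigma^2_{\hat\beta_2} = (0.2/0.084)^2 = 5.67$, where we recall the 
earlier discussion that the DGP is such that $\sigma_{\hat\beta_2} = 0.084$. A $\chi^2(1)$ test rejects 
at a level of $\alpha = 0.05$ if $W > 1.96^2 = 3.84$. So the asymptotic test power 
equals $\Pr\{W > 3.84 \mid W \sim \text{noncentral } \chi^2(1; 5.67)\}$. The nchi2() function 
gives the relevant c.d.f., and we use 1-nchi2() to get the right tail. (The 
command below uses $\lambda = 5.634$, which corresponds to the less rounded 
standard error of about 0.0843.) We have 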


. * Power of chi(1) test when noncentrality parameter lambda = 5.634 
. display 1-nchi2(1,5.634,3.841) 
.66048164 


The asymptotic power of 0.660 is similar to the estimated power of 0.663 
from the Monte Carlo example of the preceding subsection. This closeness 
of asymptotic power to finite sample power is due to the relatively large 
sample size with just one regressor. 
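
The same right-tail probability can be checked without a noncentral chi-squared routine, because a noncentral $\chi^2(1; \lambda)$ variable is the square of a $N(\sqrt{\lambda}, 1)$ variable. The following Python sketch (standard library only; the function name wald_power_1df is ours, not Stata's) uses that fact. It should agree with the Stata result above to about three decimal places; small differences arise because Stata was given the rounded critical value 3.841.

```python
from statistics import NormalDist

def wald_power_1df(lam, alpha=0.05):
    """Pr{W > chi2(1) critical value} when W ~ noncentral chi2(1; lam).

    Uses W = Z**2 with Z ~ N(sqrt(lam), 1), so only the normal c.d.f. is needed.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    gamma = lam ** 0.5                           # effect in standard-error units
    # W > z_crit**2 means Z > z_crit or Z < -z_crit
    return std_normal.cdf(gamma - z_crit) + std_normal.cdf(-gamma - z_crit)

asymptotic_power = wald_power_1df(5.634)
print(round(asymptotic_power, 3))
```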


11.8 The power onemean command for multiple regression 


Let $\theta_0$ and $\theta_a$ denote values of $\theta$ under, respectively, the null and alternative 
hypotheses. Then the difference 


$$\delta = (\theta_a - \theta_0) \qquad (11.9)$$


is called the effect size. In econometrics applications, it is common for 
$\theta_0 = 0$. 


Power calculations, done especially in design of a lab experiment or field 
experiment, consider three standard calculations for a test of size $\alpha$. 


1. Calculate power ($\pi$) for a given effect size ($\delta$) and a given sample size 
($N$). 

2. Calculate minimum effect size necessary to attain a given level of 
power with a given sample size. 

3. Calculate the minimum sample size to attain a given power and desired 
effect. 


The power commands provide such calculations for tests on means, 
proportions, correlations, variances, and difference in these quantities, in an 
i.i.d. setting. Section 24.3 presents in detail the power twomeans command 
for power calculations for a binary treatment. The power command also 
includes methods for analysis of variance, linear regression (including slope 
coefficient in bivariate regression with i.i.d. normal errors), contingency 
tables, and survival analysis. Additionally, one can add one’s own method to 
the power command. 


In this section, we show how to adapt the power onemean command to 
the more general case of linear or nonlinear multiple regression that yields 
an estimate $\hat\theta$ that has an asymptotic normal distribution or, in the case of 
clustered data, has distribution approximated by a $t$ distribution with degrees 
of freedom the number of clusters minus one. 


11.8.1 The power onemean command 


The power onemean command was designed for testing hypotheses on the 
population mean in an i.i.d. sample. To compute test power in this setting, 
use the command 


power onemean m0 ma, n(numlist) [sd(numlist) options] 


where m0 and ma are values of the population mean under the null and 
alternative hypothesis, sd() is the standard deviation for a single 
observation, and n() is the number of observations. 


To instead compute minimum effect size necessary to attain a given level 
of power for a particular test size, add the power() option and drop ma. 


power onemean m0, n(numlist) power(numlist) [sd(numlist) options] 


And to compute minimum sample size necessary to attain a given level 
of power for a given effect size and test size, add the power() option and 
drop the n() option. 


power onemean m0 ma [, power(numlist) sd(numlist) options] 


The default is a two-sided test of size 0.05 using the $t(N - 1)$ 
distribution. Options include alpha() to set the size of the test to a value 
different from 0.05, knownsd to give the power of a standard normal test, and 
onesided for a one-sided test. By providing a numlist, one can perform 
calculations over a range of values. The table() option can provide more 
compact output. The graph() option can plot results across a range of 
specified power, sample size, or effect size values. 


Many of the other power commands have similar options, though most 
use the normal distribution rather than the $t$ distribution. For several power 
commands, the option rho() allows for clustered data, where rho() is the 
intracluster correlation. 


11.8.2 Power in a regression setting using the standard normal 


For bivariate linear regression with i.i.d. normal errors, the power oneslope 
command calculates power given, for example, the standard deviations of 
the regressor and errors. We instead analyze a much more general regression 
setting. 


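Consider a two-sided Wald test at level $\alpha$ of $H_0: \theta = \theta_0$ for general 
estimator $\hat\theta \overset{a}{\sim} N(\theta, s_{\hat\theta}^2)$, where $s_{\hat\theta}$ is the standard error of $\hat\theta$. 


When standard normal critical values are used, the power ($\pi$) of this test 
against $H_a: \theta = \theta_a$ equals the probability that $|(\hat\theta - \theta_0)/s_{\hat\theta}| > z_{\alpha/2}$ when 
$\hat\theta \sim N(\theta_a, s_{\hat\theta}^2)$ because then $(\hat\theta - \theta_a)/s_{\hat\theta} \sim N(0,1)$. It follows that 


$$\pi = \Phi(\gamma - z_{\alpha/2}) + \Phi(-\gamma - z_{\alpha/2}), \quad \gamma = \delta/s_{\hat\theta} = (\theta_a - \theta_0)/s_{\hat\theta} \qquad (11.10)$$


where by $z_{\alpha/2}$ we mean the area in the right tail is $\alpha/2$. 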


Statistical significance test with size 0.05 and power 0.80 


Before applying the power command, we note that a commonly used 
threshold in experimental design is that the effect size $\delta = (\theta_a - \theta_0)$ should 
be large enough that a test of size 0.05 has power of at least 0.80. Solving 
the preceding equation for $\gamma$ when $\alpha = 0.05$ and $\pi = 0.80$ yields 
$\gamma = 2.8016$. 


For a two-sided test of statistical significance ($\theta_0 = 0$), this norm means 
we should be able to detect an effect of magnitude $\theta_a/s_{\hat\theta} = 2.8016$ or one 
that is at least 2.8016 standard errors from 0. By comparison, the usual norm 
in econometric studies is to ignore power and use a much lower threshold of 
$z_{0.025} = 1.9600$. The higher threshold of 2.8016 corresponds to determining 
a meaningful effect if the test p-value is less than 0.005 because 
$z_{0.0025} = 2.8070$. Thus, and also because of concerns about data mining and 
model mining, some leading researchers in some branches of applied 
statistics such as psychology are advocating that statistical significance tests 
of new discoveries should use a threshold of $p < 0.005$ or, equivalently, 
$|z| > 2.80$. 


Similar calculations for a one-sided test of statistical significance yield 
$\gamma = \theta_a/s_{\hat\theta} = 2.4865$, so tests of statistical significance should use a 
threshold of $p < 0.0065$ because $z_{0.0065} = 2.4838$ or, equivalently, 
$|z| > 2.4865$. 
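
The two thresholds quoted above can be reproduced numerically. The sketch below, in Python with only the standard library, solves (11.10) for $\gamma$ by bisection in the two-sided case and uses the closed form $\gamma = z_{\alpha} + z_{1-\pi}$ in the one-sided case. The function names are illustrative choices of ours.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def two_sided_power(gamma, alpha=0.05):
    """Power from (11.10) for an effect of gamma standard errors."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(gamma - z_crit) + STD_NORMAL.cdf(-gamma - z_crit)

def solve_gamma(target_power, alpha=0.05, lo=0.0, hi=10.0, tol=1e-10):
    """Bisection for the gamma >= 0 at which two-sided power hits the target."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if two_sided_power(mid, alpha) < target_power:
            lo = mid          # power too low: need a larger effect
        else:
            hi = mid
    return (lo + hi) / 2

gamma_two_sided = solve_gamma(0.80, alpha=0.05)
# One-sided test: power = Phi(gamma - z_alpha), so gamma = z_alpha + z_(1 - power)
gamma_one_sided = STD_NORMAL.inv_cdf(1 - 0.05) + STD_NORMAL.inv_cdf(0.80)
print(round(gamma_two_sided, 4), round(gamma_one_sided, 4))
```

Bisection works here because two-sided power is increasing in $\gamma$ for $\gamma \geq 0$.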


Compute test power for given effect size and test size 


The power onemean command, designed for statistical inference on the mean 
in an i.i.d. sample, can be adapted to the current more general case of 
inference on $\theta$. To do so, we use options knownsd n(1) sd($s_{\hat\theta}$). More 
generally, we can equivalently use options n($w$) sd($\sqrt{w} \times s_{\hat\theta}$) for any 
positive integer $w$; it is simplest to use $w = 1$. Note that options n() and 
sd() require numbers or lists of numbers, not expressions. 


We continue with the example of the preceding section. Then 
$s_{\hat\theta} = 0.0843$, and we perform a two-sided test of $\theta_0 = 2$ and obtain power 
against $\theta_a = 2.2$. We obtain 


. * Power of Wald normal test: 2 versus 2.2, size 0.05, s_thetahat=0.0843 
. power onemean 2 2.2, knownsd alpha(0.05) n(1) sd(0.0843) 


Estimated power for a one-sample mean test 
z test 
H0: m = m0 versus Ha: m != m0 

Study parameters: 

        alpha =    0.0500 
            N =         1 
        delta =    2.3725 
           m0 =    2.0000 
           ma =    2.2000 
           sd =    0.0843 

Estimated power: 

        power =    0.6600 


We obtain asymptotic power of 0.660 as in the previous subsection. 


Because we used options n(1) sd($s_{\hat\theta}$), the variable delta in the output 
measures the effect size in units of standard errors: 
$(\theta_a - \theta_0)/s_{\hat\theta} = (2.2 - 2.0)/0.0843 = 2.3725$. 


Compute minimum effect size for given test size and power 


In this example, the power is still a long way from the desired 1.0. Just as a 
common choice of test size is 0.05, a common choice in the biostatistics 
literature for desired power is 0.80. Combining, we see the probability of a 
type 1 error is then 0.05, while the probability of a type 2 error is at most 
0.20. 


We wish to obtain the minimum effect size that would give power of 
0.80, where effect size is given by $(\theta_a - \theta_0)$, for a test of size 0.05. The 
power onemean command can be used, adding the power() option and 
dropping the ma argument. Continuing the preceding example, we have 


. * Min effect size of normal test: 2 versus ?, size 0.05, power 0.80, s_t=.0843 
. power onemean 2, knownsd alpha(0.05) power(0.80) n(1) sd(0.0843) 
>     table(alpha power m0 ma sd N delta, formats(power "%7.3f")) 

Performing iteration ... 

Estimated target mean for a one-sample mean test 
z test 
H0: m = m0 versus Ha: m != m0; ma > m0 

   alpha   power      m0      ma      sd       N   delta 
     .05   0.800       2   2.236   .0843       1   2.802 


The minimum effect size is 2.802, so we need $\theta_a$ to be at least 2.802 standard 
errors from the null hypothesis value of 2.0, which requires $\theta_a > 2.236$. 


As already noted, the usual test for statistical significance at 5% has the 
less stringent requirement that $\hat\theta$ be at least 1.960 standard errors from 2.0, 
or $\hat\theta > 2.0 + 1.960 \times 0.0843 = 2.165$. 


Compute minimum sample size for desired effect size and given size and power 


Now suppose we want to determine the sample size that achieves a given 
level of test size and power for a desired effect size. 


Unlike the preceding examples, we need to make the additional 
assumption that an $m$ times increase in sample size leads to the standard 
error of $\hat\theta$ being $1/\sqrt{m}$ times as large. This may be reasonable when 
observations are independent; more difficult adaptation to correlated 
observations is given in a later subsection. 


For power and effect size calculations, we could use options n($w$) 
sd($\sqrt{w} \times s_{\hat\theta}$) for any integer $w$; we used $w = 1$ for simplicity. For minimum 
sample-size computations, we need to use $w = N$, where $N$ is the sample 
size. 


In the preceding examples, the standard error of 0.0843 is based on a 
sample size of 150. So we use option sd(1.0325) because 
$\sqrt{N} \times s_{\hat\theta} = \sqrt{150} \times 0.0843 = 1.0325$. We additionally use the table() 
option to obtain more compact output. 


. * Min sample size for normal test: 2 versus 2.2, size 0.05, power 0.80, 
> independent 
. di sqrt(150)*0.0843 
1.0324599 

. power onemean 2 2.2, knownsd alpha(0.05) power(0.80) sd(1.0325) 
>     table(alpha power m0 ma sd N delta, formats(power "%7.3f")) 

Performing iteration ... 

Estimated sample size for a one-sample mean test 
z test 
H0: m = m0 versus Ha: m != m0 

   alpha   power      m0      ma      sd       N   delta 
     .05   0.800       2     2.2   1.032     210   .1937 


We need at least 210 observations. 
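
The sample-size result can be verified by hand from the familiar approximation $N \geq \{(z_{\alpha/2} + z_{1-\pi}) \times \sigma/\delta\}^2$, where $\sigma = \sqrt{N} \times s_{\hat\theta}$ is the single-observation standard deviation. The following Python sketch (standard library only; the function name is ours) applies it. It ignores the negligible probability of rejecting in the wrong tail, and it relies on the same assumption as the text that the standard error shrinks at rate $1/\sqrt{N}$.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size(effect, sd_single_obs, alpha=0.05, power=0.80):
    """Smallest N for a two-sided z test, assuming se = sd_single_obs / sqrt(N)."""
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    return ceil((z_sum * sd_single_obs / effect) ** 2)

sd_single_obs = 150 ** 0.5 * 0.0843     # sqrt(N) times the standard error
n_required = min_sample_size(0.2, sd_single_obs)
print(n_required)
```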


Compute a power curve 


The power command allows some of the arguments to be given as number 
lists that span a range of values. 


For example, suppose we want to plot a power curve of asymptotic 
power against the number of standard errors that the alternative hypothesis 
value is from the null hypothesis value. In that case, we set sd equal to 1 and 
the m0 value to 0. To consider alternative values that are up to 4 standard 
errors from the $H_0$ value, for example, give a range of values of ma from 
-4 to 4, here at intervals of 0.1. The graph() option of the power onemean 
command is used to plot the power curve. 


We obtain 


. * General power curve for N(0,1) test at size = 0.05 
. clear 

. power onemean 0 (-4(0.1)4), knownsd sd(1) n(1) alpha(0.05) 
>     graph(legend(off) scale(1.2) title("") subtitle("") 
>     note("") recast(line) 
>     ylabel(, grid angle(0)) yline(0.05) xlabel(, grid) 
>     xtitle("Ha: Number of standard errors from H0 value") 
>     ytitle("Test power")) 


The first panel of figure 11.3 plots the power curve. As expected, the 
power curve for a test at level 0.05 takes a minimum value of 0.05 at the null 
hypothesis value of 0. Power is only 0.50 at the 5% test critical value of 
1.96. Power is 0.80 when the alternative value is approximately 2.8 standard 
errors from the null hypothesis value. 

Figure 11.3. Power curves at test size 0.05 and at sizes 0.05 and 
0.10 


The second panel of figure 11.3 plots power curves at two different test 
sizes. The test at size $\alpha = 0.10$, given by the dashed line, has uniformly 
higher power than the test at size $\alpha = 0.05$; increasing the probability of a 
type 1 error decreases the probability of a type 2 error. The critical value for 
a test of size $\alpha = 0.10$ is $z_{0.05} = 1.645$, and the power at 1.645 standard 
deviations equals approximately 0.50, as expected. 
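
The points on both power curves follow directly from (11.10). As a rough check on the features just described, the Python sketch below (standard library only; names are illustrative) tabulates power on the same grid of alternatives, from -4 to 4 standard errors in steps of 0.1, for test sizes 0.05 and 0.10.

```python
from statistics import NormalDist

def power_curve(gammas, alpha):
    """Two-sided normal-test power at each effect size (in standard errors)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return [std_normal.cdf(g - z_crit) + std_normal.cdf(-g - z_crit)
            for g in gammas]

gammas = [round(-4 + 0.1 * i, 1) for i in range(81)]   # -4.0, -3.9, ..., 4.0
power_05 = power_curve(gammas, alpha=0.05)
power_10 = power_curve(gammas, alpha=0.10)

# Print a few points: at the null, near the 5% critical value, and at 2.8
for g in (0.0, 2.0, 2.8):
    i = gammas.index(g)
    print(g, round(power_05[i], 3), round(power_10[i], 3))
```

At every grid point the size-0.10 curve lies above the size-0.05 curve, and each curve attains its minimum, equal to the test size, at zero.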


11.8.3 Power in a regression setting using the t distribution 


We now consider the same two-sided Wald test at level $\alpha$ of $H_0: \theta = \theta_0$ for 
general estimator $\hat\theta \overset{a}{\sim} N(\theta, s_{\hat\theta}^2)$ but now use critical values from the Student's 
$t$ distribution with $v$ degrees of freedom, where typically $v = N - K$ when 
observations are independent and $v = G - 1$ when observations are 
clustered with $G$ clusters. 


We compute the power of this test against $H_a: \theta = \theta_a$. The power is the 
probability that $|(\hat\theta - \theta_0)/s_{\hat\theta}| > t_{v,\alpha/2}$ when the distribution of $\hat\theta$ is centered 
on $\theta_a$ rather than on $\theta_0$. Translating the centering of a normally distributed 
random variable leads again to a normally distributed variable. By contrast, 
translating the centering of a $t$-distributed random variable leads to a random 
variable with the noncentral $t$ distribution. The noncentrality parameter is the 
amount of the translation, here $\gamma = (\theta_a - \theta_0)/s_{\hat\theta}$. Then, 


$$\pi = 1 - T_{v,\gamma}(t_{v,\alpha/2}) + T_{v,\gamma}(-t_{v,\alpha/2}), \quad \gamma = (\theta_a - \theta_0)/s_{\hat\theta} \qquad (11.11)$$


where $T_{v,\gamma}$ is the c.d.f. of the noncentral $t$ distribution with $v$ degrees of 
freedom and noncentrality parameter $\gamma$, and by $t_{v,\alpha/2}$, we mean the area in 
the right tail of the central $t(v)$ distribution is $\alpha/2$. 


Compute test power for given effect size and test size 


The power onemean command, designed for inference on the mean using the 
$t(n - 1)$ distribution, is easily adapted to this more general setting for power 
and effect size calculations with constant sample size. We use options 
n($v + 1$) sd($\sqrt{v + 1} \times s_{\hat\theta}$). 


For example, suppose our current example came from clustered data 
with 12 clusters, so we base inference on the $t(11)$ distribution. Then, we 
use options n(12) and sd(0.29202) because $v + 1 = 12$ and 
$\sqrt{12} \times 0.0843 = 0.29202$. We obtain 


. * Power of Wald t(11) test: 2 versus 2.2, size 0.05, s_thetahat=0.0843 
. di sqrt(12)*0.0843 
.29202377 

. power onemean 2 2.2, alpha(0.05) n(12) sd(0.29202) 
>     table(alpha power m0 ma sd N delta, formats(power "%7.3f")) 

Estimated power for a one-sample mean test 
t test 
H0: m = m0 versus Ha: m != m0 

   alpha   power      m0      ma      sd       N   delta 
     .05   0.580       2     2.2    .292      12   .6849 


The power for the $t$ test is 0.580, lower than 0.660 for the standard normal 
test. Such differences become greater as the degrees of freedom fall. Note 
that the result delta no longer measures the effect size in units of standard 
errors. This becomes especially relevant in clustered settings considered 
below. 


Compute minimum effect size for given test power and size 


We next obtain the minimum effect size that would give power of 0.80, 
where effect size is given by $(\theta_a - \theta_0)$, for a test of size 0.05. Continuing the 
preceding $t(11)$ example, we again use options n(12) and sd(0.29202) 
because $v + 1 = 12$ and $\sqrt{12} \times 0.0843 = 0.29202$. We obtain 


. * Min effect size of t(11) test: 2 versus ?, size 0.05, power 0.80, s_t=.29202 
. power onemean 2, alpha(0.05) power(0.80) n(12) sd(0.29202) 
>     table(alpha power m0 ma sd N delta, formats(power "%7.3f")) 

Performing iteration ... 

Estimated target mean for a one-sample mean test 
t test 
H0: m = m0 versus Ha: m != m0; ma > m0 

   alpha   power      m0      ma      sd       N   delta 
     .05   0.800       2    2.26    .292      12   .8887 


The minimum value of $\theta_a$ is 2.260, so the minimum effect size is 
$(\theta_a - \theta_0) = (2.260 - 2.0) = 0.260$, compared with 0.236 using the standard 
normal. Using the finite sample adjustment, we require stronger evidence 
than if using asymptotic results. Equivalently, the reported delta equals the 
minimum effect size divided by the sd() value of $\sqrt{v + 1} \times s_{\hat\theta} = 0.29202$, 
so the minimum effect size equals $0.8887 \times 0.29202 = 0.260$. We need 
$\theta_a$ to be at least $0.260/0.0843 = 3.08$ 
standard errors from the null hypothesis value of 2.0. 


Compute minimum sample size for desired effect size and given size and power 


When $t$ tests are used, extending the power onemean command to regression 
estimate $\hat\theta$ is complicated. 


The simplest approach may be to instead directly compute the power for 
varying sample sizes and associated values of $s_{\hat\theta}$ and degrees of freedom. 


Thus, consider the example with $\alpha = 0.05$, effect size $\delta = 0.2$, 
$s_{\hat\theta} = 0.0843$, and inference based on the $t(11)$ distribution. Then, using 
(11.11), we see the test power is 


. * Power for t(11) with specified test size and effect size 
. scalar df = 11 
. scalar se = 0.0843 
. scalar effectsize = 0.2 
. scalar ncp = effectsize/se 
. scalar size = 0.05 
. scalar tcrit = invttail(df, size/2) 
. display "power = " 1 - nt(df, ncp, tcrit) + nt(df, ncp, -tcrit) 
power = .5804467 


The power is 0.580, as was obtained previously applying the power onemean 
command to the same example. 


Now suppose we double the sample size and assume that this leads to 
$\mathrm{Var}(\hat\theta)$ halving, so the initial $s_{\hat\theta}$ is divided by $\sqrt{2}$. And suppose that the 
degrees of freedom increase to 23, which would be the case if the number of 
clusters increased from 12 to 24, using $G - 1$ degrees of freedom. 


. * Power with df increase from 11 to 23 (for example, double the number of 
> clusters) 
. scalar df = 23 
. scalar se = 0.0843/sqrt(2) 
. scalar effectsize = 0.2 
. scalar ncp = effectsize/se 
. scalar size = 0.05 
. scalar tcrit = invttail(df, size/2) 
. display "power = " 1 - nt(df, ncp, tcrit) + nt(df, ncp, -tcrit) 
power = .89475003 


The power has increased to 0.895. With trial and error, we could find the 
sample size so that power equals 0.80. 


This example is of most importance in the clustered case, where degrees 
of freedom are often small. 
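
The trial-and-error search can be automated. The Python sketch below is one way to do so with only the standard library; it is an illustration under stated assumptions, not the method used by Stata's nt() function. It evaluates the noncentral $t$ c.d.f. by numerical integration, using the representation $T = (Z + \gamma)/\sqrt{V/v}$ with $V \sim \chi^2(v)$, obtains the critical value by bisection, and then adds clusters until (11.11) gives power of at least 0.80. The search assumes, as in the text, 12 initial clusters, that the standard error falls in proportion to $1/\sqrt{G}$, and that degrees of freedom are $G - 1$. The integration routine assumes $v > 2$.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def nct_cdf(t, df, ncp, steps=2000):
    """Noncentral t c.d.f. by Simpson's rule (assumes df > 2).

    T = (Z + ncp) / sqrt(V / df) with V ~ chi2(df), so
    Pr(T <= t) = E[ Phi(t * sqrt(V / df) - ncp) ].
    """
    upper = df + 12 * sqrt(2 * df) + 40     # chi2(df) mass beyond this is negligible
    h = upper / steps
    log_norm = (df / 2) * log(2) + lgamma(df / 2)
    total = 0.0
    for i in range(1, steps + 1):           # integrand is 0 at v = 0 when df > 2
        v = i * h
        density = exp((df / 2 - 1) * log(v) - v / 2 - log_norm)
        weight = 1 if i == steps else (4 if i % 2 else 2)
        total += weight * STD_NORMAL.cdf(t * sqrt(v / df) - ncp) * density
    return total * h / 3

def t_critical(df, alpha):
    """Two-sided critical value: Pr(T > t) = alpha / 2 for central t(df)."""
    lo, hi = 0.0, 100.0
    while hi - lo > 1e-8:
        mid = (lo + hi) / 2
        if 1 - nct_cdf(mid, df, 0.0) > alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_test_power(effect, se, df, alpha=0.05):
    """Two-sided power from (11.11)."""
    ncp = effect / se
    tcrit = t_critical(df, alpha)
    return 1 - nct_cdf(tcrit, df, ncp) + nct_cdf(-tcrit, df, ncp)

power_df11 = t_test_power(0.2, 0.0843, 11)
power_df23 = t_test_power(0.2, 0.0843 / sqrt(2), 23)
print(round(power_df11, 3), round(power_df23, 3))

# Add clusters one at a time until power reaches 0.80
base_clusters = 12
clusters = base_clusters
while t_test_power(0.2, 0.0843 * sqrt(base_clusters / clusters), clusters - 1) < 0.80:
    clusters += 1
print(clusters)
```

The first two results should match the Stata values of 0.580 and 0.895 up to numerical integration error.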


11.8.4 Power for clustered data 


The power onemean, cluster command permits clustered data, assuming 
equicorrelation within cluster. The rho () option specifies the intracluster 
correlation, the k() option specifies the number of clusters, the m() option 
specifies the cluster size, and the cvcluster() option specifies the 
coefficient of variation for cluster sizes should clusters not be equal sized. 


These options are not suited to the current focus of extension to a slope 
coefficient in multiple regression, so they are not presented here. The power 
twomeans command with clustering options is presented in section 24.3 for 
a randomized control trial with a binary treatment. 


Instead, we continue with the preceding approach, using the $t$ 
distribution rather than the normal distribution to better approximate 
distributions when there are few clusters. 


With clustered data, especially with few clusters, one should use the 
$t(G - 1)$ distribution, where $G$ is the number of clusters, rather than the 
standard normal distribution. 


Compute test power and compute effect size 


The preceding power and minimum effect size analysis for the $t$ distribution 
carries through, with $v = G - 1$ degrees of freedom. 


Compute the minimum number of clusters 


Suppose we wish to compute the minimum number of clusters necessary to 
achieve a desired test size, power, and effect size, based on regression on an 
existing dataset with $G_1$ clusters and standard error $s_{\hat\theta}$. If we 
assume that additional clusters are of similar size and informational content 
to the existing clusters, then adding $G_2$ clusters leads to a new standard error 
equal to $s_{\hat\theta}/\sqrt{(G_1 + G_2)/G_1}$, and degrees of freedom increase to 
$G_1 + G_2 - 1$. 


The power can be computed as in section 11.8.3, and a desired power 
attained by trial-and-error variation in $G_2$. 


Compute sample size with fixed number of clusters 


More challenging is the common situation where the number of clusters is 
given and we wish to find how many observations to include in each cluster. 
The preceding analysis no longer applies because increasing the number of 
observations in each cluster leads to smaller efficiency gains than usual 
because of within-cluster correlation. 


For OLS, one way to proceed is to use the variance inflation factor 
introduced in section 3.4.6. Using this, increasing the average number of 
observations per cluster from $M_1$ to $M_2$ leaves the degrees of freedom 
unchanged while the standard error decreases by the multiple 
$\sqrt{1 + \rho_x\rho_u(M_1 - 1)}/\sqrt{1 + \rho_x\rho_u(M_2 - 1)}$. Again the power can be 
computed as in section 11.8.3, and a desired power attained by trial-and-error 
variation in $M_2$. 


11.8.5 Confidence intervals with desired precision 


The ciwidth command is the confidence-interval analog of the power 
command. 


As in section 11.8.2, consider an asymptotically normally distributed 
estimator $\hat\theta$ with standard error of 0.0843 based on a sample size of 150, and 
assume that observations are independent. We use the ciwidth onemean 
command to obtain the width of a confidence interval given a specified 
confidence level and sample size and to obtain the minimum sample size 
needed to obtain a confidence interval of specified width and confidence 
level. To do so, we use option sd(1.0325) because the contribution of a 
single observation is $\sqrt{N} \times s_{\hat\theta} = \sqrt{150} \times 0.0843 = 1.0325$. 


For a sample size of 300, a 95% confidence interval will have width 
0.2337, where the width is the distance between the lower and upper values 
of the confidence interval. 


. * Confidence interval width given estimator precision and sample size 
. ciwidth onemean, knownsd level(95) sd(1.0325) n(300) 
>     table(level N sd width, formats(level "%7.3f")) 

Estimated width for a one-mean CI 
Normal two-sided CI 

   level       N      sd   width 
  95.000     300   1.032   .2337 


And to obtain a 95% confidence interval of width 0.2, we need at least 
410 observations. 


. * Minimum sample size for confidence interval of given width 
. ciwidth onemean, knownsd level(95) sd(1.0325) width(0.2) 
>     table(level N sd width, formats(level "%7.3f")) 

Estimated sample size for a one-mean CI 
Normal two-sided CI 

   level       N      sd   width 
  95.000     410   1.032      .2 


These calculations are based on normal distribution critical values and 
implicitly assume that the estimator precision is exactly estimated. When $t$ 
distribution critical values are used, the ciwidth onemean command requires 
use of the probwidth() option to additionally allow for the randomness in 
the estimator precision. 
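
In the known-variance normal case, both ciwidth results follow from the formula width $= 2 z_{\alpha/2} \times \sigma/\sqrt{N}$, where $\sigma = 1.0325$ is the single-observation standard deviation used above. The Python sketch below (standard library only; function names are ours) reproduces the two calculations under that assumption. It does not cover the $t$ distribution case that requires probwidth().

```python
from math import ceil, sqrt
from statistics import NormalDist

def ci_width(sd_single_obs, n, level=95):
    """Width of a two-sided normal CI when se = sd_single_obs / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 200)
    return 2 * z * sd_single_obs / sqrt(n)

def min_n_for_width(sd_single_obs, width, level=95):
    """Smallest n whose CI is no wider than the requested width."""
    z = NormalDist().inv_cdf(0.5 + level / 200)
    return ceil((2 * z * sd_single_obs / width) ** 2)

width_n300 = ci_width(1.0325, 300)
n_for_width = min_n_for_width(1.0325, 0.2)
print(round(width_n300, 4), n_for_width)
```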


11.9 Specification tests 


The Wald, LR, and LM tests are often used for specification testing, 
particularly of inclusion or exclusion of regressors. In this section, we 
consider other specification testing methods that differ in that they do not 
work by directly testing restrictions on parameters. Instead, they test 
whether moment restrictions implied by the model, or other model 
properties, are satisfied. It is helpful to first read sections 13.3—13.4. 


11.9.1 Moment-based tests 


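A moment-based test, or m test, is a test of moment conditions imposed by a 
model but not used in estimation. Specifically, 


$$H_0: E\{\mathbf{m}(y_i, \mathbf{x}_i, \boldsymbol\theta)\} = \mathbf{0} \qquad (11.12)$$


where $\mathbf{m}(\cdot)$ is an $h \times 1$ vector. Several examples follow below. The test 
statistic is based on whether the sample analogue of this condition is 
satisfied, that is, whether $\hat{\mathbf{m}}(\hat{\boldsymbol\theta}) = \sum_{i=1}^{N} \mathbf{m}(y_i, \mathbf{x}_i, \hat{\boldsymbol\theta}) = \mathbf{0}$. This statistic is 
asymptotically normal because $\hat{\boldsymbol\theta}$ is, and taking the quadratic form, we 
obtain a chi-squared statistic. The m-test statistic is then 


$$M = \hat{\mathbf{m}}(\hat{\boldsymbol\theta})' \left[\hat V\{\hat{\mathbf{m}}(\hat{\boldsymbol\theta})\}\right]^{-1} \hat{\mathbf{m}}(\hat{\boldsymbol\theta}) \overset{a}{\sim} \chi^2(h) \text{ under } H_0$$


As usual, we reject at the $\alpha$ level if $p = \Pr\{\chi^2(h) > M\} < \alpha$. 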


Obtaining $\hat V\{\hat{\mathbf{m}}(\hat{\boldsymbol\theta})\}$ can be difficult. Often, this test is used after ML 
estimation because likelihood-based models impose many conditions that 
can be used as the basis for an m test. Then the auxiliary regression for the 
LM test (see section 11.5.3) can be generalized. We compute $M$ as $N$ times 
the uncentered $R^2$ from the auxiliary regression of 1 on $\mathbf{m}(y_i, \mathbf{x}_i, \hat{\boldsymbol\theta})$ and 
$\mathbf{s}_i(\hat{\boldsymbol\theta})$, where $\mathbf{s}_i(\boldsymbol\theta) = \partial \ln f(y_i|\mathbf{x}_i, \boldsymbol\theta)/\partial\boldsymbol\theta$. In finite samples, the test statistic 
has a size that can differ significantly from the nominal size, but this can be 
rectified by using a bootstrap with asymptotic refinement. 


An example of this auxiliary regression, used to test moment conditions 
implied by the tobit model, is given in section 19.4. 


11.9.2 Information matrix test 


For a fully parametric model, the expected value of the outer product of the 
first derivatives of $\ln L(\boldsymbol\theta)$ equals the negative expected value of the second 
derivatives. This property, called the IM equality, enables the variance 
matrix of the MLE to simplify from the general sandwich form $\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}$ 
to the simpler form $-\mathbf{A}^{-1}$; see section 13.4.4. 


The IM test is a test of whether the IM equality holds. It is a special case 
of (11.12) with $\mathbf{m}(y_i, \mathbf{x}_i, \boldsymbol\theta)$ equal to the unique elements in 
$\mathbf{s}_i(\boldsymbol\theta)\mathbf{s}_i(\boldsymbol\theta)' + \partial\mathbf{s}_i(\boldsymbol\theta)/\partial\boldsymbol\theta'$. For the linear model under normality, the IM test 
is performed by using the estat imtest command after regress; see 
section 3.7 for an example. 


11.9.3 Chi-squared goodness-of-fit test 


A simple test of goodness of fit is the following. Discrete variable y takes 
on the values 1, 2, 3, 4, and 5, and we compare the fraction of sample 
values y that take on each value with the corresponding predicted 
probability from a fitted parametric regression model. The idea extends 
easily to partitioning on the basis of regressors as well as y and to 
continuous $y$, where we replace a discrete value with a range of 
values. 


Stata implements a goodness-of-fit test by using the estat gof 
command following logit, logistic, probit, and poisson. An example of 
estat gof following logit regression is given in section 17.5. A weakness 
of this command is that it treats estimated coefficients as known, ignoring 
estimation error. The goodness-of-fit test can instead be set up as an m test 
that provides additional control for estimation error; see Andrews (1988) 
and Cameron and Trivedi (2005, 266-271). For count data, the 
community-contributed chi2gof command of Manjón and Martinez (2014) performs 
the correct chi-squared goodness-of-fit test. 


11.9.4 Overidentifying restrictions test 


In the generalized method of moments (GMM) estimation framework of 
section 16.8, moment conditions $E\{\mathbf{h}(y_i, \mathbf{x}_i, \boldsymbol\theta)\} = \mathbf{0}$ are used as the basis 
for estimation. In a just-identified model, the GMM estimator solves the 
sample analog $\sum_{i=1}^{N} \mathbf{h}(y_i, \mathbf{x}_i, \hat{\boldsymbol\theta}) = \mathbf{0}$. In an overidentified model, these 
conditions no longer hold exactly, and an overidentifying restrictions test is 
based on the closeness of $\sum_{i=1}^{N} \mathbf{h}(y_i, \mathbf{x}_i, \hat{\boldsymbol\theta})$ to $\mathbf{0}$, where $\hat{\boldsymbol\theta}$ is the optimal 
GMM estimator. The test is chi-squared distributed with degrees of freedom 
equal to the number of overidentifying restrictions. 


This test is most often used in overidentified IV models, though it can be 
applied to any overidentified model. It is performed in Stata with the estat 
overid command after ivregress gmm; see section 7.4.8 for an example. 


11.9.5 Hausman test 


The Hausman test compares two estimators where one is consistent under 
both $H_0$ and $H_a$ while the other is consistent under $H_0$ only. If the two 
estimators are dissimilar, then $H_0$ is rejected. An example is to test whether 
a single regressor is endogenous by comparing two-stage least-squares and 
OLS estimates. 


We want to test $H_0$: $\mathrm{plim}(\hat{\boldsymbol\theta} - \tilde{\boldsymbol\theta}) = \mathbf{0}$. Under standard assumptions, 
each estimator is asymptotically normal and so is their difference. Taking 
the usual quadratic form, we obtain 


$$H = (\hat{\boldsymbol\theta} - \tilde{\boldsymbol\theta})' \{\hat V(\hat{\boldsymbol\theta} - \tilde{\boldsymbol\theta})\}^{-1} (\hat{\boldsymbol\theta} - \tilde{\boldsymbol\theta}) \overset{a}{\sim} \chi^2(h) \text{ under } H_0$$


The hausman command, available after many estimation commands, 
implements this test under the strong assumption that $\tilde{\boldsymbol\theta}$ is a fully efficient 
estimator. Then it can be shown that $V(\hat{\boldsymbol\theta} - \tilde{\boldsymbol\theta}) = V(\hat{\boldsymbol\theta}) - V(\tilde{\boldsymbol\theta})$. In some 
common settings, the Hausman test can be more simply performed with a 
test of the significance of a subset of variables in an auxiliary regression. 
Both variants are demonstrated in section 8.8.5. 


The standard microeconometrics approach of using robust estimates of 
the vce implicitly presumes that estimators are not efficient. Then the 
preceding test is incorrect. One solution is to use a bootstrapped version of 
the Hausman test; see section 12.4.6. A second approach is to test the 
statistical significance in the appropriate auxiliary regression by using 
robust standard errors; see sections 7.4.6 and 8.8.5 for examples. 


11.9.6 Other tests 


The preceding discussion only scratches the surface of specification testing. 
Many model-specific tests are given in model-specific reference books such 
as Baltagi (2021) for panel data, Hosmer, Lemeshow, and Sturdivant (2013) 
for binary data, and Cameron and Trivedi (2013) for count data. Some of 
these tests are given in estimation command output or through 
postestimation commands, usually as an estat command, but many are not. 


11.10 Permutation tests and randomization tests 


A permutation test provides a test of exact size by the following method. We 
consider settings where the data are exchangeable under the null hypothesis. 
For example, in a two-sample difference-in-means test, we assume that the 
two samples are from the same distribution under the null hypothesis of no 
difference in means. 


First, compute the test statistic of interest using the original sample. 
Second, recompute this test statistic for every permutation of the data. Third, 
calculate the p-value as the fraction of times the permuted test statistics 
equal or exceed the value of the original sample test statistic. The test is a 
nonparametric test that does not require distributional assumptions, aside 
from the strong assumption that the data are exchangeable under the null 
hypothesis. Fisher proposed this exact test for contingency tables. It is well 
suited to randomized control trials. 


In practice, with a reasonable number of observations, there are many 
permutations of the original data. A randomized permutation test randomly 
selects some of these permutations. 
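
The three steps, combined with random selection of permutations, can be sketched in a few lines. The following Python example uses only the standard library and applies the idea to a difference in means. The data, function names, and number of replications are hypothetical choices for illustration. The two-sided p-value here is the fraction of permuted statistics at least as large in absolute value as the original statistic, which is one common convention and not necessarily the one a given software package reports.

```python
import random

def mean_diff(y, d):
    """Difference in mean outcome between the d == 1 and d == 0 groups."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def permutation_pvalue(y, d, reps=1000, seed=10101):
    """Two-sided Monte Carlo permutation p-value for the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean_diff(y, d))       # step 1: statistic on the original sample
    shuffled = list(y)
    count = 0
    for _ in range(reps):
        rng.shuffle(shuffled)             # step 2: reallocate outcomes, keep d fixed
        if abs(mean_diff(shuffled, d)) >= observed:
            count += 1
    return count / reps                   # step 3: fraction at least as extreme

# Hypothetical data: 10 control and 10 treated outcomes with a large difference
d = [0] * 10 + [1] * 10
y = [0.1 * i for i in range(10)] + [5 + 0.1 * i for i in range(10)]
p_value = permutation_pvalue(y, d)
print(p_value)
```

With groups this well separated, almost no random reallocation produces a difference as large as the original one, so the p-value is close to zero.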


As an example, consider testing the difference in two means, which can 
be implemented by OLS regression of the outcome on an intercept and an 
indicator variable. The following example generates 100 observations, of 
which 47 have d = 1, with a weak relationship between y and d. 


. * Permutation test example - DGP and usual test 
. clear 


. set obs 100 
Number of observations (_N) was 0, now 100. 


. set seed 10101 
. gen d = runiform() > 0.5 
. gen y = rnormal(1,1) + 0.2*d 


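. sum y d 

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           y |        100    1.166272     .935289  -1.497009   2.873763
           d |        100         .47    .5016136          0          1


. regress y d, noheader 

           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           d |   .2217631   .1870122     1.19   0.239    -.1493565    .5928827
       _cons |   1.062043   .1282091     8.28   0.000     .8076161     1.31647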


The usual t test for statistical significance of d has p = 0.239. 


Now we consider a permutation test where the 100 observations on y are 
randomly allocated (without replacement) to the unchanging 100 
observations on d. In this example, this process is done 1,000 times, leading 
to 1,000 t statistics for the test of $H_0\colon \beta_d = 0$. 


. * Permutation test 
. capture program drop pols 

. program pols, rclass 
      version 17 
      args y d 
      regress `y' `d' 
      return scalar beta = _b[d] 
      return scalar t = _b[d]/_se[d] 
  end 

. permute y t=r(t) beta=r(beta), nowarn nodots seed(10101) reps(1000): pols y d 
Monte Carlo permutation results              Number of observations =    100
Permutation variable: y                      Number of permutations =  1,000

      Command: pols y d
            t: r(t)
         beta: r(beta)

                                                     Monte Carlo error
           T |     T(obs)       Test     c     n       p    SE(p)    [95% CI(p)]
-------------+------------------------------------------------------------------
           t |   1.185822      lower   879  1000   .8790   .0103   .8572   .8986
             |                 upper   121  1000   .1210   .0103   .1014   .1428
             |             two-sided               .2420   .0135   .2155   .2685
-------------+------------------------------------------------------------------
        beta |   .2217631      lower   879  1000   .8790   .0103   .8572   .8986
             |                 upper   121  1000   .1210   .0103   .1014   .1428
             |             two-sided               .2420   .0135   .2155   .2685

Notes: For lower one-sided test, c = #{T <= T(obs)} and p = p_lower = c/n.
       For upper one-sided test, c = #{T >= T(obs)} and p = p_upper = c/n.
       For two-sided test, p = 2*min(p_lower, p_upper); SE and CI approximate.


The permutation test has a two-sided p-value of 0.242, close to the 0.239 
obtained from the original sample regression. 


Imbens and Rubin (2015, chap. 5) provide a detailed presentation of 
Fisher’s exact tests. Young (2019) contrasts randomization tests with 
conventional tests for many published randomized trials, and includes the 
typical complication of clustered trials. 


Extending permutation tests from regression with a single regressor to 
multiple regression is difficult and is a topic of current research. It is 
possible if the regressor of interest is uncorrelated with other regressors, as is 
the case if the regressor is a randomly assigned treatment. 


Randomization tests provide a general method to control test size in 
finite samples, under the assumption that the distribution of the data is 
invariant under a group of transformations under the null hypothesis, a 
property called symmetry. Canay, Romano, and Shaikh (2017) extend this 
theory to settings where the limiting distribution of a function of the data is 
symmetric under the null hypothesis. Cai et al. (Forthcoming) detail 
implementation. This enables finite-sample inference with few clusters and 
many observations per cluster. 


11.11 Additional resources 


The Stata documentation [FN] Statistical functions, help density 
functions, or help functions describes the functions to compute p-values 
and critical values for various distributions. For testing, see the relevant 
entries for the commands discussed in this chapter: [R] test, [R] testnl, 
[R] lincom, [R] nlcom, [R] lrtest, [R] hausman, [R] regress postestimation 
(for estat imtest), and [R] estat. For power analysis, see [PSS] Stata 
Power, Precision, and Sample-Size Reference Manual. 


Much of the material in this chapter is covered in Cameron and 
Trivedi (2005, chap. 7 and 8) and Hansen (2022a, chap. 9) and various 
chapters of Greene (2018) and Wooldridge (2010). Multiple testing is a 
more recent topic. 


11.12 Exercises 


1. For a $\chi^2(h)$ random variable, the density is given by 
$f(y) = \{y^{(h/2)-1}\exp(-y/2)\}/\{2^{h/2}\,\Gamma(h/2)\}$, where 
$\Gamma(\cdot)$ is the gamma function and $\Gamma(h/2)$ can be obtained as 
exp(lngamma(h/2)). Plot this density for $h = 5$ and $y < 25$. 

2. Use Stata commands to find the appropriate p-values for t(100), 
F(1,100), Z, and $\chi^2(1)$ distributions at y = 2.5. For the same 
distributions, find the critical values for tests at the 0.01 level. 

3. Consider the Poisson example in section 11.3.3, with a robust estimate 
of the VCE. Use the test or testnl commands to test the following 
hypotheses: 1) $H_0\colon \beta_{\text{female}} - 100 \times 
\beta_{\text{income}} = 0.5$; 2) $H_0\colon \beta_{\text{female}} = 0$; 
3) test the previous two hypotheses jointly with the mtest option; 
4) $H_0\colon \beta_{\text{female}}^2 = 0$; and 
5) $H_0\colon \beta_{\text{female}}\beta_{\text{income}} = 1$. Are you 
surprised that the second and fourth tests lead to different Wald test 
statistics? 

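4. Consider the test of 
$H_0\colon \beta_{\text{female}}/\beta_{\text{private}} - 1 = 0$, given in 
section 11.3.6. It can be shown that, given that $\hat\beta$ has the entries 
$\hat\beta_{\text{private}}$, $\hat\beta_{\text{chronic}}$, 
$\hat\beta_{\text{female}}$, $\hat\beta_{\text{income}}$, and 
$\hat\beta_{\text{cons}}$, then R defined in (11.5) is given by 


$$R = \begin{bmatrix} -\hat\beta_{\text{female}}/\hat\beta_{\text{private}}^2 & 0 & 1/\hat\beta_{\text{private}} & 0 & 0 \end{bmatrix}$$


Manually calculate the Wald test statistic defined in (11.5) by adapting 
the code at the end of section 11.3.3. 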


5. The claim is made that the effect of private insurance on doctor visits 
is less than that of having a chronic condition, that is, that 
$\beta_{\text{private}} - \beta_{\text{chronic}} < 0$. Test this claim at 
the 0.05 level. Obtain 95% and 99% confidence intervals for 
$\beta_{\text{private}} - \beta_{\text{chronic}}$. 

6. Consider the negative binomial example in section 11.4.1, where we 
test $H_0\colon \alpha = 0$. Use the output from the nbreg command to 
compute the Wald test statistic, and compare this with the LR test statistic 
given in the output. Next, calculate this Wald test statistic using the 
testnl command as follows: Fit the model by using nbreg, and then type 
estat vce. You will see that the estimate lnalpha is saved in the 
equation lnalpha with the name _cons. We want to test $\alpha = 0$, in 
which case $\exp(\ln\alpha) = 0$. Give the command 
testnl exp([lnalpha]_cons)=0. Compare the results with your earlier 
results. 

7. Consider the Poisson example of section 11.3.3. The parametric 
Poisson model imposes the restriction that 
$\mathrm{Var}(y|x) = \exp(x'\beta)$. An alternative model is that 
$\mathrm{Var}(y|x) = \exp(x'\beta) + \alpha \times \exp(x'\beta)^2$. A 
test of overdispersion is a test of $H_0\colon \alpha = 0$ against 
$H_a\colon \alpha > 0$. The LM test statistic can be computed as the t test 
that $\alpha = 0$ in the auxiliary OLS regression of 
$\{(y_i - \hat\mu_i)^2 - \hat\mu_i\}/\hat\mu_i$ on $\hat\mu_i$ (no 
intercept), where $\hat\mu_i = \exp(x_i'\hat\beta)$. Perform this test. 

8. Consider the DGP $y = 0 + \beta_2 x + \varepsilon$, $x \sim N(0,1)$, 
$N = 36$. For this DGP, what do you expect $\mathrm{Var}(\hat\beta_2)$ to 
equal? Consider a test of $H_0\colon \beta_2 = 0$ at the 0.05 level. By 
simulation, find the size of the test and the power when $\beta_2 = 0.25$. 

9. Consider the same DGP as in the previous question, but adapt it to a 
probit model by defining $y^* = 0 + \beta_2 x + \varepsilon$, and define 
$y = 1$ if $y^* > 0$ and $y = 0$ otherwise. Consider a test of 
$H_0\colon \beta_2 = 0$ at the 0.05 level. By simulation, find the size of 
the test and the power when $\beta_2 = 0.25$. 


Chapter 12 
Bootstrap methods 


12.1 Introduction 


Bootstrap methods enable statistical inference by resampling the data. The 
most common use of the bootstrap in microeconometrics applications is to 
provide standard error estimates when analytical expressions are quite 
complicated. These standard errors are then used to form confidence 
intervals and test statistics. 


Additionally, a more complicated bootstrap with asymptotic refinement 
can provide tests with an actual size closer to the nominal size and 
confidence intervals with an actual coverage rate closer to the nominal 
coverage rate, compared with the standard inferential methods presented in 
the preceding chapter. The leading microeconometrics application of a 
bootstrap with asymptotic refinement is the wild cluster bootstrap. 


12.2 Bootstrap methods 


A bootstrap provides a way to perform statistical inference by resampling 
from the sample. The statistic being studied is usually a standard error, a 
confidence interval, or a test statistic. 


12.2.1 Bootstrap estimate of standard error 


As a leading example, consider calculating the standard error of an 
estimator $\hat\theta$ when this is difficult to do using conventional 
methods. Suppose 400 random samples from the population were available. Then 
we could obtain 400 different estimates of $\theta$ and let the standard 
error of $\hat\theta$ be the standard deviation of these 400 estimates. 


In practice, however, only one sample from the population is available. 
The bootstrap generates multiple samples by resampling from the current 
sample. Essentially, the observed sample is viewed as the population, and 
the bootstrap is a method to obtain multiple samples from this population. 
Given 400 bootstrap resamples, we obtain 400 estimates and then estimate 
the standard error of $\hat\theta$ by the standard deviation of these 400 
estimates. 


Let $\hat\theta_1^*, \ldots, \hat\theta_B^*$ denote the estimates, where 
here $B = 400$. Then the bootstrap estimate of the variance of $\hat\theta$ 
is 


$$\hat{V}_{\text{Boot}}(\hat\theta) = \frac{1}{B-1} \sum_{b=1}^{B} \left(\hat\theta_b^* - \bar\theta^*\right)^2 \tag{12.1}$$


where $\bar\theta^* = (1/B)\sum_{b=1}^{B} \hat\theta_b^*$ is the average of 
the B bootstrap estimates. 


The square root of $\hat{V}_{\text{Boot}}(\hat\theta)$, denoted by 
$\mathrm{se}_{\text{Boot}}(\hat\theta)$, is called the bootstrap estimate of 
the standard error of $\hat\theta$. Some authors more correctly call this 
the bootstrap standard error of $\hat\theta$, or the bootstrap estimate of 
the standard deviation of $\hat\theta$, because the term "standard error" 
means estimated standard deviation. 
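
As a rough illustration of (12.1), the following Python sketch resamples a single sample with replacement and takes the standard deviation of the resulting estimates. It is a sketch under our own assumptions rather than Stata's implementation: the function name bootstrap_se, the simulated exponential data, and the choice of the median as the estimator are hypothetical.

```python
import random
from statistics import median, stdev


def bootstrap_se(data, estimator, reps=400, seed=10101):
    """Bootstrap standard error: the standard deviation of the estimator
    across reps resamples drawn with replacement from data."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(reps):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    # stdev() divides by (reps - 1), the usual sample standard deviation
    return stdev(estimates)


# Hypothetical sample of 50 skewed observations
rng = random.Random(2)
sample = [rng.expovariate(1.0) for _ in range(50)]
se_median = bootstrap_se(sample, median)
print(se_median)
```

The median is used here only because its analytical standard error is less convenient than that of the mean; any function of the resample could be passed as the estimator.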


12.2.2 Bootstrap methods 


The plural form, bootstrap methods, is used because there is no one single 
bootstrap. As already noted, the bootstrap can be used to obtain the 
distribution of many different statistics. There are several different ways to 
obtain bootstrap resamples. Even for a given statistic and bootstrap 
resampling method, there are different ways to proceed. 


The simplest bootstraps implement standard asymptotic methods. The 
most common use of the bootstrap in microeconometrics is standard error 
estimation. More complicated bootstraps implement more refined 
asymptotics. 


12.2.3 Asymptotic refinement 


Consider a statistic such as a Wald test statistic of a single restriction. 
Asymptotic methods are used to obtain an approximation to the cumulative 
distribution function of this statistic. For a statistic with a limiting normal 
distribution based on conventional first-order root-N asymptotics, the 
approximation error behaves in the limit as a multiple of $N^{-1/2}$ (so the 
error disappears as $N \to \infty$). For example, a one-sided test with the 
nominal size of 0.05 will have the true size $0.05 + O(N^{-1/2})$, where 
$O(N^{-1/2})$ behaves as a constant divided by $\sqrt{N}$. 


Asymptotic methods with refinement have an approximation error that 
disappears at a faster rate. In particular, bootstraps with asymptotic 
refinement can implement second-order asymptotics that yield an 
approximation error that behaves as a multiple of $N^{-1}$. So a one-sided 
test with a nominal size of 0.05 now has a true size of $0.05 + O(N^{-1})$. 
This improvement is only asymptotic and is not guaranteed to exist in small 
samples. But simulation studies usually find that the improvement carries 
over to small samples. 


We present bias-corrected (BC) confidence intervals that provide 
asymptotic refinement in sections 12.3.6-12.3.8. Hypothesis tests and 
confidence intervals based on the percentile-t method that provide 
asymptotic refinement are presented in section 12.5, and specialization to 
the wild cluster bootstrap is presented in section 12.6. 


12.2.4 Use the bootstrap with caution 


Caution is needed in applying the bootstrap because it is easy to misapply. 
For example, one can always compute $\mathrm{se}_{\text{Boot}}(\hat\theta)$ 
by using the formula in (12.1). But this estimate is inconsistent if, for 
example, the bootstrap resampling scheme assumes independent observations 
when observations are in fact correlated. And in some cases, 
$\mathrm{Var}(\hat\theta)$ does not exist, even asymptotically. Then 
$\mathrm{se}_{\text{Boot}}(\hat\theta)$ is estimating a nonexistent standard 
deviation. 


The bootstraps presented in this chapter assume independence of 
observations or of clusters of observations. This does permit dependence 
via clustering, provided that observations are combined into clusters that are 
independent and the bootstrap is over the clusters. Then the bootstrap 
commands given in this chapter should include the cluster(varlist) 
option, where varlist denotes the clustering variables. The 
idcluster(newvar) option may additionally be needed; see section 12.3.5. 


The bootstraps also assume that the estimator is a smooth estimator that 
is root-N consistent and asymptotically normally distributed. Some variations 
of the bootstrap can be applied to more complicated cases than this, but one 
should first read relevant journal articles. In particular, care is needed for 
estimators with a nonparametric component, for nonsmooth estimators, and 
for dependent data. 


The Stata defaults for the number of bootstrap replications are set very 
low to speed up computation time. These values may be adequate for 
exploratory data analysis but should be greatly increased for published 
results; see section 12.3.4. And for published results, the seed should be set, 
using set seed, rather than using a sequence initially set by the default 
value used by Stata, to enable replication. 


12.3 Bootstrap pairs using the vce(bootstrap) option 


The most common use of the bootstrap is to obtain a consistent estimate of 
the standard errors of an estimator, with no asymptotic refinement. With 
standard Stata estimation commands, this can be easily done by using the 
vce(bootstrap) option. 


12.3.1 Bootstrap-pairs method to estimate the variance-covariance 
matrix of the estimator 


Let $w_i$ denote all the data for the ith observation. Most often, 
$w_i = (y_i, x_i)$, where y is a scalar dependent variable and x is a 
regressor vector. More generally, $w_i = (y_i, x_i, z_i)$, where now there 
may be several dependent variables and z denotes instruments. We assume 
$w_i$ is i.i.d. over i. 


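Stata uses the following bootstrap-pairs algorithm: 


1. Repeat steps a and b B independent times: 
   a. Draw a bootstrap sample of size N by sampling with replacement 
      from the original data $w_1, \ldots, w_N$. Denote the bootstrap sample 
      by $w_1^*, \ldots, w_N^*$. 
   b. Calculate an estimate, $\hat\theta^*$, of $\theta$ based on 
      $w_1^*, \ldots, w_N^*$. 


2. Given the B bootstrap estimates, denoted by 
   $\hat\theta_1^*, \ldots, \hat\theta_B^*$, the bootstrap estimate of the 
   variance-covariance matrix of the estimator (VCE) is 


$$\hat{V}_{\text{Boot}}(\hat\theta) = \frac{1}{B-1} \sum_{b=1}^{B} \left(\hat\theta_b^* - \bar\theta^*\right)\left(\hat\theta_b^* - \bar\theta^*\right)'$$


   where $\bar\theta^* = B^{-1} \sum_{b=1}^{B} \hat\theta_b^*$. 


The corresponding standard-error estimate of the jth component of 
$\hat\theta$ is then 


$$\mathrm{se}_{\text{Boot}}(\hat\theta_j) = \left\{\hat{V}_{\text{Boot},jj}(\hat\theta)\right\}^{1/2}$$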


The bootstrap resamples differ in the number of occurrences of each 
observation. For example, the first observation may appear twice in the first 
bootstrap sample, zero times in the second sample, once in the third sample, 
once in the fourth sample, and so on. 
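
The bootstrap-pairs steps can be sketched in Python for the simplest case of a scalar slope from a regression of y on one regressor. This is an illustrative sketch under our own assumptions, not the code Stata runs: the names ols_slope and pairs_bootstrap, the simulated data, and the seeds are hypothetical. Each (y, x) pair is resampled as a unit, so some pairs appear several times in a resample and others not at all.

```python
import random
from statistics import mean, stdev


def ols_slope(pairs):
    """OLS slope from a regression of y on an intercept and x,
    where pairs is a list of (y, x) tuples."""
    xbar = mean([x for _, x in pairs])
    ybar = mean([y for y, _ in pairs])
    sxy = sum((x - xbar) * (y - ybar) for y, x in pairs)
    sxx = sum((x - xbar) ** 2 for _, x in pairs)
    return sxy / sxx


def pairs_bootstrap(pairs, estimator, reps=400, seed=10101):
    """Return reps bootstrap estimates from resampling whole (y, x) pairs."""
    rng = random.Random(seed)
    n = len(pairs)
    estimates = []
    for _ in range(reps):
        # Step 1a: draw N pairs with replacement
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        # Step 1b: re-estimate on the resample
        estimates.append(estimator(resample))
    return estimates


# Hypothetical data: y = 1 + 0.5 x + noise, N = 50
rng = random.Random(3)
data = []
for _ in range(50):
    x = rng.gauss(0, 1)
    data.append((1.0 + 0.5 * x + rng.gauss(0, 1), x))

draws = pairs_bootstrap(data, ols_slope)
se_boot = stdev(draws)  # step 2 for a scalar: divisor is reps - 1
print(ols_slope(data), se_boot)
```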


The method is called bootstrap pairs or paired bootstrap because in the 
simplest case $w_i = (y_i, x_i)$ and the pair $(y_i, x_i)$ is being 
resampled. It is also called a case bootstrap because all the data for the 
ith case are resampled. And it is called a nonparametric bootstrap because 
no information about the conditional distribution of $y_i$ given $x_i$ is 
used. For cross-sectional estimation commands, this bootstrap gives the same 
standard errors as those obtained by using the vce(robust) option if 
$B \to \infty$, aside from possible differences due to degrees-of-freedom 
correction that disappear for large N. 


This bootstrap method is easily adapted to cluster pairs bootstraps. Then 
$w_i$ becomes $w_c$, where $c = 1, \ldots, C$ denotes each of the C 
clusters, data are independent over c, resampling is over clusters, and the 
bootstrap resample is of size C clusters. 


12.3.2 The vce(bootstrap) option 


The bootstrap-pairs method to estimate the VCE can be obtained for most 
Stata cross-sectional estimation commands by using the estimator command 
option, 


    vce(bootstrap [, bootstrap_options]) 


which is often abbreviated to vce(boot). We list many of the options in 
section 12.4.1 and illustrate some of the options in the following example. 


The vce(bootstrap) option is also available for most panel-data xt 
estimation commands. The default bootstrap is then actually a cluster 
bootstrap over individuals i, rather than one over the individual 
observations (i,t). 


12.3.3 Bootstrap standard-errors example 


We demonstrate the bootstrap using the same data on doctor visits (docvis) 
as those introduced in chapter 10, except that we use 1 regressor (chronic) 
and just the first 50 observations. This keeps output short, reduces 
computation time, and restricts attention to a small sample where the gains 
from asymptotic refinement may be greater. 


. * Sample is 50 observations and a few variables from chapter 10 data 
. qui use mus212bootdata 

. summarize 

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      docvis |         50        4.12     7.82106          0         43
         age |         50       4.162    1.160382        2.6        6.2
     chronic |         50         .28    .4535574          0          1


For standard error computation, we set the number of bootstrap 
replications to 400. We have 


. * Option vce(bootstrap) to compute bootstrap standard errors 
. poisson docvis chronic, vce(bootstrap, reps(400) seed(10101) nodots) 

Poisson regression                              Number of obs =     50
                                                Replications  =    400
                                                Wald chi2(1)  =   3.33
                                                Prob > chi2   = 0.0679
Log likelihood = -238.75384                     Pseudo R2     = 0.0917

             |    Observed   Bootstrap              Normal-based
      docvis | coefficient   std. err.      z    [95% conf. interval]
-------------+-------------------------------------------------------
     chronic |    .9833014    .5386575    1.83   -.0724478   2.039051
       _cons |    1.031602    .3536507    2.92     .338459   1.724744

The output is qualitatively the same as that obtained by using any other 
method of standard error estimation. Quantitatively, however, the standard 
errors change, leading to different test statistics, z-values, and p-values. For 
chronic, the standard error of 0.539 is similar to the robust estimate of 
0.515 given in the last column of the results from estimates table in the 
next section. Both standard errors control for Poisson overdispersion and are 
much larger than the default standard errors from poisson. 


12.3.4 How many bootstraps? 


More is better because when $B = \infty$, the same results are obtained 
regardless of the initial seed. The Stata default is to perform 50 bootstrap 
replications to minimize computation time. This value may be useful during 
the modeling cycle, but for final results given in an article, this value is 
much too low. 


At a time of much less computing power, Efron and Tibshirani (1993, 
52) stated that for standard error estimation "B = 50 is often enough to give 
a good estimate" and "very seldom are more than B = 200 replications 
needed." Andrews and Buchinsky (2000) show that the bootstrap estimate of 
the standard error of $\hat\theta$ with B = 384 is within 10% of that with 
$B = \infty$ with a probability of 0.95, in the special case that 
$\hat\theta$ has no excess kurtosis. The community-contributed bssize 
command (Poi 2004) performs the calculations needed to implement the methods 
of Andrews and Buchinsky (2000). In this book, we often use B = 400 for 
bootstrap estimation of standard errors. For published research, B should be 
much higher than this. 


For bootstrap confidence intervals and hypothesis tests, B needs to be 
even higher than for standard error estimation because confidence levels 
and test sizes use tails of the distribution of the estimator. For tests at 
the $\alpha$ level or for $100(1 - \alpha)$% confidence intervals, there are 
reasons for choosing B so that $\alpha(B + 1)$ is an integer. In subsequent 
analysis, we use B = 999 for confidence intervals and hypothesis tests when 
$\alpha = 0.05$. Ideally, published results use a much higher value, such as 
B = 9999, so that the value of the seed has little effect on results. 
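
A quick way to check whether a candidate B satisfies the integer condition is sketched below in Python. The helper names tail_rank and is_convenient are our own, and exact fractions are used so that floating-point rounding cannot affect the answer.

```python
from fractions import Fraction


def tail_rank(B, alpha):
    """Return alpha * (B + 1) exactly; alpha is passed as a string such as
    "0.05" so that it is converted to an exact fraction."""
    return Fraction(alpha) * (B + 1)


def is_convenient(B, alpha="0.05"):
    """True when alpha * (B + 1) is an integer."""
    return tail_rank(B, alpha).denominator == 1


for B in (400, 999, 1000, 9999):
    print(B, tail_rank(B, "0.05"), is_convenient(B))
```

With B = 999 and a 0.05 level the product is exactly 50, whereas B = 1000 gives 1001/20, which is not an integer.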


To see the effects of the number of bootstraps on standard error 
estimation, we compare results with very few bootstraps, B = 50, using two 
different seeds, and with many bootstraps, B = 2000, and two different 
seeds. We also present the robust standard error obtained by using the 
vce(robust) option. We have 


. * Bootstrap standard errors for different reps and seeds 
. qui poisson docvis chronic, vce(bootstrap, reps(50) seed(10101)) 
. estimates store boot50 

. qui poisson docvis chronic, vce(bootstrap, reps(50) seed(20202)) 
. estimates store boot50diff 

. qui poisson docvis chronic, vce(bootstrap, reps(2000) seed(10101)) 
. estimates store b2000 

. qui poisson docvis chronic, vce(bootstrap, reps(2000) seed(20202)) 
. estimates store b2000diff 

. qui poisson docvis chronic, vce(robust) 
. estimates store robust 


. estimates table boot50 boot50diff b2000 b2000diff robust, b(%8.5f) se(%8.5f) 


Variable boot50 boot50~f b2000 b2000d~f robust 


chronic 0.98330 0.98330 0.98330 0.98330 0.98330 
0.45444 0.59923 0.54178 0.53413 0.51549 

_cons 1.03160 1.03160 1.03160 1.03160 1.03160 
0.37131 0.37533 0.36414 0.36428 0.34467 


Legend: b/se 


Comparing the two replications with B = 50 but different seeds, we see that 
the standard error of chronic differs by about 0.15 (0.454 versus 0.599), 
roughly one-third of the smaller value. For B = 2000, the bootstrap standard 
errors still differ from each other (0.542 versus 0.534) and also from the 
robust standard errors (0.515), partly because of the use of the 
multiplicative finite-sample adjustment N/(N - K) with N = 50 in calculating 
robust standard errors. 


12.3.5 Cluster pairs bootstrap 


For cross-sectional estimation commands, the vce(bootstrap) option 
performs a paired bootstrap that assumes independence over i. The bootstrap 
resamples are obtained by sampling from the individual observations with 
replacement. 


The data may instead be clustered, with observations correlated within 
cluster and independent across clusters. The vce(bootstrap, 
cluster(varlist)) option performs a cluster bootstrap that samples the 
clusters with replacement. If there are C clusters, then the bootstrap 
resample has C clusters. This means that the number of observations 
$N = \sum_{c} N_c$ may vary across bootstrap resamples, but this poses no 
problem. 
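
The following Python sketch illustrates why the resample size can vary. It is a toy example under our own assumptions: the function name cluster_bootstrap_sample and the three unequal-sized clusters "a", "b", and "c" are hypothetical. Whole clusters are drawn with replacement, and every observation in a drawn cluster is kept.

```python
import random
from collections import defaultdict


def cluster_bootstrap_sample(rows, rng):
    """rows is a list of (cluster_id, value) tuples. Draw as many clusters
    as there are distinct clusters, with replacement, and return all
    observations belonging to the drawn clusters."""
    by_cluster = defaultdict(list)
    for cluster_id, value in rows:
        by_cluster[cluster_id].append(value)
    ids = sorted(by_cluster)
    drawn = [rng.choice(ids) for _ in ids]  # C draws from the C clusters
    return [(cid, v) for cid in drawn for v in by_cluster[cid]]


# Hypothetical data: clusters of size 3, 1, and 2
rows = [("a", 1.0), ("a", 2.0), ("a", 3.0),
        ("b", 4.0),
        ("c", 5.0), ("c", 6.0)]
rng = random.Random(10101)
sizes = [len(cluster_bootstrap_sample(rows, rng)) for _ in range(5)]
print(sizes)  # each resample has 3 clusters but between 3 and 9 observations
```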


As an example, 


. * Option vce(bootstrap, cluster) to compute cluster-bootstrap standard errors 
. poisson docvis chronic, vce(bootstrap, cluster(age) reps(400) seed(10101) nodots) 


Poisson regression Number of obs = 50 
Replications = 400 
Wald chi2(1) = 4.19 
Prob > chi2 = 0.0407 
Log likelihood = -238.75384 Pseudo R2 = 0.0917 


(Replications based on 26 clusters in age) 


             |    Observed   Bootstrap                        Normal-based
      docvis | coefficient   std. err.      z    P>|z|    [95% conf. interval]
-------------+----------------------------------------------------------------
     chronic |    .9833014    .4805585    2.05   0.041    .0414241    1.925179
       _cons |    1.031602    .2936567    3.51   0.000    .4560451    1.607158


The cluster-pairs bootstrap estimate of the standard error of 
$\hat\beta_{\text{chronic}}$ is 0.481. If we instead obtain the usual 
(nonbootstrap) cluster-robust standard errors, using the vce(cluster age) 
option, the cluster estimate of the standard error is 0.449. 


In this somewhat manufactured example, the cluster-robust standard 
errors are similar to the heteroskedastic-robust standard errors. In practice, 
cluster-robust standard errors can be much larger if errors are clustered; see 
section 3.4.6. 


The asymptotic theory for inference with clustered data is based on the 
assumption that the number of clusters $G \to \infty$. In many applications, 
there may be few clusters; for example, clustering may be on region with only 
10 regions. Then there can be considerable size distortion in hypothesis 
tests using either cluster-robust standard errors or cluster-bootstrap 
standard errors. Then one should use the wild cluster bootstrap presented in 
section 12.6, which provides an asymptotic refinement. 


Some applications use cluster identifiers in computing estimators. For 
example, suppose cluster-specific indicator variables (fixed effects) are 
directly included as regressors. This can be done, for example, by using 
factor-variable notation and the regressors i.id, where id is the cluster 
identifier. If the first cluster in the original sample appears twice in a 
cluster-bootstrap resample, then its cluster dummy will be nonzero twice in 
the resample, rather than once, and the cluster dummies will no longer be 
unique to each observation in the resample. For the bootstrap resample, we 
should instead define a new set of C unique cluster dummies that will each be 
nonzero exactly once. The idcluster(newvar) option does this, creating a 
new variable containing a unique identifier for each observation in the 
resampled cluster. This is particularly relevant for estimation with fixed 
effects, including fixed-effects panel-data estimators. 
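
The relabeling idea can be illustrated with a small Python sketch. This is not how Stata implements idcluster(); the function name relabel_clusters and the cluster labels are hypothetical. Each drawn cluster receives a new identifier, so a cluster drawn twice is treated as two distinct clusters in the resample.

```python
def relabel_clusters(drawn_ids):
    """Pair each drawn cluster with a new identifier 1, 2, ..., C so that
    repeated draws of the same original cluster get distinct labels."""
    return [(new_id, old_id) for new_id, old_id in enumerate(drawn_ids, start=1)]


# Hypothetical resample in which original cluster "a" was drawn twice
drawn = ["a", "c", "a"]
mapping = relabel_clusters(drawn)
print(mapping)  # [(1, 'a'), (2, 'c'), (3, 'a')]
```

Cluster-specific dummies built from the new identifiers then number C in every resample, regardless of how many original clusters were drawn more than once.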


For most xt commands, the vce(bootstrap) option default is to perform 
a cluster bootstrap with clustering on the cluster identifier specified in the 
xtset command. For xt commands with the fe option, there is no need to 
use the idcluster() option because the xt command is coded to account for 
the preceding complication. 


For estimation commands without the vce(bootstrap, cluster()) 
option, one can instead use the bootstrap prefix with option cluster(); see 
section 12.4.1. 


12.3.6 Bootstrap confidence intervals 


The output after a command with the vce(bootstrap) option includes a 
"normal-based" 95% confidence interval for $\theta$ that equals 


$$\left[\hat\theta - 1.96 \times \mathrm{se}_{\text{Boot}}(\hat\theta),\ \hat\theta + 1.96 \times \mathrm{se}_{\text{Boot}}(\hat\theta)\right]$$


and is a standard Wald asymptotic confidence interval, except that the 
bootstrap is used to compute the standard error. 


Additional confidence intervals can be obtained by using the estat 
bootstrap postestimation command, defined in the next section. 


The percentile method uses the relevant percentiles of the empirical 
distribution of the B bootstrap estimates 
$\hat\theta_1^*, \ldots, \hat\theta_B^*$. In particular, a percentile 95% 
confidence interval for $\theta$ is 


$$\left[\hat\theta^*_{0.025},\ \hat\theta^*_{0.975}\right]$$


ranging from the 2.5th percentile to the 97.5th percentile of 
$\hat\theta_1^*, \ldots, \hat\theta_B^*$. This confidence interval has the 
advantage of being asymmetric around $\hat\theta$ and being invariant to 
monotonic transformation of $\theta$. Like the normal-based confidence 
interval, it does not provide an asymptotic refinement, but there are still 
theoretical reasons to believe it provides a better approximation than the 
normal-based confidence interval. 
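
A percentile interval is simple to compute once the bootstrap estimates are in hand, as the Python sketch below suggests. The function name percentile_ci, the simulated data, and in particular the rank rule (B + 1)q for picking order statistics are our own assumptions; this is one common convention and not necessarily the rule that Stata applies.

```python
import random
from statistics import mean


def percentile_ci(estimates, level=0.95):
    """Percentile interval from order statistics of the bootstrap
    estimates, using ranks (B + 1) * alpha / 2 and (B + 1) * (1 - alpha / 2)."""
    ordered = sorted(estimates)
    B = len(ordered)
    alpha = 1 - level
    lower_rank = max(1, round((B + 1) * alpha / 2))
    upper_rank = min(B, round((B + 1) * (1 - alpha / 2)))
    return ordered[lower_rank - 1], ordered[upper_rank - 1]


# Hypothetical example: B = 999 bootstrap draws of a sample mean
rng = random.Random(10101)
sample = [rng.expovariate(1.0) for _ in range(50)]
n = len(sample)
draws = [mean([sample[rng.randrange(n)] for _ in range(n)]) for _ in range(999)]
ci_low, ci_high = percentile_ci(draws)
print(ci_low, mean(sample), ci_high)
```

With B = 999 the two ranks are exactly 25 and 975, which is one reason for choosing B so that the tail probability times B + 1 is an integer.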


The validity of bootstrap confidence intervals and tests requires 
convergence of the bootstrap distribution. The validity of bootstrap standard 
errors requires stronger uniform integrability conditions because 
convergence in distribution does not imply convergence in moments. So the 
bootstrap percentile method requires weaker assumptions than the “normal- 
based” method, which uses the bootstrap standard errors. 


The BC method is a modification of the percentile method that 
incorporates a bootstrap estimate of the finite-sample bias in $\hat\theta$. 
For example, if the estimator is upward biased, as measured by estimated 
median bias, then the confidence interval is moved to the left. So if 40%, 
rather than 50%, of $\hat\theta_1^*, \ldots, \hat\theta_B^*$ are less than 
$\hat\theta$, then a BC 95% confidence interval might use 
$[\hat\theta^*_{0.007}, \hat\theta^*_{0.927}]$, say, rather than 
$[\hat\theta^*_{0.025}, \hat\theta^*_{0.975}]$. 
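
The shifted percentiles in this example can be reproduced with the standard bias-correction formula, in which the percentiles are $\Phi(2z_0 \mp 1.96)$ and $z_0 = \Phi^{-1}(p)$ for p equal to the share of bootstrap estimates below $\hat\theta$. The Python sketch below is our own illustration of that formula; the function name bc_percentiles is hypothetical, and the exact expressions used by Stata are given in [R] bootstrap.

```python
from statistics import NormalDist


def bc_percentiles(share_below, level=0.95):
    """Bias-corrected percentile levels given the share of bootstrap
    estimates that fall below the original estimate."""
    std_normal = NormalDist()
    z0 = std_normal.inv_cdf(share_below)          # median-bias correction
    z = std_normal.inv_cdf(1 - (1 - level) / 2)   # 1.96 for a 95% interval
    return std_normal.cdf(2 * z0 - z), std_normal.cdf(2 * z0 + z)


lower, upper = bc_percentiles(0.40)
print(round(lower, 3), round(upper, 3))  # 0.007 0.927
```

When exactly half of the bootstrap estimates lie below the original estimate, $z_0 = 0$ and the function returns the usual 0.025 and 0.975.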

The bias-corrected accelerated (BCa) confidence interval is an adjustment 
to the BC method that adds an "acceleration" component that permits the 
asymptotic variance of $\hat\theta$ to vary with $\theta$. This requires the 
use of a jackknife that can add considerable computational time and is not 
possible for all estimators. The formulas for BC and BCa confidence intervals 
are given in [R] bootstrap and in books such as Efron and Tibshirani (1993, 
185) and Davison and Hinkley (1997, 204). 


The BCa confidence interval has the theoretical advantage over the other 
confidence intervals that it offers an asymptotic refinement, defined in 
section 12.2.3. So a BCa 95% confidence interval has a coverage rate of 
$0.95 + O(N^{-1})$, compared with $0.95 + O(N^{-1/2})$ for the other methods. 


An alternative, quite general method to obtain an asymptotic refinement 
is to use the percentile-t method, presented in section 12.5.1 and applied in 
section 12.6. 


In practice, microeconometrics studies use the percentile-t method if 
asymptotic refinement is desired; the BC and BCa confidence intervals are 
seldom used. The estat bootstrap command does not provide percentile-t 
confidence intervals, but these can be obtained by using the bootstrap 
prefix, as we demonstrate in section 12.5.3, and for a wild bootstrap by using 
the user-provided boottest command that is presented in section 12.6.2. 


12.3.7 The estat bootstrap postestimation command 


The estat bootstrap command can be issued after an estimation command 
that has the vce(bootstrap) option or after the bootstrap prefix. The 
syntax for estat bootstrap is 


    estat bootstrap [, options] 


where the options include normal for normal-based confidence intervals, 
percentile for percentile-based confidence intervals, bc for BC confidence 
intervals, and bca for BCa confidence intervals. If you want to use the 
bca option, the preceding bootstrap must be done with the bca option 
specified to perform the necessary additional jackknife computation. The 
all option provides all available confidence intervals. 


12.3.8 Bootstrap confidence-intervals example 


We obtain these various confidence intervals for the Poisson example. To 
obtain the BCa interval, we must use the bca option with the original 
bootstrap. To speed up bootstraps, we should include only necessary 
variables in the dataset. For reasonable bootstrap precision, we set B = 999. 


We have 


. * Bootstrap confidence intervals: Normal based, percentile, BC, and BCa 
. qui poisson docvis chronic, vce(bootstrap, reps(999) seed(10101) bca) 

. estat bootstrap, all 

Poisson regression                              Number of obs =     50
                                                Replications  =    999

             |    Observed               Bootstrap
      docvis | coefficient       Bias    std. err.  [95% conf. interval]
-------------+-------------------------------------------------------------------
     chronic |   .98330144   .0132307   .54137854    -.077781   2.044384   (N)
             |                                      -.0139438   2.061742   (P)
             |                                       -.079079   2.019438  (BC)
             |                                       .0295944    2.08349 (BCa)
       _cons |   1.0316016  -.0769223   .35685342    .3321817   1.731021   (N)
             |                                        .186586   1.582409   (P)
             |                                        .268264   1.641356  (BC)
             |                                        .386773   1.771351 (BCa)

Key:   N: Normal
       P: Percentile
      BC: Bias-corrected
     BCa: Bias-corrected and accelerated


The confidence intervals for $\beta_{\text{chronic}}$ are, respectively, 
[-0.08, 2.04], [-0.01, 2.06], [-0.08, 2.02], and [0.03, 2.08]. The 
differences here are not great. Only the normal-based confidence interval is 
symmetric about $\hat\beta_{\text{chronic}}$. 


12.3.9 Bootstrap estimate of bias 


Suppose that the estimator $\hat\theta$ is biased for $\theta$. Let 
$\bar\theta^*$ be the average of the B bootstraps and $\hat\theta$ be the 
estimate from the original model. Note that $\bar\theta^*$ is not an 
unbiased estimate of $\theta$. Instead, the difference 
$\bar\theta^* - \hat\theta$ provides a bootstrap estimate of the bias of the 
estimate $\hat\theta$. The bootstrap views the data-generating process (DGP) 
value as $\hat\theta$, and $\bar\theta^*$ is viewed as the mean of the 
estimator given this DGP value. 


Below, we list e(b_bs), which contains the average of the bootstrap 
estimates. 


. matrix list e(b_bs) 

e(b_bs)[1,2]
        docvis:     docvis:
       chronic       _cons
y1   .99653216   .95467931


The above output indicates that $\bar\theta^* = 0.9965$; the output from 
estat bootstrap, all indicates that $\hat\theta = 0.9833$. Thus, the 
bootstrap estimate of bias is 0.0132, which is reported in the estat 
bootstrap, all output. Because $\hat\theta = 0.9833$ is upward biased by 
0.0132, we must subtract this bias to get a BC estimate of $\theta$ that 
equals 0.9833 - 0.0132 = 0.9701. Such BC estimates are not used often, 
however, because the bootstrap estimate of mean bias is a very noisy 
estimate; see Efron and Tibshirani (1993, 138). 
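
The bias arithmetic can be checked directly from the reported values. In the short Python sketch below, the variable names are our own and the two inputs are the chronic coefficient and its bootstrap average copied from the output above; note that the BC estimate is equivalently $2\hat\theta - \bar\theta^*$.

```python
theta_hat = 0.98330144       # original estimate of the chronic coefficient
theta_bar_star = 0.99653216  # average of the bootstrap estimates, e(b_bs)

bias = theta_bar_star - theta_hat  # bootstrap estimate of bias
bc_estimate = theta_hat - bias     # same as 2 * theta_hat - theta_bar_star

print(round(bias, 4), round(bc_estimate, 4))  # 0.0132 0.9701
```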


12.4 Bootstrap pairs using the bootstrap command 


The bootstrap prefix can be applied to a wide range of Stata commands 
such as nonestimation commands, community-contributed commands, two-step 
estimators, and Stata estimators without the vce(bootstrap) option. 
Before doing so, the user should verify that the estimator is one for which it 
is appropriate to apply the bootstrap; see section 12.2.4. 


12.4.1 The bootstrap prefix 


The syntax for the bootstrap prefix is 


bootstrap explist [, options eform_option]: command 


The command being bootstrapped can be an estimation command, other 
commands such as summarize, or community-contributed commands. The 
argument explist provides the quantity or quantities to be bootstrapped. 
These can be one or more expressions, possibly given names [so newvar = 
(exp)]. 


For estimation commands, not setting explist or setting explist to _b leads 
to a bootstrap of the parameter estimates. Setting explist instead to _se leads 
to a bootstrap of the standard errors of the parameter estimates. Thus 
bootstrap: poisson y x bootstraps parameter estimates, as does bootstrap 
_b: poisson y x. The bootstrap _se: poisson y x command instead 
bootstraps the standard errors. The bootstrap _b[x]: poisson y x command 
bootstraps just the coefficient of x and not that of the intercept. The 
bootstrap bx=_b[x]: poisson y x command does the same, with the results 
named bx rather than with the default name of _bs_1. 
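
These explist choices can be tried side by side. The following is a sketch on simulated data; the variables y and x, the result names bx and ratio, and the number of replications are assumptions made for illustration.

```stata
* Sketch: alternative explist choices (simulated data, an assumption)
clear all
set seed 10101
set obs 200
generate x = rnormal()
generate y = rpoisson(exp(0.5 + 0.3*x))

* Parameter estimates (explist omitted, equivalent to _b)
bootstrap, reps(100) seed(10101) nodots: poisson y x
* Standard errors of the parameter estimates
bootstrap _se, reps(100) seed(10101) nodots: poisson y x
* Only the coefficient of x, stored under the name bx
bootstrap bx=(_b[x]), reps(100) seed(10101) nodots: poisson y x
* Two named expressions at once, the second a function of both coefficients
bootstrap bx=(_b[x]) ratio=(_b[x]/_b[_cons]), reps(100) seed(10101) nodots: poisson y x
```

The parentheses around each expression are always accepted and avoid parsing problems when an expression contains spaces.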


The options include reps(#) to set the number of bootstrap replications; 
seed(#) to set the random-number generator seed value to enable 
reproducibility; nodots to suppress dots produced for each bootstrap 
replication; group(varname), which may be needed along with 
idcluster(); strata(varlist) for bootstrap over strata; size(#) to draw 
samples of size #; bca to compute the acceleration for a BCa confidence 
interval; and saving() to save results from each bootstrap iteration in a file. 
The eform_option option enables bootstraps for $e^{\theta}$ rather than $\theta$. 


A cluster pairs bootstrap can be performed using the cluster(varlist) 
option; the idcluster(newvar) option is needed for some cluster bootstraps 
(see section 12.3.5). For panel data with identifiers for both individual and 
time set by the xtset command, the cluster bootstrap can then fail with error 
message 


repeated time values within panel 
the most likely cause for this error is misspecifying the cluster(), idcluster(), 
or group() option 


To avoid this, one can xtset just the individual identifier. 


If bootstrap is applied to commands other than Stata estimation 
commands, it produces a warning message. For example, the user-written 
poissrobust command defined below leads to the warning 


Warning: Since poissrobust is not an estimation command or does not set 
e(sample), bootstrap has no way to determine which observations are 
used in calculating the statistics and so assumes that all 

observations are used. This means no observations will be excluded 
from the resampling because of missing values or other reasons. 


If the assumption is not true, press Break, save the data, and drop 
the observations that are to be excluded. Be sure that the dataset 
in memory contains only the relevant data. 


Because we know that this is not a problem in the examples below and we 
want to minimize output, we use the nowarn option to suppress this warning. 


The output from bootstrap includes the bootstrap estimate of the 
standard error of the statistic of interest and the associated normal-based 
95% confidence interval. The estat bootstrap command after bootstrap 
can also compute percentile, BC and BCa confidence intervals. For brevity, we 
do not obtain these additional confidence intervals in the examples below. 
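
For readers who do want the additional intervals, the following sketch shows the two steps involved. It uses simulated data; the variables y and x are hypothetical.

```stata
* Sketch: percentile, BC, and BCa intervals after the bootstrap prefix
* (simulated data, an assumption)
clear all
set seed 10101
set obs 100
generate x = rnormal()
generate y = rpoisson(exp(0.5 + 0.3*x))

* The bca option computes the acceleration needed for BCa intervals
bootstrap, reps(400) seed(10101) nodots bca: poisson y x
estat bootstrap, all          // normal, percentile, BC, and BCa intervals
estat bootstrap, percentile   // percentile intervals only
```

Without the bca option on the bootstrap prefix, estat bootstrap can still report the normal, percentile, and BC intervals.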


12.4.2 Bootstrap parameter estimate from a Stata estimation command 


The bootstrap prefix is easily applied to an existing Stata estimation 
command. It gives exactly the same result as given by directly using the 
Stata estimation command with the vce(bootstrap) option, if this option is 
available and the same values of B and the seed are used. 


We illustrate this for doctor visits. Because we are bootstrapping 
parameter estimates from an estimation command, there is no need to 
provide explist. 


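. * bootstrap command applied to Stata estimation command 
. bootstrap, reps(400) seed(10101) nodots noheader: poisson docvis chronic 

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
      docvis | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     chronic |   .9833014   .5386575     1.83   0.068    -.0724478    2.039051
       _cons |   1.031602   .3536507     2.92   0.004      .338459    1.724744
------------------------------------------------------------------------------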


The results are exactly the same as those obtained in section 12.3.3 by using 
poisson with the vce(bootstrap) option. 
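
The equivalence is easy to check directly. The following is a sketch on simulated data; the variables y and x and the scalar names se_vce and se_prefix are hypothetical.

```stata
* Sketch: the bootstrap prefix versus the vce(bootstrap) option
* (simulated data, an assumption)
clear all
set seed 10101
set obs 100
generate x = rnormal()
generate y = rpoisson(exp(0.5 + 0.3*x))

quietly poisson y x, vce(bootstrap, reps(200) seed(10101))
scalar se_vce = _se[x]
quietly bootstrap, reps(200) seed(10101): poisson y x
scalar se_prefix = _se[x]
display "vce(bootstrap): " se_vce "   bootstrap prefix: " se_prefix
```

The two displayed standard errors should coincide because both runs use the same seed and the same number of replications.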


12.4.3 Bootstrap standard error from a Stata estimation command 


$\widehat{\theta}$ is not an exact estimate of $\theta$, and, in addition, 
se($\widehat{\theta}$) is not an exact estimate of the standard deviation of the 
estimator $\widehat{\theta}$. We consider a bootstrap of the standard error, 
se($\widehat{\theta}$), to obtain an estimate of the standard error of 
se($\widehat{\theta}$). 


We bootstrap both the coefficients and their standard errors. We have 


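. * Bootstrap estimate of the standard error of a coefficient estimate 
. bootstrap _b _se, reps(400) seed(10101) nodots: poisson docvis chronic 

Bootstrap results                               Number of obs     =         50
                                                Replications      =        400

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
docvis       |
     chronic |   .9833014   .5386575     1.83   0.068    -.0724478    2.039051
       _cons |   1.031602   .3536507     2.92   0.004      .338459    1.724744
-------------+----------------------------------------------------------------
docvis_se    |
     chronic |   .1393729   .0236361     5.90   0.000     .0930471    .1856987
       _cons |   .0995037   .0202009     4.93   0.000     .0599106    .1390968
------------------------------------------------------------------------------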


The bootstrap reveals that there is considerable noise in 
se($\widehat{\beta}_{\text{chronic}}$), with an estimated standard error of 0.024 
and the 95% confidence interval [0.093, 0.186]. 


Ideally, the bootstrap standard error of $\widehat{\beta}$, here 0.539, should be 
close to 0.139, the estimated se($\widehat{\beta}$) from the original estimation 
sample. And it should be close to the average of the se($\widehat{\beta}$)'s from 
the 400 bootstraps, which additional code not given here shows to be 0.150. The 
large difference is a clear sign of problems in the method used to obtain 
se($\widehat{\beta}_{\text{chronic}}$). The problem is that the default Poisson 
standard errors were used in poisson above, and given the large overdispersion, 
these standard errors are very poor. If we repeated the exercise with poisson and 
the vce(robust) option, this difference disappears because the robust standard 
error is 0.515. 


12.4.4 Bootstrap standard error from a user-written estimation command 


Continuing the previous example, we would like an estimate of the 
properties of the robust standard errors after Poisson regression. This can be 
obtained by using poisson with the vce(robust) option in the preceding 
bootstrap. We instead use an alternative approach that can be applied in a 
wide range of settings. 


We write a program named poissrobust that returns the Poisson 
maximum-likelihood estimator (MLE) estimates in b and the robust estimate 
of the VCE of the Poisson MLE in V. Then, we apply the bootstrap prefix to 
poissrobust rather than to poisson, vce(robust). 


Because we want to return b and V, the program must be eclass. The 
program is 


* Program to return b and robust estimate V of the VCE 
program poissrobust, eclass 
    version 17 
    tempname b V 
    poisson docvis chronic, vce(robust) 
    matrix `b' = e(b) 
    matrix `V' = e(V) 
    ereturn post `b' `V' 
end 


Next, it is good practice to check the program, typing the commands 


. * Check preceding program by running once 
. poissrobust 
 (output omitted ) 

. ereturn display 
 (output omitted ) 


The omitted output is the same as that from poisson, vce(robust). 


We then bootstrap 400 times. The bootstrap estimate of the standard 
error of se($\widehat{\theta}$) is the standard deviation of the B values of 
se($\widehat{\theta}$). We have 


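. * Bootstrap standard-error estimate of robust standard errors 
. bootstrap _b _se, reps(400) seed(10101) nodots nowarn: poissrobust 

Bootstrap results                               Number of obs     =         50
                                                Replications      =        400

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
docvis       |
     chronic |   .9833014   .5386575     1.83   0.068    -.0724478    2.039051
       _cons |   1.031602   .3536507     2.92   0.004      .338459    1.724744
-------------+----------------------------------------------------------------
docvis_se    |
     chronic |   .5154894   .0772422     6.67   0.000     .3640974    .6668814
       _cons |   .3446734    .065348     5.27   0.000     .2165936    .4727532
------------------------------------------------------------------------------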


There is considerable noise in the robust standard error, with the standard 
error of se($\widehat{\beta}_{\text{chronic}}$) equal to 0.077 and a 95% 
confidence interval of [0.364, 0.667]. The upper limit is about twice the lower 
limit, as was the case for the default standard error. In other examples, robust 
standard errors can be much less precise than default standard errors. 


12.4.5 Bootstrap two-step estimator 


The preceding method of applying the bootstrap prefix to a user-defined 
estimation command can also be applied to a two-step estimator. 


A sequential two-step estimator of, say, $\theta$ is one that depends in part on 
a consistent first-stage estimator, say, $\widehat{\alpha}$. In some examples 
(notably, feasible generalized least squares, where $\alpha$ denotes error 
variance parameters) one can do regular inference, ignoring any estimation error 
in $\widehat{\alpha}$. More generally, however, the asymptotic distribution of 
$\widehat{\theta}$ will depend on that of $\widehat{\alpha}$. Asymptotic results 
do exist that confirm the asymptotic normality of leading examples of two-step 
estimators and provide a general formula for Var($\widehat{\theta}$). But this 
formula is usually complicated, both analytically and in implementation. A much 
simpler method is to use the bootstrap, which is valid if indeed the two-step 
estimator is known to be asymptotically normal. 


A leading example is Heckman’s two-step estimator in the selection 
model; see section 19.6.4. We use the same example as in that section. We 
first read in the data and form the dependent variable dy and the regressor 
list given in xlist. 


. * Set up the selection model two-step estimator data of the tobit chapter 
. qui use mus219mepsambexp, clear 


. global xlist age female educ blhisp totchr ins 


The following program produces the Heckman two-step estimator: 


* Program to return b for Heckman two-step estimator of selection model 
program hecktwostep, eclass 
    version 17 
    tempname b V 
    tempvar xb 
    capture drop invmills 
    * First step: probit for the selection indicator dy 
    probit dy $xlist 
    predict `xb', xb 
    generate invmills = normalden(`xb')/normal(`xb') 
    * Second step: OLS on the selected sample, adding the inverse Mills ratio 
    regress lny $xlist invmills if dy==1 
    matrix `b' = e(b) 
    ereturn post `b' 
end 


This program can be checked by typing hecktwostep in isolation. This leads 
to the same parameter estimates as in section 19.6.4. Here $\theta$ denotes the 
second-stage regression coefficients of regressors and the inverse of the 
Mills ratio. The inverse of the Mills ratio depends on the first-stage probit 
parameter estimates $\widehat{\alpha}$. 


To obtain correct standard errors that control for the two-step estimation, 
we bootstrap, where the bootstrap is resampling all data used in both steps. 


. * Bootstrap for Heckman two-step estimator using tobit chapter example 
. bootstrap _b, reps(400) seed(10101) nodots nowarn: hecktwostep 

Bootstrap results                               Number of obs     =      3,328
                                                Replications      =        400

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |    .202124    .024179     8.36   0.000      .154734    .2495141
      female |   .2891575   .0667082     4.33   0.000     .1584118    .4199032
        educ |   .0119928   .0112873     1.06   0.288    -.0101298    .0341154
      blhisp |  -.1810582   .0622552    -2.91   0.004    -.3030762   -.0590402
      totchr |   .4983315   .0418504    11.91   0.000     .4163063    .5803568
         ins |  -.0474019   .0511954    -0.93   0.354    -.1477431    .0529393
    invmills |  -.4801696   .2696967    -1.78   0.075    -1.008766    .0484263
       _cons |   5.302572   .2753108    19.26   0.000     4.762973    5.842171
------------------------------------------------------------------------------


The standard errors are generally within 10% of those given in chapter 19, 
which are based on analytical results. 


The bootstrap for a sequential two-step estimator has the advantage of 
convenience but can require considerable computational time. An alternative 
way to obtain the standard errors is to stack the equations for each step and 
estimate jointly by (just-identified) generalized method of moments; see 
section 19.6.4 for this approach. 


12.4.6 Bootstrap Hausman test 


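The Hausman test statistic, presented in section 11.9.5, is 


$$H = (\widehat{\theta} - \widetilde{\theta})' \{\widehat{\mathrm{Var}}(\widehat{\theta} - \widetilde{\theta})\}^{-1} (\widehat{\theta} - \widetilde{\theta}) \sim \chi^2(h) \text{ under } H_0$$


where $\widehat{\theta}$ and $\widetilde{\theta}$ are different estimators of $\theta$. 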


Standard implementations of the Hausman test, including the hausman 
command presented in section 11.9.5, require that one of the estimators be 
fully efficient under $H_0$. Great simplification occurs because 
Var($\widehat{\theta} - \widetilde{\theta}$) = Var($\widetilde{\theta}$) - Var($\widehat{\theta}$) 
if $\widehat{\theta}$ is fully efficient under $H_0$. For some 
likelihood-based estimators, correct model specification is necessary for 
consistency, and in that case the estimator is also fully efficient. But often it 
is standard to not require that the estimator be efficient. In particular, if there 
is reason to use robust standard errors, then the estimator is not fully 
efficient. 


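The bootstrap can be used to estimate Var($\widehat{\theta} - \widetilde{\theta}$) 
without the need to assume that one of the estimators is fully efficient under 
$H_0$. The B replications yield B estimates of $\widehat{\theta}$ and 
$\widetilde{\theta}$ and hence of $\widehat{\theta} - \widetilde{\theta}$. We 
estimate Var($\widehat{\theta} - \widetilde{\theta}$) with 
$\{1/(B-1)\} \sum_b (\widehat{\theta}_b^* - \widetilde{\theta}_b^* - \bar{\theta}_{\text{diff}}^*)(\widehat{\theta}_b^* - \widetilde{\theta}_b^* - \bar{\theta}_{\text{diff}}^*)'$, 
where $\bar{\theta}_{\text{diff}}^* = (1/B) \sum_b (\widehat{\theta}_b^* - \widetilde{\theta}_b^*)$. 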


As an example, we consider a Hausman test for endogeneity of a 
regressor based on comparing instrumental-variables (IV) and ordinary 
least-squares (OLS) estimates. Large values of H lead to rejection of the null 
hypothesis that all regressors are exogenous. 


The following program is written for the two-stage least-squares 
example presented in section 7.4.6. 


* Program to return (b1-b2) for Hausman test of endogeneity 
program hausmantest, eclass 
    version 17 
    tempname b bols biv 
    regress ldrugexp hi_empunion totchr age female blhisp linc, vce(robust) 
    matrix `bols' = e(b) 
    ivregress 2sls ldrugexp (hi_empunion = ssiratio) totchr age female blhisp /// 
        linc 
    matrix `biv' = e(b) 
    matrix `b' = `bols' - `biv' 
    ereturn post `b' 
end 


This program can be checked by typing hausmantest in isolation. 


We then run the bootstrap. 


. * Bootstrap estimates for Hausman test using IV chapter example 
. qui use mus207mepspresdrugs, clear 

. bootstrap _b, reps(400) seed(10101) nodots nowarn: hausmantest 

Bootstrap results                               Number of obs     =     10,367
                                                Replications      =        400

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
 hi_empunion |   .8847442   .1876556     4.71   0.000     .5169459    1.252542
      totchr |  -.0090062   .0037012    -2.43   0.015    -.0162604   -.0017519
         age |   .0088707    .002019     4.39   0.000     .0049135    .0128278
      female |   .0710012    .016444     4.32   0.000     .0387715    .1032308
      blhisp |   .0602153   .0169786     3.55   0.000     .0269379    .0934928
        linc |  -.0699006   .0149192    -4.69   0.000    -.0991416   -.0406595
       _cons |  -.8461408   .1909968    -4.43   0.000    -1.220488   -.4717939
------------------------------------------------------------------------------


For the single potentially endogenous regressor, we can use the t statistic 
given above, or we can use the test command. The latter yields 


. * Perform Hausman test on coefficient of the potentially endogenous regressor 
. test hi_empunion 

 ( 1)  hi_empunion = 0

           chi2(  1) =   22.23
         Prob > chi2 =    0.0000


The null hypothesis of regressor exogeneity is strongly rejected. 


The test command can also be used to perform a Hausman test based on 
all regressors. We have 


. * Perform Hausman test on the coefficients of all regressors 
. test hi_empunion totchr age female blhisp linc _cons 

 ( 1)  hi_empunion = 0
 ( 2)  totchr = 0
 ( 3)  age = 0
 ( 4)  female = 0
 ( 5)  blhisp = 0
 ( 6)  linc = 0
 ( 7)  _cons = 0

           chi2(  7) =   23.08
         Prob > chi2 =    0.0017
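
The statistic reported by test is the quadratic form H evaluated at the bootstrap variance estimate. The following sketch makes this explicit on simulated data; the program name olsivdiff, the variables y, x, z, and v, and the data-generating values are hypothetical and unrelated to the drug-expenditure example.

```stata
* Sketch: bootstrap Hausman statistic computed by hand (simulated data, an assumption)
clear all
set seed 10101
set obs 500
generate z = rnormal()                      // instrument
generate v = rnormal()
generate x = 0.8*z + v                      // x is endogenous through v
generate y = 1 + 0.5*x + 0.7*v + rnormal()

program olsivdiff, eclass
    version 17
    tempname b bols biv
    regress y x
    matrix `bols' = e(b)
    ivregress 2sls y (x = z)
    matrix `biv' = e(b)
    matrix `b' = `bols' - `biv'             // OLS minus IV estimates
    ereturn post `b'
end

bootstrap _b, reps(200) seed(10101) nodots nowarn: olsivdiff
matrix d = e(b)                             // observed difference
matrix V = e(V)                             // bootstrap variance of the difference
matrix H = d*invsym(V)*d'
display "H = " H[1,1] ", p = " chi2tail(colsof(d), H[1,1])
test x _cons                                // should reproduce H
```

Because no efficiency assumption is used, V is not forced to be a difference of two variance matrices, which is the source of the robustness.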


The preceding example has wide applicability for robust Hausman tests. 


12.4.7 Bootstrap standard error of the coefficient of variation 


The bootstrap need not be restricted to regression models. A simple example 
is to obtain a bootstrap estimate of the standard error of the sample mean of 
docvis. This can be obtained by using the bootstrap _se: mean docvis 
command. 


A slightly more difficult example is to obtain the bootstrap estimate of 
the standard error of the coefficient of variation ($= s_x/\bar{x}$) of doctor 
visits. The results stored in r() after summarize allow the coefficient of 
variation to be computed as r(sd)/r(mean), so we bootstrap this quantity. 


To do this, we use bootstrap with the expression coeffvar= 
(r(sd)/r(mean)). This bootstraps the quantity r(sd)/r(mean) and gives it 
the name coeffvar. We have 


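. * Bootstrap estimate of the standard error of the coefficient of variation 
. qui use mus212bootdata, clear 

. bootstrap coeffvar=(r(sd)/r(mean)), reps(400) seed(10101) nodots 
> nowarn: summarize docvis 

Bootstrap results                               Number of obs     =         50
                                                Replications      =        400

      Command: summarize docvis
     coeffvar: r(sd)/r(mean)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    coeffvar |   1.898316    .269266     7.05   0.000     1.370564    2.426067
------------------------------------------------------------------------------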


The normal-based bootstrap 95% confidence interval for the coefficient of 
variation is [1.37, 2.43]. 


12.5 Percentile-t bootstraps with asymptotic refinement 


Asymptotic refinement, defined in section 12.2.3, can be obtained using BCa 
confidence intervals that are provided by the estat bootstrap 
postestimation command; see sections 12.3.6-12.3.8. 


In this section, we present an alternative and more widely used method 
to obtain asymptotic refinement. The percentile-t method has general 
applicability to hypothesis testing and confidence intervals. And it is the 
basis for the wild cluster bootstrap, presented in section 12.6, for improved 
inference when there are few clusters. 


12.5.1 Percentile-t method 


Standard asymptotic inference is based on the central limit theorem, which is 
based on a series expansion of the characteristic function that neglects 
higher-order terms. An asymptotic refinement can be obtained by using 
higher-order terms in an Edgeworth series expansion, a power series in 
$N^{-1/2}$. The percentile-t bootstrap that bootstraps the t statistic is an 
empirical method that implements the Edgeworth expansion. A brief 
summary of the theory is given in Cameron and Trivedi (2005, chap. 11.4), 
and more detail is given in, for example, Efron and Tibshirani (1993, 
chap. 22) and Horowitz (2001). 


This approach to obtaining asymptotic refinement requires bootstrapping 
a quantity that is asymptotically pivotal, meaning that its asymptotic 
distribution does not depend on unknown parameters. The estimate 
$\widehat{\theta}$ is not asymptotically pivotal, because its variance depends on 
unknown parameters. Percentile methods therefore do not provide an asymptotic 
refinement unless an adjustment is made, notably, that by the BCa percentile 
method. 


The t statistic is asymptotically pivotal, however, because its asymptotic 
distribution is the standard normal distribution, which has no unknown 
parameters. Percentile-t methods or bootstrap-t methods bootstrap the t 
statistic, 


$$t = (\widehat{\theta} - \theta_0)/\mathrm{se}(\widehat{\theta}) \tag{12.2}$$


where $\theta_0$ is the null hypothesis value of $\theta$. 


The bootstrap views the original sample as the DGP, so the bootstrap sets 
the DGP value of $\theta$ to be $\widehat{\theta}$. So in each bootstrap resample, 
we compute a t statistic centered on $\widehat{\theta}$, 


$$t_b^* = (\widehat{\theta}_b^* - \widehat{\theta})/\mathrm{se}(\widehat{\theta}_b^*) \tag{12.3}$$


where $\widehat{\theta}_b^*$ is the parameter estimate in the bth bootstrap and 
se($\widehat{\theta}_b^*$) is a consistent estimate of the standard error of 
$\widehat{\theta}_b^*$, often a robust or cluster-robust standard error. As 
already noted, $t_b^*$ is centered on $\widehat{\theta}$ because $\widehat{\theta}$ 
is the true DGP value in the bootstrap sample. 


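The B bootstraps yield the t-values $t_1^*, \ldots, t_B^*$, whose empirical 
distribution is used as the estimate of the distribution of the t statistic. 


For a one-sided test of $H_0$, we use as the p-value either 
$B^{-1} \sum_b \mathbf{1}(t_b^* < t)$ or $B^{-1} \sum_b \mathbf{1}(t_b^* > t)$, 
depending on the direction of the alternative. 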


For a two-sided test of $H_0$: $\theta = 0$, there are two ways to compute the 
p-value of the original test statistic $t = \widehat{\theta}/\mathrm{se}(\widehat{\theta})$. 
A symmetric t test uses 


$$p = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}(|t_b^*| > |t|) \tag{12.4}$$


which is the fraction of times in B replications that $|t^*| > |t|$. This method is 
the one necessarily used if we generalize to a test statistic that is always 
positive, such as a test statistic that is asymptotically chi-squared distributed. 


An equal-tail two-sided test instead uses 


$$p = 2 \min\left\{ \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}(t_b^* < t),\ \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}(t_b^* > t) \right\} \tag{12.5}$$


This rejects $H_0$ at level $\alpha$ if either a lower-tail test rejects at level 
$\alpha/2$ or an upper-tail test rejects at level $\alpha/2$. This procedure is 
especially warranted when rejection in one tail is more likely than in the other 
tail, such as for a Wald test based on the two-stage least squares (2SLS) 
estimator, which is biased when instruments are weak. 


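There are several ways to form confidence intervals. A percentile-t 
symmetric 95% confidence interval for $\theta$ uses 


$$[\widehat{\theta} - |t^*|_{0.95} \times \mathrm{se}(\widehat{\theta}),\ \widehat{\theta} + |t^*|_{0.95} \times \mathrm{se}(\widehat{\theta})]$$


where $|t^*|_{0.95}$ is the 95th percentile of the distribution of 
$|t_1^*|, \ldots, |t_B^*|$. 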


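One variant of an equal-tailed percentile-t 95% confidence interval for $\theta$ 
is 


$$[\widehat{\theta} - t_{0.975}^* \times \mathrm{se}(\widehat{\theta}),\ \widehat{\theta} - t_{0.025}^* \times \mathrm{se}(\widehat{\theta})] \tag{12.6}$$


See, for example, Efron and Tibshirani (1993, 173-174) or Hansen (2022a, 
283-284). 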


With the definition in (12.6), the confidence interval can include 0, even 
though we would reject $H_0$: $\theta = 0$ using the equal-tail two-sided test in 
(12.5). An alternative equal-tailed percentile-t 95% confidence interval is 
therefore obtained by inverting the equal-tailed two-sided test defined in 
(12.5); see section 11.3.13 for obtaining a confidence interval by inverting a 
test statistic. 


12.5.2 Percentile-t Wald test 


Stata does not automatically produce the percentile-t method. In this section, 
we present a general method to implement the percentile-t method using the 
bootstrap prefix with bootstrap pairs resampling. Section 12.6 then presents 
the community-contributed boottest command, which implements the 
percentile-t method using wild bootstrap resampling. 


We continue with a count regression of docvis on chronic. A 
complication is that the standard error given in either (12.2) or (12.3) needs 
to be a consistent estimate of the standard deviation of the estimator. So we 
use bootstrap to perform a bootstrap of poisson, where the VCE is estimated 
with the vce(robust) option, rather than using the default Poisson 
standard-error estimates that are greatly downward biased. 


We store the sample parameter estimate and standard error as local 
macros before bootstrapping the t statistic given in (12.3). The 999 bootstrap 
values $t_1^*, \ldots, t_{999}^*$ are saved as variable tstar in percentilet.dta. 


. * Percentile-t for a single coefficient: Bootstrap the t statistic 
. qui use mus212bootdata, clear 

. qui poisson docvis chronic, vce(robust) 

. local theta = _b[chronic] 

. local setheta = _se[chronic] 

. bootstrap tstar=((_b[chronic]-`theta')/_se[chronic]), seed(10101) nodots 
> reps(999) saving(percentilet, replace): poisson docvis chronic, 
> vce(robust) 
(file percentilet.dta not found) 

Bootstrap results                               Number of obs     =         50
                                                Replications      =        999

      Command: poisson docvis chronic, vce(robust)
        tstar: (_b[chronic]-.9833014421442415)/_se[chronic]

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       tstar |          0   1.288018     0.00   1.000     -2.52447     2.52447
------------------------------------------------------------------------------


The output indicates that the distribution of $t^*$ is considerably more 
dispersed than a standard normal, with a standard deviation of 1.29 rather 
than 1.0 for the standard normal. 


The percentile-t tests and confidence intervals for chronic are based on 
the 999 values of $t^*$ saved in percentilet.dta. 


. * percentile-t: Plot the density of tstar 
. use percentilet, clear 
(bootstrap: poisson) 


. tabstat tstar, stats(count mean sd skew kurt) 


Variable N Mean SD Skewness Kurtosis 


tstar 999 .0086038 1.288018 .0729155 3.491272 


. kdensity tstar, bw(0.2) normal legend(off) xtitle("tstar") 
> note(" ") scale(1.2) plotregion(style(none)) title(" ") 


The $t^*$ values have approximate mean of zero, little asymmetry, and little 
nonnormal kurtosis. The big departure from the asymptotic result of an 
N(0, 1) distribution is that the standard deviation of 1.29 is much greater 
than 1. 


Figure 12.1 reveals that the density of $t^*$ is symmetric but is a bit more 
peaked than the normal. And, as already noted, the standard deviation is 
considerably greater than the standard normal value of one. Thus, we expect 
the percentile-t method to lead to larger p-values and wider confidence 
intervals. 


Figure 12.1. Distribution of $t^*$ from pairs percentile-t bootstrap 


The p-value for a symmetric two-sided Wald test of $H_0$: $\beta_{\text{chronic}} = 0$ 
is obtained as follows: 


. * Percentile-t p-value for symmetric two-sided Wald test of H0: theta = 0 
. qui count if abs(`theta'/`setheta') < abs(tstar) 

. display "p-value = " r(N)/_N 
p-value = .14614615 


We do not reject $H_0$: $\beta_{\text{chronic}} = 0$ against 
$H_a$: $\beta_{\text{chronic}} \neq 0$ at the 0.05 level, because p = 0.146 > 0.05. 
By comparison, if we use the usual standard normal critical values, p = 0.056, 
which is considerably smaller. 
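
The equal-tailed p-value in (12.5) can be computed from the same kind of saved file. The following sketch uses simulated data; the variables y and x, the file name pt_equaltail, and the local macro names are hypothetical.

```stata
* Sketch: equal-tailed percentile-t p-value as in (12.5)
* (simulated data, an assumption)
clear all
set seed 10101
set obs 50
generate x = rnormal()
generate y = rpoisson(exp(1 + 0.5*x))

quietly poisson y x, vce(robust)
local theta = _b[x]
local t = _b[x]/_se[x]                   // original t statistic for H0: theta = 0
quietly bootstrap tstar=((_b[x]-`theta')/_se[x]), reps(999) seed(10101) ///
    saving(pt_equaltail, replace): poisson y x, vce(robust)

use pt_equaltail, clear
quietly count if tstar < `t'
local plow = r(N)/_N                     // lower-tail fraction
quietly count if tstar > `t' & !missing(tstar)
local phigh = r(N)/_N                    // upper-tail fraction
display "equal-tailed p-value = " 2*min(`plow', `phigh')
```

The missing-value guard matters only in the upper tail, because Stata treats missing values as larger than any number.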


The above code can be adapted to apply to several or all parameters by 
using the bootstrap prefix to obtain _b and _se, saving these in a file, using 
this file, and computing for each parameter of interest the values $t^*$ given 
$\widehat{\theta}^*$, $\widehat{\theta}$, and se($\widehat{\theta}^*$). 


12.5.3 Percentile-t Wald confidence interval 


We obtain a percentile-t 95% confidence interval for the coefficient of 
chronic using the simpler (12.6), rather than inverting the test defined in 
(12.5), where $t_1^*, \ldots, t_B^*$ were obtained in the previous section. We have 


. * Percentile-t confidence interval 
. _pctile tstar, p(2.5,97.5) 

. scalar lb = `theta' + r(r1)*`setheta' 

. scalar ub = `theta' + r(r2)*`setheta' 

. display "2.5 and 97.5 percentiles of t* distn: " r(r1) ", " r(r2) _n 
> "95 percent percentile-t confidence interval is ("lb "," ub ")" 
2.5 and 97.5 percentiles of t* distn: -2.4142661, 2.6618268 
95 percent percentile-t confidence interval is (-.26122706,2.3554449) 


The confidence interval is [-0.26, 2.36], compared with [-0.03, 1.99], 
which is obtained by using the robust estimate of the VCE with poisson. The 
wider confidence interval is due to the bootstrap-t critical values of -2.41 
and 2.66, which are much larger in absolute value than the standard normal 
critical values of -1.96 and 1.96. The confidence interval is also wider than 
the other bootstrap confidence intervals given in section 12.3.8. 
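
The symmetric percentile-t interval of section 12.5.1 needs only the 95th percentile of the absolute bootstrap t-values. The following sketch uses simulated data; the variables y and x, the file name pt_symmetric, and the scalar names are hypothetical.

```stata
* Sketch: symmetric percentile-t 95% confidence interval
* (simulated data, an assumption)
clear all
set seed 10101
set obs 50
generate x = rnormal()
generate y = rpoisson(exp(1 + 0.5*x))

quietly poisson y x, vce(robust)
local theta = _b[x]
local setheta = _se[x]
quietly bootstrap tstar=((_b[x]-`theta')/_se[x]), reps(999) seed(10101) ///
    saving(pt_symmetric, replace): poisson y x, vce(robust)

use pt_symmetric, clear
generate double abst = abs(tstar)
_pctile abst, p(95)
scalar crit = r(r1)                      // 95th percentile of |t*|
scalar lb = `theta' - crit*`setheta'
scalar ub = `theta' + crit*`setheta'
display "critical value = " crit _n "symmetric 95% CI = (" lb ", " ub ")"
```

By construction this interval is centered on the original estimate, unlike the equal-tailed interval.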


Percentile-t 95% confidence intervals, like BCa confidence intervals, 
have the advantage of having a coverage rate of $0.95 + O(N^{-1})$ rather than 
$0.95 + O(N^{-1/2})$. Efron and Tibshirani (1993, 160, 184, 188, 326) favor the 
BCa method for confidence intervals because it is transformation-respecting 
and percentile-t can perform erratically in small-sample nonparametric 
settings. But they state on page 326 that "generally speaking, the bootstrap-t 
works well for location parameters," and regression coefficients are location 
parameters. Theoretical and empirical studies in econometrics use the 
percentile-t method if asymptotic refinement is desired. 


12.6 Wild bootstrap with asymptotic refinement 


The wild bootstrap is a bootstrap with asymptotic refinement (using 
percentile-t methods) that uses the wild bootstrap resampling method, rather 
than the pairs resampling method used in the preceding section. Monte Carlo 
studies find that the wild bootstrap generally performs better than the pairs 
bootstrap. The wild bootstrap can be easily implemented for many common 
estimators using the boottest command presented below in section 12.6.2. 
And the didregress and xtdidregress commands include the wild cluster 
bootstrap as an option. 


The wild bootstrap was first proposed in econometrics for OLS with 
heteroskedastic errors. See Horowitz (2001, 3215-3217), Davison and 
Hinkley (1997, 272), or Cameron and Trivedi (2005, 376) for discussion. 
There were few applications in this case because if N is so small that 
standard asymptotic methods provide a poor guide, then often there is no 
point in continuing analysis: the small sample size leads to very imprecise 
parameter estimates. 


Subsequently, the wild bootstrap has been extended to two common 
settings in applied microeconometrics studies where the precision of 
estimated coefficients may be reasonable yet standard asymptotic methods 
can perform poorly. 


Cameron, Gelbach, and Miller (2008) proposed cluster-robust inference 
for OLS with clustered data or short panel data with few clusters. Given 
sufficient observations per cluster, estimation can be quite precise, but the 
usual asymptotic methods based on $G \to \infty$, where G is the number of 
clusters, perform very poorly when G is small; see section 6.4.6. 


Davidson and MacKinnon (2010) considered inference for 2SLS with 
heteroskedastic errors when instruments are weak, in which case asymptotic 
theory can perform poorly even for quite large N; see section 7.5. 


The preceding wild bootstraps require resampling residuals, limiting 
analysis to least-squares estimators in models with additive errors. Kline and 
Santos (2012) proposed resampling the score, rather than the residuals. This 
extends the methods to m-estimators, such as the logit MLE and nonlinear 
generalized method of moments estimators. They considered both 
heteroskedastic-robust and cluster-robust inference. 


The discussion below considers all of these examples of the wild 
bootstrap, with focus on the wild cluster bootstrap for OLS. 


Before doing so, we note that developing better inferential methods 
when the usual asymptotic methods perform poorly in finite samples, with 
incorrectly sized tests and confidence intervals with incorrect coverage rates, 
is an active area of research. One approach is to develop new methods in 
specific contexts, such as those for few clusters mentioned in section 6.4.6 
and for weak instruments presented in section 7.7. An alternative approach is 
to use bootstraps with asymptotic refinement. These new methods are 
proven mathematically to be asymptotically better than existing methods and 
hence hopefully perform better in finite samples of typical size. But there is 
no guarantee. Finite-sample performance is established by Monte Carlo 
experiments, with designs that mimic typical settings encountered in 
practice. This improved performance in Monte Carlo experiments does not 
necessarily carry over to a specific application with real data. 


12.6.1 Wild cluster bootstrap 


We consider inference with clustered errors and few clusters in the linear 
model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$. For simplicity, we 
test a restriction on a single parameter, with null hypothesis 
$H_0$: $\beta_j = \beta_{j0}$. Inference is based on the Wald test statistic 
$t = (\widehat{\beta}_j - \beta_{j0})/\mathrm{se}(\widehat{\beta}_j)$, where 
se($\widehat{\beta}_j$) is a cluster-robust standard error and 
$\widehat{\boldsymbol{\beta}}$ is obtained by OLS regression of $\mathbf{y}$ on 
$\mathbf{X}$. A pairs cluster bootstrap resamples over 
$(\mathbf{y}_g, \mathbf{X}_g)$, where g denotes the gth of G clusters. 


We instead use a wild cluster bootstrap. The benchmark method proceeds 
as follows; a number of variants are discussed later. 


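First, obtain residuals 
$\widehat{\mathbf{u}}_g = \mathbf{y}_g - \mathbf{X}_g \widehat{\boldsymbol{\beta}}_{\text{rest}}$ 
for the gth cluster, where $\widehat{\boldsymbol{\beta}}_{\text{rest}}$ is the 
restricted estimate obtained by OLS regression of $\mathbf{y}$ on $\mathbf{X}$ that 
imposes $H_0$. For the test of $H_0$: $\beta_j = 0$, this just drops the jth 
regressor from the equation. 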


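Second, in the bth of B bootstraps, a) for cluster g of G clusters, 
randomly set $\widehat{\mathbf{u}}_g^* = \widehat{\mathbf{u}}_g$ with probability 
0.5 or $\widehat{\mathbf{u}}_g^* = -\widehat{\mathbf{u}}_g$ with probability 0.5; 
b) form new outcome variables 
$\mathbf{y}_g^* = \mathbf{X}_g \widehat{\boldsymbol{\beta}}_{\text{rest}} + \widehat{\mathbf{u}}_g^*$; 
c) regress the resulting $\mathbf{y}^*$ on $\mathbf{X}$, giving estimate 
$\widehat{\boldsymbol{\beta}}^*$ and hence $\widehat{\beta}_j^*$ with cluster-robust 
standard error se($\widehat{\beta}_j^*$); and d) calculate 
$t^* = (\widehat{\beta}_j^* - \beta_{j0})/\mathrm{se}(\widehat{\beta}_j^*)$, which is 
centered on the null value because the bootstrap DGP imposes $H_0$. 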


Third, obtain asymptotic refinement by applying a percentile-t method 
given in section 12.5.1 to the B Wald statistics $t_1^*, \ldots, t_B^*$. 
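
The three steps can be coded directly with a loop. The following is a minimal sketch of the benchmark method for $H_0$: $\beta_x = 0$ with Rademacher weights and a symmetric p-value; the simulated data, the variable names (g, x, y, and so on), and the choice of 10 clusters of 30 observations are assumptions made for illustration.

```stata
* Sketch: wild cluster bootstrap by hand (simulated data, an assumption)
clear all
set seed 10101
set obs 10                               // G = 10 clusters
generate g = _n
generate ag = rnormal()                  // cluster-level error component
generate xg = rnormal()
expand 30                                // 30 observations per cluster
generate x = xg + rnormal()
generate y = 1 + ag + rnormal()          // H0: coefficient of x is 0 holds
sort g

* Original-sample t statistic with cluster-robust standard error
quietly regress y x, vce(cluster g)
scalar tobs = _b[x]/_se[x]

* Step 1: restricted fit imposing H0 (x dropped), fitted values and residuals
quietly regress y
predict double yrest, xb
predict double urest, residuals

* Step 2: B bootstrap t statistics, one Rademacher weight per cluster
local B 999
generate double w = .
generate double ystar = .
scalar nreject = 0
forvalues b = 1/`B' {
    quietly {
        by g: replace w = cond(runiform() < 0.5, -1, 1) if _n == 1
        by g: replace w = w[1]
        replace ystar = yrest + w*urest
        regress ystar x, vce(cluster g)
        scalar nreject = nreject + (abs(_b[x]/_se[x]) > abs(tobs))
    }
}

* Step 3: symmetric percentile-t p-value as in (12.4)
display "t = " tobs ", wild cluster bootstrap p-value = " nreject/`B'
```

The regressors are held fixed across replications, so only the outcome changes from one bootstrap sample to the next.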


This bootstrap is wild in that it chooses only one of two possible values 
for $\widehat{\mathbf{u}}_g^*$ and hence $\mathbf{y}_g^*$. Doing so over G clusters 
yields $2^G$ possible realizations of the data $(\mathbf{y}, \mathbf{X})$. The 
weights on the residuals of either -1 or 1 are called Rademacher weights. Canay, 
Santos, and Shaikh (2021) provide an alternative derivation as a randomization 
test with G fixed and the number of observations per cluster going to infinity. 
In that case, the clusters must satisfy a strong homogeneity restriction. 


12.6.2 The boottest command 


The community-contributed boottest command (Roodman et al. 2019) is a 
postestimation command that can be used following linear OLS and IV 
regression, including regress and ivregress, and nonlinear regression, 
including logit, glm, and gsem. It can also follow linear regression 
commands such as areg, xtreg, fe, and xtivreg, fe that absorb a single set 
of fixed effects. 


The command format is 


boottest [indeplist] [, options] 


where indeplist is a list of hypotheses to be tested separately. In the simplest 
case, indeplist is a single regressor, giving a test that the regressor's 
coefficient is zero, or a single restriction that sets a coefficient to a 
particular value, such as boottest x=0.5. 


The weighttype() option determines the way that the errors $\widetilde{u}_g^*$ 
are derived from the original errors $\widetilde{u}_g$. The default uses 
Rademacher weights that randomly set $\widetilde{u}_g^* = \widetilde{u}_g$ with 
probability 0.5 or $\widetilde{u}_g^* = -\widetilde{u}_g$ with probability 0.5. 
The new residuals have the same mean and variance as the original residuals. 
Mammen and gamma weights additionally match the third moment of the residuals, 
so they are potentially better with skewed residuals. But, unlike Rademacher 
weights, they do not match the fourth moments of the residuals, and simulations 
suggest that matching the fourth moment is more important. Normal weights 
multiply $\widetilde{u}_g$ by a draw from the standard normal distribution. 


The two-point Rademacher weights lead to at most $2^G$ bootstrap 
samples. When $G < 10$, it is better to use the weighttype(webb) option that 
uses 6-point Webb weights that randomly set 
$\widetilde{u}_g^* = a_g \times \widetilde{u}_g$, where $a_g$ takes 1 of the 
6 values $\pm\sqrt{3/2}$, $\pm 1$, $\pm\sqrt{1/2}$ with equal probabilities of 1/6. 
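The moment-matching claims for the weighting schemes can be checked directly. In the sketch below (Python, standard library only), a weight $a$ with $E(a) = 0$ and $E(a^2) = 1$ preserves the mean and variance of the residuals, $E(a^3) = 1$ additionally preserves the third moment, and $E(a^4) = 1$ preserves the fourth. The Mammen values are the two-point weights described in section 12.8.4; the variable and function names are ours.

```python
from math import sqrt

# Each scheme is a list of (weight value, probability) pairs.
rademacher = [(-1.0, 0.5), (1.0, 0.5)]
p_low = (1 + sqrt(5)) / (2 * sqrt(5))            # about 0.7236
mammen = [((1 - sqrt(5)) / 2, p_low), ((1 + sqrt(5)) / 2, 1 - p_low)]
webb = [(s * sqrt(v), 1 / 6) for v in (1.5, 1.0, 0.5) for s in (-1, 1)]

def moment(scheme, k):
    """k-th raw moment E(a^k) of a discrete weight distribution."""
    return sum(prob * value ** k for value, prob in scheme)

for name, scheme in (("Rademacher", rademacher), ("Mammen", mammen), ("Webb", webb)):
    print(name, [round(moment(scheme, k), 4) for k in (1, 2, 3, 4)])

# Number of distinct weight vectors with G clusters: 2^G versus 6^G.
G = 10
print(2 ** G, 6 ** G)
```

All three schemes have mean 0 and variance 1. Only the Mammen weights have third moment 1, and only the Rademacher weights have fourth moment 1; the Webb weights give 7/6, which is close to 1 while allowing many more distinct resamples.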


There are many other options; only a few are mentioned here. The 
seed(#) option should always be used for replicability. The reps(#) option 
has a default of 999. The larger the value, the less the results vary with the 
choice of seed; more generally, set $B$ such that $\alpha(B + 1)$ is an 
integer, where $\alpha$ is the size of the test. The default is to use a 
version of the wild bootstrap that is appropriate given the specified vce() 
for the estimator. The robust and cluster() options override these defaults. 
The nonull option does not impose the null hypothesis before bootstrapping, 
using $y_g^* = X_g\widehat{\beta} + \widehat{u}_g^*$ rather than 
$y_g^* = X_g\widetilde{\beta}_{\mathrm{rest}} + \widetilde{u}_g^*$. This is also 
theoretically justified, but simulations find it is better to impose the null; 
intuitively, efficiency is improved by imposing the null hypothesis. The 
ptype() option sets the p-value type to symmetric (the default), equaltail 
[see (12.4)], lower, or upper. 
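To fix ideas on the p-value types, the following Python sketch computes symmetric, equal-tail, and one-sided p-values from an observed statistic and a list of bootstrap statistics. The definitions used here (strict inequalities, shares out of $B$) are our assumptions for illustration; the conventions used by boottest may differ in detail.

```python
def boot_pvalues(t, tstar):
    """p-values for an observed statistic t from bootstrap statistics tstar."""
    b = len(tstar)
    symmetric = sum(abs(s) > abs(t) for s in tstar) / b   # share with |t*| > |t|
    lower = sum(s < t for s in tstar) / b                 # share with t* < t
    upper = sum(s > t for s in tstar) / b                 # share with t* > t
    equal_tail = min(1.0, 2 * min(lower, upper))
    return {"symmetric": symmetric, "equaltail": equal_tail,
            "lower": lower, "upper": upper}

# Toy bootstrap distribution that is skewed to the right.
example = boot_pvalues(2.5, [-3, -2, -1, 0, 1, 2, 3, 4])
print(example)
```

The symmetric and equal-tail p-values coincide when the bootstrap distribution of $t^*$ is symmetric about zero and can differ noticeably when it is skewed.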


In addition to the Wald test, the boottest command can be applied to the 
score (or Lagrange multiplier) test and, for IV estimation, the 
Anderson-Rubin (AR) test. The latter use of boottest enables cluster-robust 
inference with few clusters when instruments are weak. Finally, the option 
score provides the alternative wild score bootstrap, which can be used for 
nonlinear model estimators such as the logit MLE. 


12.6.3 Wild cluster bootstrap example 


As a few-clusters example, we use a dataset from the U.S. Current 
Population Survey based on the study by Hersch (1998). Interest lies in how 
individual log hourly wage (lnw) varies with the job injury rate in the 
individual's occupation (occrate). The other regressors are potential years 
of work experience and its square (potexp, potexpsq), years of education 
(educ), union membership (union), race (nonwhite), and region dummies 
(northe, midw, west). 


There are two complications. First, occrate is the same for all 
individuals in that occupation, so we need to obtain standard errors that are 
clustered on the occupation identifier occ_id. Second, we use an extract that 
covers only 10 occupations, so G = 10, and there is a few clusters problem. 


We first perform OLS with heteroskedastic-robust standard errors that 
ignore clustering. 


. * Few clusters OLS: Heteroskedastic-robust standard errors 
. qui use mus212occfewcluster, clear 


. global xlist potexp potexpsq educ union nonwhite northe midw west 


. regress lnw occrate $xlist, vce(robust) 


Linear regression                               Number of obs     =      1,594 
                                                F(9, 1584)        =      90.00 
                                                Prob > F          =     0.0000 
                                                R-squared         =     0.3589 
                                                Root MSE          =     .45176 

             |               Robust 
         lnw | Coefficient  std. err.      t    P>|t|     [95% conf. interval] 
     occrate |  -.0283627    .003188    -8.90   0.000    -.0346158   -.0221095 
      potexp |   .0400968   .0037081    10.81   0.000     .0328235    .0473701 
    potexpsq |  -.0005958   .0000843    -7.07   0.000    -.0007612   -.0004304 
        educ |   .0870259   .0062229    13.98   0.000     .0748199    .0992318 
       union |   .2246745   .0310659     7.23   0.000     .1637398    .2856092 
    nonwhite |    -.08293   .0377609    -2.20   0.028    -.1569966   -.0088634 
      northe |   .0462904   .0335869     1.38   0.168    -.0195892    .1121699 
        midw |  -.0175544   .0322813    -0.54   0.587    -.0808729     .045764 
        west |   .0371981   .0338121     1.10   0.271    -.0291231    .1035193 
       _cons |   .9046225   .0957027     9.45   0.000     .7169052     1.09234 


Surprisingly, wages fall with job injury risk, even after controlling for 
regressors that appear to be mostly statistically significant (Hersch [1998] 
obtained the expected positive sign when attention is restricted to female 
workers). 


We then perform OLS with cluster-robust standard errors. The regress 
command bases inference on the t(9) distribution because $G - 1 = 9$. 


. * Few clusters OLS: Cluster-robust standard errors 
. regress lnw occrate $xlist, vce(cluster occ_id) 


Linear regression                               Number of obs     =      1,594 
                                                F(8, 9)           =          . 
                                                Prob > F          =          . 
                                                R-squared         =     0.3589 
                                                Root MSE          =     .45176 

                                (Std. err. adjusted for 10 clusters in occ_id) 

             |               Robust 
         lnw | Coefficient  std. err.      t    P>|t|     [95% conf. interval] 
     occrate |  -.0283627   .0104466    -2.72   0.024    -.0519945   -.0047308 
      potexp |   .0400968    .007381     5.43   0.000     .0233998    .0567938 
    potexpsq |  -.0005958   .0001146    -5.20   0.001     -.000855   -.0003366 
        educ |   .0870259    .017553     4.96   0.001     .0473183    .1267334 
       union |   .2246745   .0887073     2.53   0.032     .0240046    .4253444 
    nonwhite |    -.08293   .0538803    -1.54   0.158    -.2048157    .0389558 
      northe |   .0462904   .0220224     2.10   0.065    -.0035277    .0961085 
        midw |  -.0175544   .0281175    -0.62   0.548    -.0811605    .0460517 
        west |   .0371981   .0348024     1.07   0.313    -.0415303    .1159265 
       _cons |   .9046225   .2433333     3.72   0.005     .3541643    1.455081 


The cluster-robust standard error of the cluster-invariant regressor occrate 
is 3.28 times larger than the heteroskedastic-robust standard error. This 
larger standard error combined with the use of the t(9) distribution results 
in a much larger p-value of 0.024. Note that the overall F statistic is not 
reported. This is because the rank of the cluster-robust VCE can be shown to 
be at most the minimum of the number of regressors ($K$) and $G - 1$, and 
here $K = 10$ and $G - 1 = 9$. We could nonetheless perform, for example, a 
test of eight linearly independent restrictions. 


Simulation studies find that even using t(G — 1) critical values leads to 
overrejection. This is especially the case if the clusters are unbalanced 
because of differences in cluster sizes or substantial differences in regressor 
values across clusters. The community-contributed clusteff command (Lee 
and Steigerwald 2018), based on results in Carter, Schnepel, and 
Steigerwald (2017), provides a conservative estimate of the effective number 
of clusters. 


Applying the clusteff command yields 


. * Few clusters OLS: CSS conservative effective number of clusters 
. clusteff lnw occrate $covars, cluster(occ_id) test(occrate) 
Number of clusters: 10 

Estimated effective number of clusters: 4.6429332 

Warning: G* estimated to be below 50. 


The estimated effective number of clusters $G^*$ equals 4.64. The estimate is 
conservative because it is based on the assumption of perfect intracluster 
correlation of the errors. If we use the ad hoc adjustment of $G^* - 1$ degrees 
of freedom, a t(4) test yields p = 0.053 (=2*ttail(4,2.72)) rather than 
p = 0.024 using the t(9) distribution. 
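The sensitivity of the p-value to the degrees of freedom can be reproduced outside Stata. The sketch below (Python, standard library only) approximates Stata's 2*ttail(df, t) by integrating the Student's t density with Simpson's rule; the function names and the number of integration steps are our choices, and noninteger degrees of freedom such as $G^* - 1 = 3.64$ are also accepted.

```python
from math import gamma, pi, sqrt

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """2 * Pr(T > |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    t = abs(t)
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area = total * h / 3                 # Pr(0 < T < |t|)
    return 2 * (0.5 - area)

# The t statistic for occrate is -2.72.
for df in (9, 4, 3.64):
    print(df, round(two_sided_p(-2.72, df), 3))
```

The same function applies to any of the t-based p-values quoted in this section, for example a statistic of 2.14 with 6 or 3 degrees of freedom.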


We apply the boottest postestimation command with default settings, 
aside from the following adjustments. First, for replicability we set the seed. 
Second, to reduce dependence on the seed value, we set the number of 
simulations to 9,999, rather than the default 999; computation following OLS 
is nonetheless very fast. Third, because with $G = 10$ the default Rademacher 
weights lead to only $2^{10} = 1{,}024$ possible bootstrap samples, we use Webb 
weights. We obtain 


. * Few clusters OLS: Wild cluster bootstrap with Webb weights 
. boottest occrate, seed(10101) reps(9999) weight(webb) 


Wild bootstrap-t, null imposed, 9999 replications, Wald test, bootstrap 
> clustering by occ_id, Webb weights: 
  occrate 

                           t(9) =    -2.7150 
                       Prob>|t| =     0.0230 


95% confidence set for null hypothesis expression: [-.07838, -.01119] 


The test is one of whether the coefficient of occrate equals zero. Using the 
equal-tail two-sided test defined in (12.5) yields p = 0.023, so we reject 
$H_0$ at level 0.05 and conclude that occrate is statistically significant. The 
corresponding confidence interval, found by inverting equal-tail two-sided 
tests, is [-0.078, -0.011]. 


12.6.4 Score-based wild cluster bootstrap 


The wild cluster bootstrap described above resamples the residuals. Kline 
and Santos (2012) proposed instead resampling the score, which is the first 
derivative of the objective function and for OLS equals $X'u$. The advantage 
of resampling the score is that we can extend the method to nonlinear 
models, such as an m-estimator that maximizes the objective function 
$Q(\theta)$ and has score $\partial Q(\theta)/\partial\theta$. 


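Further details are given in Kline and Santos (2012). To give some 
intuition for the method, we consider linear regression. The wild bootstrap 
sets $y^* = X\widetilde{\beta}_{\mathrm{rest}} + \widetilde{u}^*$, so 
$\widehat{\beta}^* = (X'X)^{-1}X'y^* = \widetilde{\beta}_{\mathrm{rest}} + (X'X)^{-1}X'\widetilde{u}^*$. 
It follows that in the $b$th resample, 

$$\widehat{\beta}^* = \widetilde{\beta}_{\mathrm{rest}} + (X'X)^{-1}\sum_{g=1}^{G}(X_g'\widetilde{u}_g)^*$$

where $(X_g'\widetilde{u}_g)^* = a_g X_g'\widetilde{u}_g$ with $a_g$ the wild 
bootstrap weights such as Rademacher weights. So we could obtain the same 
estimate $\widehat{\beta}^*$ by randomly weighting $X_g'\widetilde{u}_g$, rather 
than randomly weighting $\widetilde{u}_g$ and multiplying by $X_g'$. For example, 
randomly set $(X_g'\widetilde{u}_g)^* = X_g'\widetilde{u}_g$ with probability 0.5 
or $(X_g'\widetilde{u}_g)^* = -X_g'\widetilde{u}_g$ with probability 0.5. 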
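The equivalence between weighting residuals and weighting score contributions in the linear model can be verified numerically. The sketch below (Python, standard library only) uses a deliberately simple setup that is our assumption: one regressor, no intercept, simulated data, and $H_0$: $\beta = 0$ imposed, so the restricted estimate is zero.

```python
import random

rng = random.Random(10101)
clusters = [g for g in range(5) for _ in range(4)]      # 5 clusters of 4 observations
x = [rng.gauss(0, 1) for _ in clusters]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

beta_rest = 0.0                                         # H0: beta = 0 imposed
u_rest = [yi - beta_rest * xi for xi, yi in zip(x, y)]  # restricted residuals
sxx = sum(xi * xi for xi in x)                          # scalar analogue of X'X

a = {g: rng.choice((-1.0, 1.0)) for g in set(clusters)} # Rademacher weights a_g

# Route 1: weight the residuals, rebuild y*, and rerun OLS.
ystar = [beta_rest * xi + a[g] * ui for xi, ui, g in zip(x, u_rest, clusters)]
beta_star_resid = sum(xi * yi for xi, yi in zip(x, ystar)) / sxx

# Route 2: weight each cluster's score contribution X_g'u_g directly.
score = {}
for xi, ui, g in zip(x, u_rest, clusters):
    score[g] = score.get(g, 0.0) + xi * ui
beta_star_score = beta_rest + sum(a[g] * s for g, s in score.items()) / sxx

print(beta_star_resid, beta_star_score)
```

The two routes agree up to floating-point rounding. The second route never forms $y^*$, which is why it carries over to estimators defined by a score rather than by a linear regression.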


We illustrate this in a logit model example. The dependent variable 
equals 1 if the hourly wage exceeds 10 and equals 0 otherwise, and the 
number of clusters is reduced to 7. The usual logit estimates follow: 


. * Few clusters logit: Cluster-robust standard errors 
. drop if occ_id==63 | occ_id==113 | occ_id==133 
(461 observations deleted) 


. generate dlnw = lnw > ln(10) 


. logit dlnw occrate $xlist, nolog vce(cluster occ_id) 


Logistic regression                                     Number of obs =  1,133 
                                                        Wald chi2(5)  =      . 
                                                        Prob > chi2   =      . 
Log pseudolikelihood = -585.41192                       Pseudo R2     = 0.2531 

                                 (Std. err. adjusted for 7 clusters in occ_id) 

             |               Robust 
        dlnw | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
     occrate |  -.1227231   .0574737    -2.14   0.033    -.2353694   -.0100768 
      potexp |   .1907595   .0403517     4.73   0.000     .1116716    .2698474 
    potexpsq |   -.003126   .0006825    -4.58   0.000    -.0044637   -.0017883 
        educ |   .3215529   .1051269     3.06   0.002      .115508    .5275978 
       union |   1.640939   .3621219     4.53   0.000     .9311933    2.350685 
    nonwhite |  -.3351905    .215012    -1.56   0.119    -.7566063    .0862253 
      northe |   .2244681    .120476     1.86   0.062    -.0116604    .4605967 
        midw |   .1987152   .1778251     1.12   0.264    -.1498155    .5472459 
        west |   .3509475   .2354718     1.49   0.136    -.1105689    .8124638 
       _cons |  -5.860125   1.386332    -4.23   0.000    -8.577287   -3.142964 


For nonlinear estimators such as logit, Stata computes p-values using the 
standard normal distribution. It is better to at least use the t(G - 1) 
distribution, which for occrate yields p = 0.076, computed as 
2*ttail(6,2.14) or obtained from the command test occrate, df(6). 


With only 7 clusters, there are at most $2^7 = 128$ possible different 
bootstrap samples if the two-point Rademacher weights are used. So we use 
the six-point Webb weights. We obtain 


. * Few clusters logit: Score wild cluster bootstrap with Webb weights 
. boottest occrate, seed(10101) weight(webb) reps(999) 


Re-running regression with null imposed. 


Logistic regression                                     Number of obs =  1,133 
                                                        Wald chi2(8)  = 229.15 
Log likelihood = -606.96732                             Prob > chi2   = 0.0000 

 ( 1)  [dlnw]occrate = 0 

        dlnw | Coefficient  Std. err.      z    P>|z|     [95% conf. interval] 
     occrate |          0  (omitted) 
      potexp |   .1924966   .0244205     7.88   0.000     .1446334    .2403598 
    potexpsq |  -.0030514   .0005195    -5.87   0.000    -.0040696   -.0020333 
        educ |   .3952463   .0371796    10.63   0.000     .3223756     .468117 
       union |   1.310985   .1950676     6.72   0.000     .9286596    1.693311 
    nonwhite |  -.3631165   .2221503    -1.63   0.102     -.798523    .0722901 
      northe |   .3039718   .1968254     1.54   0.122    -.0817988    .6897425 
        midw |   .1495338   .1932623     0.77   0.439    -.2292533    .5283209 
        west |   .4067953     .20723     1.96   0.050      .000632    .8129586 
       _cons |  -7.462204   .5626493   -13.26   0.000    -8.564976   -6.359431 


Score bootstrap-t, null imposed, 999 replications, Wald test, bootstrap 
> clustering by occ_id, Webb weights: 
  occrate 

                              z =    -2.2957 
                       Prob>|z| =     0.1221 

The wild cluster bootstrap p-value for occrate is 0.122 rather than 0.033, 
using standard normal p-values, or 0.076, using t(G - 1) p-values for this 
example with G = 7. We note that a t(3) distribution yields p = 0.122 
because 2*ttail(3,2.14) = 0.1218. 


12.6.5 Wild bootstrap for IV estimation 


Davidson and MacKinnon (2010) propose use of the wild bootstrap 
following IV estimation for independent heteroskedastic errors. The 
boottest command implements this case as well as its extension to clustered 
errors. 


We apply the wild bootstrap to the data used in section 7.7, restricting 
analysis to the first 1,500 observations because computational time is much 
longer than following OLS. There is one endogenous regressor 
(hi_empunion) and two instruments (ssiratio and firmsz). The instruments 
are reasonably strong so that in this example, the wild bootstrap should lead 
to little change. 


Using heteroskedastic-robust standard errors, we obtain 


. * Weak IV with independent observations: Wild bootstrap of Wald test 
. qui use mus207mepspresdrugs, clear 


. keep if _n <= 1500 & linc != . 
(8,921 observations deleted) 


. global x2list totchr age female blhisp linc 


. ivregress 2sls ldrugexp $x2list (hi_empunion = ssiratio firmsz), 
> noheader vce(robust) 


             |               Robust 
    ldrugexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
 hi_empunion |  -.5693183   .4678496    -1.22   0.224    -1.486287      .34765 
      totchr |   .4658999   .0272349    17.11   0.000     .4125205    .5192792 
         age |  -.0033616   .0055211    -0.61   0.543    -.0141828    .0074596 
      female |    -.06974   .0791641    -0.88   0.378    -.2248989    .0854188 
      blhisp |  -.1410898   .1049148    -1.34   0.179     -.346719    .0645395 
        linc |   .0680306   .0488175     1.39   0.163    -.0276499    .1637111 
       _cons |   5.606257    .518471    10.81   0.000     4.590073    6.622442 


Instrumented: hi_empunion 
Instruments: totchr age female blhisp linc ssiratio firmsz 


. boottest hi_empunion, seed(101010) reps(999) 


Wild bootstrap-t, null imposed, 999 replications, Wald test, Rademacher weights: 
  hi_empunion 

                              z =    -1.2169 
                       Prob>|z| =     0.2372 


95% confidence set for null hypothesis expression: [-1.541, .3135] 


For the endogenous regressor hi_empunion, the wild bootstrap p = 0.237 is 
close to 0.224 obtained using the usual asymptotics. 


The AR test is specifically designed to accommodate weak instruments; 
see section 7.7.2. A wild bootstrap of the AR test yields the following. 


. * Weak IV with independent observations: Wild bootstrap of AR test 
. boottest, ar reps(999) 


Wild bootstrap-t, null imposed, 999 replications, Anderson-Rubin Wald test, 
> Rademacher weights: 
  hi_empunion 

                        chi2(2) =     2.2357 
                    Prob > chi2 =     0.3524 


95% confidence set for null hypothesis expression: [-1.868, .4822] 


. qui regress ldrugexp ssiratio firmsz $x2list, vce(robust) 
. test ssiratio firmsz // The standard AR test for this example 


( 1) ssiratio = 0 
( 2) firmsz = 0 


       F(  2,  1438) =    1.11 
Prob > F = 0.3293 


The wild bootstrap p = 0.352 is close to 0.329 obtained using the usual 
asymptotics for the AR test. 


12.7 Bootstrap pairs using bsample and simulate 


The bootstrap prefix can be used only if one can provide a single 
expression for the quantity being bootstrapped. If this is not possible, one 
can use the bsample command to obtain one bootstrap sample, compute the 
statistic of interest for this resample, and use the simulate or postfile 
command to execute this command a number of times. 


12.7.1 The bsample command 


The bsample command draws random samples with replacement from the 
current data in memory. The command syntax is 


bsample [exp] [if] [in] [, options] 


where exp specifies the size of the bootstrap sample, which must be at most 
the size of the selected sample. The strata(varlist), cluster(varlist), 
idcluster(newvar), and weight(varname) options allow stratification, 
clustering, and weighting. The idcluster() option is discussed in 
section 12.3.5. 


In using the bsample command, one should first set the seed for 
reproducibility. 
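As a language-neutral illustration of what resampling with replacement involves, the following Python sketch draws one resample of individual rows and one resample of whole clusters from a seeded generator. It mimics the ideas behind bsample, cluster(), and idcluster() rather than their implementation; the function names and the toy data are hypothetical.

```python
import random

def resample_rows(rows, rng):
    """Draw len(rows) rows with replacement (a pairs bootstrap resample)."""
    return [rng.choice(rows) for _ in rows]

def resample_clusters(rows, cluster_of, rng):
    """Draw whole clusters with replacement, giving each draw a new cluster id."""
    by_cluster = {}
    for row in rows:
        by_cluster.setdefault(cluster_of(row), []).append(row)
    ids = sorted(by_cluster)
    out = []
    for new_id in range(len(ids)):
        picked = rng.choice(ids)                 # the same cluster may be picked twice
        out.extend((new_id, row) for row in by_cluster[picked])
    return out

rng = random.Random(10101)                       # set the seed for reproducibility
rows = [(g, i) for g in range(4) for i in range(3)]   # (cluster id, within-cluster index)
draw = resample_rows(rows, rng)
cdraw = resample_clusters(rows, lambda r: r[0], rng)
print(draw)
print(cdraw)
```

The new identifier matters when a cluster is drawn more than once: the copies must be treated as distinct clusters by any estimator that uses the cluster identifier, for example one with cluster-specific fixed effects.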


12.7.2 The bsample command with simulate 


As an example, we use the bsample and simulate commands to reproduce 
the percentile-t Wald test given in section 12.5.2. We first define the 
program for one bootstrap replication. The bsample command without 
argument produces one resample of all variables, of size N and drawn with 
replacement, from the original sample of size N. 


The program returns a scalar, tstar, that equals $t^*$ in (12.3). Because 
we are not returning parameter estimates, we use an r-class program. We 
have 


* Program to do one bootstrap replication 
program onebootrep, rclass 

version 17 

drop _all 

use mus212bootdata 

bsample 

poisson docvis chronic, vce(robust) 

return scalar tstar = (_b[chronic]-$theta)/_se[chronic] 
end 


Note that robust standard errors are obtained here. The referenced global 
macro, theta, constructed below, is the estimated coefficient of chronic in 
the original sample. We could alternatively pass this as a program argument 
rather than use a global macro. The program returns tstar. 


We next obtain the original sample parameter estimate and use the 
simulate prefix to run the onebootrep program B times. We have 


. * Now do 999 bootstrap replications 
. qui use mus212bootdata, clear 


. qui poisson docvis chronic, vce(robust) 
. global theta = _b[chronic] 
. global setheta = _se[chronic] 


. simulate tstar=r(tstar), seed(10101) reps(999) nodots 
> saving(percentilet2, replace): onebootrep 


Command: onebootrep 
tstar: r(tstar) 
(file percentilet2.dta not found) 


percentilet2.dta has the 999 bootstrap values $t_1^*, \ldots, t_{999}^*$ that 
can then be used to calculate the bootstrap p-value. 


. * Analyze the results to get the p-value 
. use percentilet2, clear 
(simulate: onebootrep) 


. qui count if abs($theta/$setheta) < abs(tstar) 


. display "p-value = " r(N)/_N 
p-value = .14614615 


The p-value is 0.146, leading to nonrejection of $H_0: \beta_{chronic} = 0$ at 
the 0.05 level. This result is exactly the same as that in section 12.5.2 
obtained using the bootstrap command. 


Additional examples of the use of the bsample command are given in 
section 12.8. 


12.8 Alternative resampling schemes 


There are many ways to resample other than the nonparametric pairs and 
cluster-pairs bootstrap methods used by the Stata bootstrap commands. 
These other methods can be performed by using an approach similar to the 
one in section 12.7.2, with a program written to obtain one bootstrap 
resample and calculate the statistics of interest, and this program then called 
B times. We do so for several methods, bootstrapping regression model 
parameter estimates. 


The programs are easily adapted to bootstrapping other quantities, such 
as the t statistic to obtain asymptotic refinement. For asymptotic 
refinement, there is particular benefit in using methods that exploit more 
information about the DGP than is used by bootstrap pairs. This additional 
information includes holding x fixed through the bootstraps, called a 
design-based or model-based bootstrap; imposing conditions such as 
E(u|x) = 0 in the bootstrap; and for hypothesis tests, imposing the null 
hypothesis on the bootstrap resamples. See, for example, Horowitz (2001), 
MacKinnon (2002), and Cameron, Gelbach, and Miller (2008). 


12.8.1 Bootstrap pairs resampling scheme 


We begin with bootstrap pairs, repeating code similar to that in 
section 12.7.2. The following program obtains one bootstrap resample by 
resampling from the original data with replacement. 


* Program to resample using bootstrap pairs 
program bootpairs 

version 17 

drop _all 

use mus212bootdata 

bsample 

poisson docvis chronic 
end 


To check the program, we run it once. 


. * Check the program by running once 
. bootpairs 


(output omitted ) 


We then run the program 400 times. We have 


. * Bootstrap pairs for the parameters 
. simulate _b, seed(10101) reps(400) nodots: bootpairs 


      Command: bootpairs 


. summarize 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
docvis_b_c~c |        400    .9880522    .5386575  -.5826053   2.693661 
docvis_b_c~s |        400    .9602076    .3536507  -.0689929   1.747957 


The bootstrap estimate of the standard error of $\widehat{\beta}_{chronic}$ 
equals 0.5387, as in section 12.3.3. 
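The logic of the summarize step, that the bootstrap standard error is simply the standard deviation of the estimates across resamples, can be sketched in a few lines of Python. The example uses an OLS slope on simulated data rather than the Poisson regression above; the data, function names, and choice of estimator are assumptions for illustration only.

```python
import random
import statistics

def ols_slope(pairs):
    """Slope from OLS of y on (1, x) for a list of (x, y) pairs."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    return sum((x - xbar) * (y - ybar) for x, y in pairs) / sxx

def pairs_bootstrap_se(pairs, reps=400, seed=10101):
    """Standard deviation of the slope over resamples of (x, y) pairs."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        resample = [rng.choice(pairs) for _ in pairs]   # resample rows jointly
        draws.append(ols_slope(resample))
    return statistics.stdev(draws), draws

# Simulated sample of 50 observations (illustrative only).
data_rng = random.Random(1)
sample = []
for _ in range(50):
    xi = data_rng.gauss(0, 1)
    sample.append((xi, 1 + 2 * xi + data_rng.gauss(0, 1)))

se, draws = pairs_bootstrap_se(sample)
print(ols_slope(sample), statistics.mean(draws), se)
```

Resampling $x$ and $y$ jointly is what makes this a pairs bootstrap; the alternatives in the following subsections hold $x$ fixed and resample only the error or the outcome.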


12.8.2 Parametric bootstrap resampling scheme 


A parametric bootstrap is essentially a Monte Carlo simulation. Typically, 
we hold $x_i$ fixed at the sample values; replace $y_i$ by a random draw, 
$y_i^*$, from the density $f(y_i|x_i,\theta)$ with $\theta$ evaluated at the 
original sample estimate, $\widehat{\theta}$; and regress $y_i^*$ on $x_i$. A 
parametric bootstrap requires much stronger assumptions, namely correct 
specification of the conditional density of $y$ given $x$, than the paired or 
nonparametric bootstrap. 


To implement a parametric bootstrap, we adapt the preceding 
bootpairs program and replace the bsample command with code to 
randomly draw $y^*$ from $f(y|x,\widehat{\theta})$. 


For doctor visits, which are overdispersed count data, we use the 
negative binomial distribution rather than the Poisson. We first obtain the 
negative binomial parameter estimates, $\widehat{\theta}$, using the original 
sample. In this case, it is sufficient and simpler to obtain the fitted mean, 
$\widehat{\mu}_i = \exp(x_i'\widehat{\beta})$, and the dispersion parameter, 
$\widehat{\alpha}$. We have 


* Fit the model with original actual data and save estimates 
. use mus212bootdata 
. quietly nbreg docvis chronic 
. predict muhat 
. global alpha = e(alpha) 


We use these estimates to obtain draws of $y$ from the negative binomial 
distribution given $\widehat{\alpha}$ and $\widehat{\mu}_i$, using a 
Poisson-gamma mixture (explained in section 20.2.2). The 
rgamma(1/$\alpha$, $\alpha$) function draws a gamma variable, $\nu$, named nu 
with a mean of 1 and a variance of $\alpha$, and the rpoisson(nu*mu) function 
then generates negative binomial draws with a mean of $\mu$ and a variance of 
$\mu + \alpha\mu^2$. We have 


* Program for parametric bootstrap generating from negative binomial 
program bootparametric, eclass 

version 17 

capture drop nu dvhat 

generate nu = rgamma(1/$alpha, $alpha) 

generate dvhat = rpoisson(muhat*nu) 

nbreg dvhat chronic 
end 


We check the program by using the bootparametric command and then 
bootstrap 400 times. 


. * Parametric bootstrap for the parameters 
. simulate _b, seed(10101) reps(400) nodots: bootparametric 


      Command: bootparametric 


. summarize 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
dvhat_b_ch~c |        400    .9234596    .4640827  -.8095576   2.208154 
dvhat_b_cons |        400    .9980879    .2401054   .3483067   1.625967 
_eq2_b_lna~a |        400    .5019578    .2825218  -.4267736   1.298986 


Because we generate data from a negative binomial model and we fit a 
negative binomial model, the average of the 400 bootstrap coefficient 
estimates should be close to the DGP values. This is the case here. Also, the 
bootstrap standard errors are within 10% of those from the negative 
binomial estimation of the original model, not given here, suggesting that 
the negative binomial model may be a reasonable one for these data. 
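The Poisson-gamma mixture used in the bootparametric program can be checked by simulation. The Python sketch below (standard library only) draws a gamma variable with shape $1/\alpha$ and scale $\alpha$, then a Poisson count with mean $\mu\nu$, and compares the sample mean and variance with $\mu$ and $\mu + \alpha\mu^2$. The Poisson generator is a simple textbook method suited to small means, and the parameter values are arbitrary choices of ours.

```python
import math
import random
import statistics

def rpoisson(mean, rng):
    """Poisson draw by Knuth's multiplication method (adequate for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def rnegbin(mu, alpha, rng):
    """Negative binomial draw with mean mu and variance mu + alpha * mu^2."""
    nu = rng.gammavariate(1 / alpha, alpha)   # shape 1/alpha, scale alpha: mean 1, variance alpha
    return rpoisson(mu * nu, rng)

rng = random.Random(10101)
mu, alpha = 2.0, 0.5
draws = [rnegbin(mu, alpha, rng) for _ in range(20000)]
print(statistics.mean(draws), statistics.variance(draws))   # targets: 2 and 4
```

Setting $\alpha$ close to zero makes $\nu$ nearly constant at 1, so the draws approach the Poisson case with variance equal to the mean.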


12.8.3 Residual bootstrap resampling scheme 


For linear OLS regression, under the strong assumption that errors are 
independent and identically distributed (i.i.d.), an alternative to bootstrap 
pairs is a residual bootstrap. This holds $x_i$ fixed at the sample values and 
replaces $y_i$ with $y_i^* = x_i'\widehat{\beta} + \widehat{u}_i^*$, where 
$\widehat{u}_i^*$ are bootstrap draws from the original sample residuals 
$\widehat{u}_1, \ldots, \widehat{u}_N$. This bootstrap, sometimes called a 
design bootstrap, can lead to better performance of the bootstrap by holding 
regressors fixed. 


The bootpairs program is adapted by replacing bsample with code to 
randomly construct $\widehat{u}_i^*$ from $\widehat{u}_i$ for each observation 
$i$. This is not straightforward, because the bsample command is intended to 
bootstrap the entire dataset in memory, whereas here we wish to bootstrap the 
residuals but not the regressors. 


As illustration, we continue to use the docvis example, even though 
Poisson regression is more appropriate than OLS regression. The following 
code performs the residual bootstrap: 


* Residual bootstrap for OLS with i.i.d. errors 
use mus212bootdata, clear 
quietly regress docvis chronic 
predict uhat, resid 
keep uhat 
save residuals, replace                // OLS residuals from the original sample 
program bootresidual 
    version 17 
    drop _all 
    use residuals 
    bsample                            // resample the residuals only 
    merge 1:1 _n using mus212bootdata  // attach the original, fixed regressors 
    regress docvis chronic 
    predict xb 
    generate ystar = xb + uhat         // fitted value plus resampled residual 
    regress ystar chronic 
end 


We check the program by using the bootresidual command and then 
bootstrap 400 times. 


. * Residual bootstrap for the parameters 
. simulate _b, seed(10101) reps(400) nodots: bootresidual 


Command: bootresidual 


. summarize 
Variable Obs Mean Std. dev. Min Max 
_b_chronic 400 4.802995 2.296399 -.9681436 12.29839 
_b_cons 400 2.766325 1.20739 . 2407407 6.493056 


The output reports the average of the 400 slope coefficient estimates 
(4.803), close to the original sample OLS slope coefficient estimate, not 
reported, of 4.694. The bootstrap estimate of the standard error (2.30) is 
close to the original sample OLS default estimate, not given, of 2.39. This is 
expected because the residual bootstrap assumes that errors are i.i.d. 


12.8.4 Wild bootstrap resampling scheme 


The wild bootstrap was presented in section 12.6 and implemented using 
the community-contributed boottest command. Here we provide a simple 
example, without asymptotic refinement. 


For linear regression with independent observations, a wild bootstrap 
accommodates the more realistic assumption that errors are independent but 
not identically distributed, permitting heteroskedasticity. This holds $x_i$ 
fixed at the sample values and replaces $y_i$ with 
$y_i^* = x_i'\widehat{\beta} + \widehat{u}_i^*$, where there are various ways to 
resample $\widehat{u}_i^*$. Here we use Mammen weights that set 
$\widehat{u}_i^* = a_i\widehat{u}_i$ and $a_i = (1 - \sqrt{5})/2 \simeq -0.618$ 
with the probability $(1 + \sqrt{5})/(2\sqrt{5}) \simeq 0.7236$ and 
$a_i = 1 - (1 - \sqrt{5})/2 \simeq 1.618$ with the probability 
$1 - (1 + \sqrt{5})/(2\sqrt{5})$. For each observation, $\widehat{u}_i^*$ takes 
only two possible values, but across all $N$ observations, there are $2^N$ 
possible resamples if the $N$ values of $\widehat{u}_i$ are distinct. 


The preceding bootresidual program is adapted by replacing bsample 
with code to randomly draw $\widehat{u}_i^*$ from $\widehat{u}_i$ and then form 
$y_i^* = x_i'\widehat{\beta} + \widehat{u}_i^*$. Equivalently, the Stata code 
is the same as that in section 12.8.1, except that the bsample command in the 
bootpairs program is replaced with this code. 


* Wild bootstrap for OLS with heteroskedastic errors 
use mus212bootdata, clear 
program bootwild 
    version 17 
    drop _all 
    use mus212bootdata 
    regress docvis chronic 
    predict xb 
    predict u, resid 
    * Mammen weights: -0.618 with probability 0.7236, otherwise 1.618 
    gen ustar = -0.618034*u 
    replace ustar = 1.618034*u if runiform() > 0.723607 
    gen ystar = xb + ustar 
    regress ystar chronic 
end 


We check the program by issuing the bootwild command and bootstrap 
400 times. 


. * Wild bootstrap for the parameters 
. simulate _b, seed(10101) reps(400) nodots: bootwild 


Command: bootwild 


. summarize 
Variable Obs Mean Std. dev. Min Max 
_b_chronic 400 4.779422 2.986341 -2.205521 11.53673 
_b_cons 400 2.807026 .957078 1.123328 5.709338 


The wild bootstrap permits heteroskedastic errors and yields bootstrap 
estimates of the standard errors (2.99) that are close to the original sample 
OLS heteroskedasticity-robust estimates, not given, of 3.06. These standard 
errors are considerably higher than those obtained by using the residual 
bootstrap, which is clearly inappropriate in this example because of the 
inherent heteroskedasticity of count data. 
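The contrast between the residual and wild bootstraps can be illustrated on simulated data with known heteroskedasticity. In the Python sketch below (standard library only), both schemes hold $x$ fixed and differ only in how bootstrap errors are formed. The data-generating process, the function names, and the analytic benchmarks (computed without degrees-of-freedom corrections) are assumptions made for this illustration.

```python
import random
import statistics

def ols(x, y):
    """Intercept, slope, and residuals from OLS of y on (1, x)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    return a, b, [yi - a - b * xi for xi, yi in zip(x, y)]

def bootstrap_se(x, y, draw_errors, reps=400, seed=10101):
    """Standard deviation of the slope across resamples that hold x fixed."""
    rng = random.Random(seed)
    a, b, u = ols(x, y)
    slopes = []
    for _ in range(reps):
        ustar = draw_errors(u, rng)
        ystar = [a + b * xi + e for xi, e in zip(x, ustar)]
        slopes.append(ols(x, ystar)[1])
    return statistics.stdev(slopes)

def residual_draw(u, rng):
    # i.i.d. scheme: any residual can be attached to any observation
    return [rng.choice(u) for _ in u]

A_LOW, A_HIGH, P_LOW = -0.618034, 1.618034, 0.723607    # Mammen weights

def wild_draw(u, rng):
    # heteroskedastic scheme: residual i stays with observation i, rescaled
    return [(A_LOW if rng.random() < P_LOW else A_HIGH) * ui for ui in u]

# Simulated heteroskedastic data: the error standard deviation grows with |x|.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(200)]
y = [1 + 2 * xi + abs(xi) * data_rng.gauss(0, 1) for xi in x]

se_resid = bootstrap_se(x, y, residual_draw)
se_wild = bootstrap_se(x, y, wild_draw)

# Analytic benchmarks: i.i.d.-error formula and heteroskedasticity-robust (HC0).
_, _, u = ols(x, y)
xbar = sum(x) / len(x)
sxx = sum((xi - xbar) ** 2 for xi in x)
se_iid = (sum(ui ** 2 for ui in u) / len(u) / sxx) ** 0.5
se_hc0 = sum(((xi - xbar) * ui) ** 2 for xi, ui in zip(x, u)) ** 0.5 / sxx
print(se_resid, se_iid, se_wild, se_hc0)
```

By construction, the residual bootstrap targets the i.i.d.-error standard error and the wild bootstrap targets the heteroskedasticity-robust one, so with this design the wild bootstrap standard error should typically be the larger of the two.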


The percentile-t method with the wild bootstrap provides asymptotic 
refinement to Wald tests and confidence intervals in the linear model with 
heteroskedastic errors or clustered errors; see section 12.6. 


12.8.5 Subsampling 


The bootstrap fails in some settings, such as a nonsmooth estimator. Then a 
theoretically more robust resampling method is subsampling, which draws a 
resample that is considerably smaller than the original sample. Politis, 
Romano, and Wolf (1999) provide an introduction to this method. 


The method is easily implemented using the bsample command. For 
example, to perform subsampling where the resamples have one-third as 
many observations as the original sample, replace the bsample command in 
the bootpairs program with bsample int(_N/3), where the int() function 
truncates to an integer toward zero. 


We caution against use of this method, however, because in practice it is 
very sensitive to the subsample size chosen. 


12.9 The jackknife 


The delete-one jackknife is a resampling scheme that forms $N$ resamples of 
size $(N - 1)$ by sequentially deleting each observation and then estimating 
$\theta$ in each resample. 


12.9.1 Jackknife method 


Let $\widehat{\theta}_i$ denote the parameter estimate from the sample with the 
$i$th observation deleted, $i = 1, \ldots, N$, let $\widehat{\theta}$ be the 
original sample estimate of $\theta$, and let 
$\bar{\theta} = N^{-1}\sum_{i=1}^{N}\widehat{\theta}_i$ denote the average of the 
$N$ jackknife estimates. 


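The jackknife has several uses. The BC jackknife estimate of $\theta$ equals 
$N\widehat{\theta} - (N - 1)\bar{\theta} = (1/N)\sum_{i=1}^{N}\{N\widehat{\theta} - (N - 1)\widehat{\theta}_i\}$. 
The variance of the $N$ pseudovalues 
$\widehat{\theta}_i^* = N\widehat{\theta} - (N - 1)\widehat{\theta}_i$ can be used 
to estimate $\mathrm{Var}(\widehat{\theta})$. The BCa method for a bootstrap with 
asymptotic refinement also uses the jackknife. 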


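There are two variants of the jackknife estimate of the VCE. The Stata 
default is 

$$\widehat{V}_{\mathrm{Jack}}(\widehat{\theta}) = \frac{N-1}{N}\sum_{i=1}^{N}(\widehat{\theta}_i - \bar{\theta})(\widehat{\theta}_i - \bar{\theta})'$$

and the mse option gives the variation 

$$\widehat{V}_{\mathrm{Jack}}(\widehat{\theta}) = \frac{N-1}{N}\sum_{i=1}^{N}(\widehat{\theta}_i - \widehat{\theta})(\widehat{\theta}_i - \widehat{\theta})'$$

The method entails $N$ resamples, which requires considerable computation 
when $N$ is large. The resamples are not random draws, so there is no seed to 
set. 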
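A minimal Python sketch of the delete-one jackknife for a scalar statistic follows. It assumes the two variance variants are $(N-1)/N$ times the sum of squared deviations of the delete-one estimates, taken about their average (the default) or about the full-sample estimate (the mse variant); the function name and the toy data are ours.

```python
import statistics

def jackknife(data, estimator):
    """Bias-corrected estimate and two jackknife standard errors for a scalar."""
    n = len(data)
    theta_hat = estimator(data)
    # Delete-one estimates: drop observation i and re-estimate.
    theta_i = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    theta_bar = sum(theta_i) / n
    bias_corrected = n * theta_hat - (n - 1) * theta_bar
    var_default = (n - 1) / n * sum((t - theta_bar) ** 2 for t in theta_i)
    var_mse = (n - 1) / n * sum((t - theta_hat) ** 2 for t in theta_i)
    return bias_corrected, var_default ** 0.5, var_mse ** 0.5

data = [2.0, 0.0, 5.0, 1.0, 9.0, 3.0, 4.0, 1.0]

# For the sample mean, the jackknife standard error equals s / sqrt(N).
bc_mean, se_mean, se_mean_mse = jackknife(data, statistics.mean)

# For the plug-in variance (divisor N), bias correction gives the divisor N - 1 version.
bc_var, _, _ = jackknife(data, statistics.pvariance)
print(bc_mean, se_mean, bc_var)
```

The two standard errors coincide whenever the average of the delete-one estimates equals the full-sample estimate, as it does for the sample mean; otherwise the mse variant is the larger of the two.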


The use of the jackknife for estimation of the VCE has been largely 
superseded by the bootstrap. In recent research, Cattaneo, Jansson, and 
Ma (2019) find that the jackknife can reduce bias in two-step estimators with 
many covariates at the first step and propose subsequent inference based on 
a percentile-t bootstrap of the jackknife-based BC estimate. 


12.9.2 The vce(jackknife) option and the jackknife prefix 


For many estimation commands, the vce(jackknife) option can be used to 
obtain the jackknife estimate of the VCE. For example, 


. * Jackknife estimate of standard errors 
. qui use mus212bootdata, clear 


. poisson docvis chronic, vce(jackknife, mse nodots) 


Poisson regression                                   Number of obs =     50 
                                                     Replications  =     50 
                                                     F(1, 49)      =   2.50 
                                                     Prob > F      = 0.1205 
Log likelihood = -238.75384                          Pseudo R2     = 0.0917 

             |              Jknife * 
      docvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval] 
     chronic |   .9833014   .6222999     1.58   0.121    -.2672571     2.23386 
       _cons |   1.031602   .3921051     2.63   0.011     .2436369    1.819566 


The jackknife estimate of the standard error of the coefficient of chronic is 
0.62, larger than the value 0.54 obtained by using the 
vce(bootstrap, reps(2000)) option and the value 0.52 obtained by using the 
vce(robust) option; see the poisson example in section 12.3.4. 


The jackknife prefix operates similarly to bootstrap. 


12.10 Additional resources 


For many purposes, the vce(bootstrap) option of an estimation command 
suffices (see [R] vce_option), possibly followed by estat bootstrap. For 
more advanced analysis, the bootstrap prefix and the bsample command 
can be used. 


For applications that use more elaborate methods than those 
implemented with the vce(bootstrap) option, care is needed, and a good 
understanding of the bootstrap is recommended. References include Efron 
Davidson and MacKinnon (2004, chap. 4), Cameron and Trivedi (2005, 
chap. 11), and Hansen (2022a, chap. 10). For bootstrap with asymptotic 
refinement using the wild bootstrap, see section 12.6 for references and 
description of the community-contributed boottest command. 


12.11 Exercises 


1. Use the same data as those created in section 12.3.3, except keep the 
first 100 observations and keep the variables educ and age. After a 
Poisson regression of docvis on an intercept and educ, give default 
standard errors, robust standard errors, and bootstrap standard errors 
based on 1,000 bootstraps and a seed of 10101. 

2. For the Poisson regression in exercise 1, obtain the following 95% 
confidence intervals: normal-based, percentile, BC, and BCa. Compare 
these. Which, if any, is best? 

3. Obtain a bootstrap estimate of the standard deviation of the estimated 
standard deviation of docvis. 

4. Continuing with the regression in exercise 1, obtain a bootstrap
estimate of the standard deviation of the robust standard error of the
estimated coefficient of educ.

5. Continuing with the regression in exercise 1, use the percentile-t
method to perform a Wald test with asymptotic refinement of
$H_0\colon \beta = 0$ against $H_a\colon \beta \neq 0$ at the 0.05 level, and obtain a
percentile-t 95% confidence interval.

6. Use the data of section 12.3.3 with 50 observations. Run the
command given at the end of this exercise. Use the data in
percentile.dta to obtain the following for the coefficient of the
chronic variable: 1) bootstrap standard error; 2) bootstrap estimate of
bias; 3) normal-based 95% confidence interval; and 4) percentile-t
95% confidence interval. For the last, you can use the centile
command. Compare your results with those obtained from estat
bootstrap, all after a Poisson regression with the vce(bootstrap)
option.


bootstrap bstar=_b[chronic], reps(999) seed(10101) nodots /// 
saving(percentile, replace): poisson docvis chronic 
use percentile, clear 


7. Continuing from the previous exercise, does the bootstrap estimate of 
the distribution of the coefficient of chronic appear to be normal? Use 
the summarize and kdensity commands. 

8. Repeat the percentile-t bootstrap at the start of section 12.5.2. Use
kdensity to plot the bootstrap Wald statistics. Repeat for an estimation
by poisson with default standard errors, rather than nbreg. Comment
on any differences.


Chapter 13 
Nonlinear regression methods 


13.1 Introduction 


We now turn to nonlinear regression methods. In this chapter, we consider 
single-equation models fit using cross-sectional data with all regressors 
exogenous. 


Compared with linear regression, nonlinear regression presents two 
complications. There is no explicit solution for the estimator, so computation 
of the estimator requires iterative numerical methods. And, unlike a linear 
model, the marginal effect (ME) of a change in a regressor is no longer 
simply the associated slope parameter. For standard nonlinear models, the 
first complication is easily handled. Simply changing the command from 
regress y x to poisson y x, for example, leads to nonlinear estimation and
regression output that looks essentially the same as the output from regress. 
The second complication can often be dealt with by obtaining MEs using the 
margins command. 
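The two points above can be sketched as follows. The example assumes that the doctor-visits data described in section 13.2 are in memory; the choice of a single regressor is for illustration only.

```stata
* Same command syntax for linear and nonlinear estimation (illustrative sketch)
* Assumes the doctor-visits data of section 13.2 are in memory
regress docvis income, vce(robust)    // slope coefficient is the ME
poisson docvis income, vce(robust)    // slope coefficient is no longer the ME
margins, dydx(income)                 // average ME of income after poisson
```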


In this chapter, we provide an overview of Stata’s nonlinear estimation 
commands and subsequent calculation of standard errors, prediction, and 
computation of MEs. The discussion is applicable for analysis after any Stata 
estimation command, including the commands listed in table 13.1. A 
complete list of estimation commands can be found using command help 
estimation commands, while help contents stat groups estimation 
commands by category. 


Table 13.1. Some estimation commands for linear and nonlinear cross-sectional
models

Model type                Estimation command
------------------------------------------------------------------------------
Linear                    regress, cnsreg, areg, hetregress, ivregress, sem,
                          eregress, etregress, qreg, boxcox, mvreg, sureg,
                          reg3, mixed, xtreg, xtgls, xtrc, xtpcse, xtregar,
                          xtivreg, xthtaylor, xtabond

Nonlinear least squares   nl, nlsur, menl

Binary                    logit, logistic, probit, cloglog, hetprobit,
                          ivprobit, heckprobit, eprobit, eteffects, melogit,
                          meprobit, mecloglog, xtlogit, xtprobit, xtcloglog,
                          xteprobit, biprobit

Multinomial               mlogit, clogit, cmclogit, nlogit, cmmixlogit,
                          mprobit, cmmprobit, mecloglog, slogit,
                          cmxtmixlogit, xtlogit, xtprobit, xtmlogit

Ordinal                   ologit, oprobit, hetoprobit, heckoprobit, eoprobit,
                          meologit, meoprobit, cmroprobit, cmrologit

Censored normal           tobit, intreg, truncreg, ivtobit, eintreg, metobit,
                          meintreg, xttobit, xtintreg, xtfrontier

Selection normal          etregress, eintreg, heckman, xteregress, xteintreg,
                          xtheckman

Durations                 stcox, stintcox, stcrreg, streg, stintreg, mestreg,
                          xtstreg

Counts                    poisson, nbreg, gnbreg, cpoisson, tpoisson,
                          heckpoisson, tnbreg, zip, zinb, ivpoisson,
                          etpoisson, mepoisson, menbreg, xtpoisson, xtnbreg

Other nonlinear           glm, gmm, meglm, gsem, xtgee, fmm, bayes
------------------------------------------------------------------------------


The discussion of model-specific issues, particularly specification tests
that are an integral part of the modeling cycle of estimation, specification
testing, and reestimation, is presented in chapter 11 and in the model-specific
chapters 17-22. Chapter 16 presents methods used to fit a nonlinear model,
including methods used when no Stata command is available for that model.


13.2 Nonlinear example: Doctor visits 


As a nonlinear estimation example, we consider Poisson regression to model 
count data on the number of doctor visits. There is no need to first read 
chapter 20 on count data because we provide any necessary background 
here. 


Although the outcome is discrete, the only difference this makes is in the 
choice of log density. The poisson command is actually not restricted to 
counts and can be applied to any variable $y \geq 0$. The points made with the
count-data example could equally well be made with, for example, logit or 
probit models for binary outcomes or the Weibull model for duration data. 


13.2.1 Data description 


We use the dataset from the 2002 Medical Expenditure Panel Survey, first 
used in chapter 10. We model the number of office-based physician visits 
(docvis) by persons in the United States aged 25-64 years. The sample
excludes those receiving public insurance (Medicare and Medicaid) and is 
restricted to those working in the private sector but who are not self- 
employed. 


The regressors used here are restricted to health insurance status 
(private), health status (chronic), and socioeconomic characteristics 
(female and income) to keep Stata output short. We have 


. * Read in dataset, select one year of data, and describe key variables
. qui use mus210mepsdocvisyoung

. keep if year02==1
(25,712 observations deleted)

. describe docvis private chronic female income

Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
docvis          int     %8.0g                 Number of doctor visits
private         byte    %8.0g                 Private insurance
chronic         byte    %8.0g                 Chronic condition
female          byte    %8.0g                 Female
income          float   %9.0g                 Income in $ / 1000


We then summarize the data: 


. * Summary of key variables
. summarize docvis private chronic female income

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      docvis |      4,412    3.957389    7.947601          0        134
     private |      4,412    .7853581    .4106202          0          1
     chronic |      4,412    .3263826    .4689423          0          1
      female |      4,412    .4718948    .4992661          0          1
      income |      4,412    34.34018    29.03987    -49.999    280.777


The dependent variable is a nonnegative integer count, here ranging from 0 
to 134. We see 33% of the sample have a chronic condition and 47% are 
female. We use the whole sample, including the three people who have 
negative income (obtained by using the tabulate income command). 


The relative frequencies of docvis, obtained by using the tabulate 
docvis command, are 36%, 16%, 10%, 7%, and 5% for, respectively, 0, 1, 2, 
3, and 4 visits, and 26% of the sample have 5 or more visits. 


13.2.2 Poisson model description 


The Poisson regression model specifies the count y to have a conditional
mean of the exponential form

$$E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta}) \tag{13.1}$$

This ensures that the conditional mean is positive, which should be the case
for any random variable that is restricted to be nonnegative. However, the
key ME $\partial E(y|\mathbf{x})/\partial x_j = \beta_j \exp(\mathbf{x}'\boldsymbol{\beta})$ now depends on both the parameter
estimate $\widehat{\beta}_j$ and the particular value of $\mathbf{x}$ at which the ME is evaluated; see
section 13.7.


The starting point for count analysis is the Poisson distribution, with the
probability mass function $f(y|\mathbf{x}) = e^{-\mu}\mu^{y}/y!$. Substituting in
$\mu_i = \exp(\mathbf{x}_i'\boldsymbol{\beta})$ from (13.1) gives the conditional density for the ith
observation. This in turn gives the log-likelihood function
$Q(\boldsymbol{\beta}) = \sum_{i=1}^{N} \{-\exp(\mathbf{x}_i'\boldsymbol{\beta}) + y_i\mathbf{x}_i'\boldsymbol{\beta} - \ln y_i!\}$, which is maximized by the
maximum likelihood estimator (MLE). The Poisson MLE solves the associated
first-order conditions that can be shown to be

$$\sum_{i=1}^{N} \{y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\}\,\mathbf{x}_i = \mathbf{0} \tag{13.2}$$

Equation (13.2) has no explicit solution for $\boldsymbol{\beta}$. Instead, $\widehat{\boldsymbol{\beta}}$ is obtained
numerically by using methods explained in section 16.2.
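The step from the Poisson log likelihood to (13.2) is short. As a sketch, differentiating each term of the log likelihood with respect to $\boldsymbol{\beta}$ gives

```latex
\begin{align*}
\frac{\partial Q(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}
  &= \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\beta}}
     \left\{ -\exp(\mathbf{x}_i'\boldsymbol{\beta})
             + y_i \mathbf{x}_i'\boldsymbol{\beta} - \ln y_i! \right\} \\
  &= \sum_{i=1}^{N} \left\{ -\exp(\mathbf{x}_i'\boldsymbol{\beta})\,\mathbf{x}_i
             + y_i \mathbf{x}_i \right\} \\
  &= \sum_{i=1}^{N} \left\{ y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta}) \right\} \mathbf{x}_i
\end{align*}
```

Setting this derivative to zero yields (13.2). The term $\ln y_i!$ does not involve $\boldsymbol{\beta}$, so it drops out of the first-order conditions.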


What if the Poisson distribution is the wrong distribution for modeling 
doctor visits? In general, the MLE is inconsistent if the density is 
misspecified. However, the Poisson MLE requires only the much weaker 
condition that the conditional mean function given in (13.1) be correctly 
specified, because then the left-hand side of (13.2) has an expected value of 
zero. Under this weaker condition, robust standard errors rather than default 
maximum likelihood (ML) standard errors should be used; see section 13.4.5. 


13.3 Nonlinear regression methods 


We consider four classes of estimators: ML, nonlinear least squares (NLS), 
generalized linear models (GLM), and generalized method of moments 
(GMM). 


The first three are examples of m-estimators that maximize (or
minimize) an objective function of the form

$$Q(\boldsymbol{\theta}) = \sum_{i=1}^{N} q(y_i, \mathbf{x}_i, \boldsymbol{\theta}) \tag{13.3}$$

where y denotes the dependent variable, $\mathbf{x}$ denotes regressors (assumed
exogenous), $\boldsymbol{\theta}$ denotes a parameter vector, and $q(\cdot)$ is a specified scalar
function that varies with the model and estimator. In the Poisson case,
$\boldsymbol{\beta} = \boldsymbol{\theta}$; more generally, $\boldsymbol{\beta}$ is a component of $\boldsymbol{\theta}$. Separate treatment of GMM is given
in section 13.3.9.


13.3.1 MLE and quasi-MLE and robust standard errors 


MLEs maximize the log-likelihood function. For N independent observations,
the MLE $\widehat{\boldsymbol{\theta}}$ maximizes

$$Q(\boldsymbol{\theta}) = \sum_{i=1}^{N} \ln f(y_i|\mathbf{x}_i, \boldsymbol{\theta})$$

where $f(y|\mathbf{x}, \boldsymbol{\theta})$ is the conditional density, for continuous y, or the
conditional probability mass function, for discrete y.


If the density $f(y|\mathbf{x}, \boldsymbol{\theta})$ is correctly specified, then the MLE is the best
estimator to use. It is consistent for $\boldsymbol{\theta}$, it is asymptotically normally
distributed, and it is fully efficient, meaning that no other estimator of $\boldsymbol{\theta}$ has
a smaller asymptotic variance-covariance matrix of the estimator (VCE).


Of course, the true density is unknown. If $f(y|\mathbf{x}, \boldsymbol{\theta})$ is incorrectly
specified, then in general the MLE is inconsistent. It may then be better to use 
other methods that, while not as efficient as the MLE, are consistent under 
weaker assumptions than those necessary for the MLE to be consistent. 


The MLE remains consistent even if the density is misspecified, however, 
provided that 1) the specified density is in the linear exponential family (LEF) 
and 2) the functional form for the conditional mean $E(y|\mathbf{x})$ is correctly
specified. The default estimate of the VCE of the MLE is then no longer
correct, so we base the inference on a robust estimate of the VCE. Examples
of the LEF are Poisson and negative binomial (with a known dispersion 
parameter) for count data, Bernoulli for binary data (including logit and 
probit), one-parameter gamma for duration data (including exponential), 
normal (with a known variance parameter) for continuous data, and the 
inverse Gaussian. 


It follows that consistency of the MLE for many standard models, notably 
the normal, Poisson, logit, probit, exponential, and gamma, requires only 
that the conditional mean function be correctly specified. For most other 
models, however, any distributional misspecification leads to inconsistency 
of the MLE. 


Furthermore, when the density is misspecified, default standard errors 
are incorrect, and robust standard errors should be used, regardless of 
whether the estimator remains consistent; see White (1982). 


The term quasi-MLE, or pseudo-MLE, is used when estimation is by ML, 
but subsequent inference is done without assuming that the density is 
correctly specified. Throughout this book, we use robust standard errors, 
unless there is reason not to do so. 
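The practical difference between ML and quasi-ML inference can be seen by fitting the same model twice. The following is a sketch only; it assumes the doctor-visits data of section 13.2 are in memory, and the stored-estimate names ml_default and ml_robust are arbitrary labels chosen here.

```stata
* Same Poisson point estimates, default versus robust standard errors (sketch)
quietly poisson docvis private chronic female income
estimates store ml_default
quietly poisson docvis private chronic female income, vce(robust)
estimates store ml_robust
* Coefficients are identical across columns; only the standard errors differ
estimates table ml_default ml_robust, b(%9.4f) se(%9.4f)
```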


The mlexp command (see section 13.3.3) and more general optimization
commands presented in chapter 16 enable ML estimation for user-defined
likelihood functions. For commonly used models, this is not necessary,
however, because specific Stata commands have been developed for specific
models.


13.3.2 The poisson command 


For Poisson, the ML estimator is obtained by using the poisson command. 
The syntax of the command is 


poisson depvar [indepvars] [if] [in] [weight] [, options]


This syntax is the same as that for regress. The only relevant option for our
analysis here is the vce() option for the type of estimate of the VCE.


The poisson command with the vce(robust) option yields the
following results for the doctor-visits data. As already noted, to restrict Stata 
output, we use far fewer regressors than should be used to model doctor 
visits. 


. * Poisson regression (command poisson)
. poisson docvis private chronic female income, vce(robust)

Iteration 0:  log pseudolikelihood = -18504.413
Iteration 1:  log pseudolikelihood = -18503.549
Iteration 2:  log pseudolikelihood = -18503.549

Poisson regression                                    Number of obs =   4,412
                                                      Wald chi2(4)  =  594.72
                                                      Prob > chi2   =  0.0000
Log pseudolikelihood = -18503.549                     Pseudo R2     =  0.1930

------------------------------------------------------------------------------
             |               Robust
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .7986652   .1090014     7.33   0.000     .5850263    1.012304
     chronic |   1.091865   .0559951    19.50   0.000     .9821167    1.201614
      female |   .4925481   .0585365     8.41   0.000     .3778187    .6072774
      income |    .003557   .0010825     3.29   0.001     .0014354    .0056787
       _cons |  -.2297262   .1108732    -2.07   0.038    -.4470338   -.0124186
------------------------------------------------------------------------------


The output begins with an iteration log because the estimator is obtained
numerically by using an iterative procedure presented in sections 16.2
and 16.3. In this case, only two iterations are needed. Each iteration
increases the log-likelihood function, as desired, and iterations cease when
there is little change in the log-likelihood function. The term
“log pseudolikelihood” is used rather than “log likelihood” because use of
vce(robust) means that we no longer are maintaining that the data be
exactly Poisson distributed. The remaining output from poisson is
remarkably similar to that for regress.


Here the four regressors are jointly statistically significant at 5% because
the Wald chi2(4) test statistic has p = 0.00 < 0.05. The pseudo-$R^2$ is
discussed in section 13.8.1. There is no analysis-of-variance table, because 
this table is appropriate only for linear least squares with independent 
homoskedastic errors. 


The remaining output indicates that all regressors are individually 
statistically significant at a level of 0.05 because all p-values are less than 
0.05. For each regressor, the output presents the following in turn: 


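Coefficients                $\widehat{\beta}_j$
Standard errors             $s_{\widehat{\beta}_j}$
z statistics                $z_j = \widehat{\beta}_j / s_{\widehat{\beta}_j}$
p-values                    $p_j = \Pr\{|z| > |z_j| \mid z \sim N(0,1)\}$
95% confidence intervals    $\widehat{\beta}_j \pm 1.96 \times s_{\widehat{\beta}_j}$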


The z statistics and p-values are computed by using the standard normal
distribution, rather than the t distribution with $N - K$ degrees of freedom.
The p-values are for a two-sided test of whether $\beta_j = 0$. For a one-sided test
of $H_0\colon \beta_j \leq 0$ against $\beta_j > 0$, the p-value is half of that reported in the
table, provided that $z_j > 0$. For a one-sided test of $H_0\colon \beta_j \geq 0$ against
$\beta_j < 0$, the p-value is half of that reported in the table, provided that $z_j < 0$.
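The halving rule for one-sided tests can be checked directly from stored results. The following sketch assumes the doctor-visits data are in memory; the scalar names zstat, p_two, and p_upper are hypothetical labels chosen here.

```stata
* Two-sided and upper one-sided p-values for the income coefficient (sketch)
quietly poisson docvis private chronic female income, vce(robust)
scalar zstat   = _b[income]/_se[income]
scalar p_two   = 2*(1 - normal(abs(zstat)))   // H0: beta = 0 against beta != 0
scalar p_upper = 1 - normal(zstat)            // H0: beta <= 0 against beta > 0
display "z = " %6.2f zstat "   two-sided p = " %6.4f p_two ///
    "   one-sided p = " %6.4f p_upper
```

When zstat is positive, p_upper equals half of p_two, as stated above; when zstat is negative, p_upper exceeds 0.5 and the halving rule does not apply to this alternative.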


A nonlinear model raises a new issue of interpretation of the slope
coefficients $\beta_j$. For example, what does the value 0.0036 for the coefficient
of income mean? Given the exponential functional form for the conditional
mean in (13.1), it means that a $1,000 increase in income (a one-unit
increase in income) leads to a 0.0036 proportionate increase, or a 0.36%
increase, in doctor visits. We address this important issue in detail in
section 13.7.
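The proportionate-change interpretation follows directly from the exponential conditional mean in (13.1). As a sketch of the reasoning, holding the other regressors fixed,

```latex
\begin{align*}
\frac{\partial E(y|\mathbf{x})/\partial x_j}{E(y|\mathbf{x})}
  &= \frac{\beta_j \exp(\mathbf{x}'\boldsymbol{\beta})}{\exp(\mathbf{x}'\boldsymbol{\beta})}
   = \beta_j , \\
\frac{E(y|x_j + 1)}{E(y|x_j)} - 1
  &= \exp(\beta_j) - 1 \approx \beta_j \quad \text{for small } \beta_j .
\end{align*}
```

For income, $\exp(0.003557) - 1 \approx 0.00356$, so the exact one-unit change and the derivative-based approximation agree to the precision reported in the text.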


Note that test statistics following nonlinear estimation commands such as 
poisson are based on the standard normal distribution and chi-squared 
distributions, whereas those following some linear estimation commands, 
notably the regress and xtreg, fe commands, use the t and F
distributions. For independent observations, this makes little difference for 
larger samples, say, N > 100. For clustered observations with few clusters, 
this can make a big difference, even if there are many observations; see 
section 6.4.6. The postestimation command test with option df() uses the 
F distribution. 


13.3.3 The mlexp command 


Stata provides several optimization commands that enable ML estimation for 
a user-provided parametric model; see chapter 16. The simplest command is 
the mlexp command, suitable for independent observations with parameters 
entering through indexes such as $\mathbf{x}'\boldsymbol{\beta}$.


The command syntax is


mlexp (lexp) [if] [in] [weight] [, options]


where lexp is a substitutable expression (see section 13.3.6) for the log
density of a single observation. The only relevant option for our analysis
here is the vce() option for the type of estimate of the VCE.


The challenge is in defining the expression for the log density; [R] mlexp
provides many examples. For the Poisson model, the log density is
$\ln f(y_i|\mathbf{x}_i) = -\exp(\mathbf{x}_i'\boldsymbol{\beta}) + y_i\mathbf{x}_i'\boldsymbol{\beta} - \ln y_i!$. An explicit definition for our
example is the command


. * ML estimation (command mlexp) for Poisson model
. mlexp (-exp({xb: private chronic female income _cons}) + docvis*{xb:} -
> lnfactorial(docvis)), vce(robust)

initial:       log pseudolikelihood = -33899.609
alternative:   log pseudolikelihood = -28031.767
rescale:       log pseudolikelihood = -24020.669
Iteration 0:   log pseudolikelihood = -24020.669
Iteration 1:   log pseudolikelihood = -23995.423
Iteration 2:   log pseudolikelihood = -18539.168
Iteration 3:   log pseudolikelihood = -18503.596
Iteration 4:   log pseudolikelihood = -18503.549
Iteration 5:   log pseudolikelihood = -18503.549

Maximum likelihood estimation

Log pseudolikelihood = -18503.549                     Number of obs =   4,412

------------------------------------------------------------------------------
             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
xb           |
     private |   .7986654   .1090015     7.33   0.000     .5850265    1.012304
     chronic |   1.091865   .0559951    19.50   0.000     .9821167    1.201614
      female |   .4925481   .0585365     8.41   0.000     .3778187    .6072775
      income |    .003557   .0010825     3.29   0.001     .0014354    .0056787
       _cons |  -.2297264   .1108733    -2.07   0.038    -.4470339   -.0124188
------------------------------------------------------------------------------


The results are the same as those for the poisson command. 


Compared with the more general ml command and with the functions
optimize() and moptimize(), the mlexp command has the advantage of
supporting most postestimation commands such as the margins command.


13.3.4 Postestimation commands 


The ereturn list command details the estimation results that are stored in
e(); see section 1.6.2. These include regression coefficients in e(b) and the
estimated VCE in e(V).
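As a brief sketch, assuming the doctor-visits data are in memory, the stored results can be inspected as follows.

```stata
* Inspect stored estimation results after poisson (sketch)
quietly poisson docvis private chronic female income, vce(robust)
ereturn list              // scalars, macros, and matrices stored in e()
matrix list e(b)          // row vector of coefficient estimates
matrix list e(V)          // estimated VCE
display e(N)              // number of observations in the estimation sample
```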


Standard postestimation commands available after most estimators are 
predict, predictnl, and margins for prediction and MEs (this chapter);
test, testnl, lincom, and nlcom for Wald tests and confidence intervals; 
linktest for a model-specification test (section 11.3); and estimates for 
storing results (section 3.5.6). 


The estat vce command displays the estimate of the VCE, and the 
correlation option displays the correlations for this matrix. The estat 
summarize command summarizes the current estimation sample. The estat 
ic command obtains information criteria (section 13.8.2). More task-specific 
commands, usually beginning with estat, are available for model- 
specification testing. 
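The estat subcommands just described can be run in sequence after any fit. This sketch assumes the doctor-visits data are in memory and uses the default VCE so that the information criteria are based on the Poisson log likelihood.

```stata
* Some estat subcommands after estimation (sketch)
quietly poisson docvis private chronic female income
estat vce, correlation    // correlations implied by the estimated VCE
estat summarize           // summary statistics for the estimation sample
estat ic                  // AIC and BIC
```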


Table 3.1 summarizes basic postestimation commands for regress that 
are generally also available after nonlinear model estimation commands. To 
find the specific postestimation commands available after a command, for 
example, poisson, see [R] poisson postestimation or type help poisson 
postestimation. 


13.3.5 Nonlinear least squares 


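NLS estimators minimize the sum of squared residuals, so for independent
observations, the NLS estimator $\widehat{\boldsymbol{\beta}}$ minimizes

$$Q(\boldsymbol{\beta}) = \sum_{i=1}^{N} \{y_i - m(\mathbf{x}_i, \boldsymbol{\beta})\}^2$$

where $m(\mathbf{x}, \boldsymbol{\beta})$ is the specified functional form for $E(y|\mathbf{x})$, the conditional
mean of y given $\mathbf{x}$.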


If the conditional mean function is correctly specified, then the NLS
estimator is consistent and asymptotically normally distributed. If the
data-generating process (DGP) is $y_i = m(\mathbf{x}_i, \boldsymbol{\beta}) + u_i$, where $u_i \sim N(0, \sigma^2)$, then
NLS is fully efficient. If $u_i \sim [0, \sigma^2]$, then the NLS default estimate of the VCE
is correct; otherwise, a robust estimate should be used.


13.3.6 The nl command


The nl command implements NLS regression. The simplest form of the
command directly defines the conditional mean rather than calling a program
or function. The syntax is


nl (depvar=<sexp>) [if] [in] [weight] [, options]


where <sexp> is a substitutable expression for the conditional mean. The
only relevant option for our analysis here is the vce() option for the type of
estimate of the VCE.


There are several ways to define the expression for the conditional mean
$\exp(\mathbf{x}'\boldsymbol{\beta})$; see [R] nl. We present two ways to do so.


The first example of nl uses a lengthier expression that gives the
parameter names in braces, {}. Additionally, to obtain an
analysis-of-variance table, we suppress the usual use of the vce(robust) option.


. * Nonlinear least-squares regression (command nl) with default standard errors
. nl (docvis = exp({private}*private + {chronic}*chronic
> + {female}*female + {income}*income + {intercept}))

Iteration 0:  residual SS = 251743.9
Iteration 1:  residual SS = 242727.6
Iteration 2:  residual SS = 241818.1
Iteration 3:  residual SS = 241815.4
Iteration 4:  residual SS = 241815.4
Iteration 5:  residual SS = 241815.4
Iteration 6:  residual SS = 241815.4
Iteration 7:  residual SS = 241815.4
Iteration 8:  residual SS = 241815.4

      Source |      SS            df       MS
-------------+----------------------------------    Number of obs =      4,412
       Model |  105898.64          5  21179.7289    R-squared     =     0.3046
    Residual |  241815.36       4407   54.870741    Adj R-squared =     0.3038
-------------+----------------------------------    Root MSE      =   7.407479
       Total |     347714       4412  78.8109701    Res. dev.     =   30185.68

------------------------------------------------------------------------------
      docvis | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    /private |   .7105104   .1170408     6.07   0.000     .4810517     .939969
    /chronic |   1.057318   .0610386    17.32   0.000     .9376517    1.176984
     /female |   .4320225   .0523199     8.26   0.000     .3294491    .5345958
     /income |    .002558   .0006941     3.69   0.000     .0011972    .0039189
  /intercept |   -.040563   .1272218    -0.32   0.750    -.2899816    .2088557
------------------------------------------------------------------------------


The nl coefficient estimates are similar to those from poisson (within 15%
for all regressors except income). The model diagnostic statistics given
include $R^2$ computed as the model (or explained) sum of squares divided by
the total sum of squares, the root mean squared error (MSE) that is the
estimate s of the standard deviation $\sigma$ of the model error, and the residual
deviance defined in section 13.8.3 that is a goodness-of-fit measure used
mostly in the GLM literature.


We next give a shorter equivalent expression for the conditional mean 
function. Also, the vce(robust) option is used to allow for heteroskedastic
errors, and the nolog option is used to suppress the iteration log. We have 


. * Nonlinear least-squares regression - alternative form of command nl
. nl (docvis = exp({xb: private chronic female income} + {intercept})),
> vce(robust) nolog

Nonlinear regression                                Number of obs =      4,412
                                                    R-squared     =     0.3046
                                                    Adj R-squared =     0.3038
                                                    Root MSE      =   7.407479
                                                    Res. dev.     =   30185.68

------------------------------------------------------------------------------
             |               Robust
      docvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
 /xb_private |   .7105104   .1086194     6.54   0.000     .4975618     .923459
 /xb_chronic |   1.057318   .0558352    18.94   0.000      .947853    1.166783
  /xb_female |   .4320225   .0694662     6.22   0.000     .2958338    .5682112
  /xb_income |    .002558   .0012544     2.04   0.041     .0000988    .0050173
  /intercept |   -.040563   .1126215    -0.36   0.719    -.2613578    .1802319
------------------------------------------------------------------------------


The output is the same except for the standard errors, which are now robust 
to heteroskedasticity. Compared with the poisson command robust standard 
errors, the nl robust standard errors are 19% higher for female and 16%
higher for income and are similar for the remaining regressors. 


13.3.7 Generalized linear models 


The GLM framework is the standard nonlinear model framework in many 
areas of applied statistics, most notably, biostatistics. 


The GLM is little used in econometrics. However, for structural 
parametric nonlinear models such as nonlinear models with selection or 
nonlinear multilevel models, the Stata command gsem, detailed in 
section 23.6, is restricted to GLMs, in which case some familiarity with GLMs 
is useful. 


GLM estimators are a subset of ML estimators that are based on a density 
in the LEF, introduced in section 13.3.1. They are essentially generalizations 
of NLS, optimal for a nonlinear regression model with homoskedastic 
additive errors, to other types of data where there is not only intrinsic 
heteroskedasticity but also a natural starting point for modeling the intrinsic 
heteroskedasticity. For example, for the Poisson, the variance equals the 
mean, and for a binary variable, the variance equals the mean times unity 
minus the mean. 


A quite general GLM estimator $\widehat{\boldsymbol{\theta}}$ maximizes the LEF log likelihood

$$Q(\boldsymbol{\theta}) = \sum_{i=1}^{N} \left[ a\{m(\mathbf{x}_i, \boldsymbol{\beta})\} + b(y_i) + c\{m(\mathbf{x}_i, \boldsymbol{\beta})\}\, y_i \right]$$

where $m(\mathbf{x}, \boldsymbol{\beta}) = E(y|\mathbf{x})$ is the conditional mean of y, different specified
forms of the functions $a(\cdot)$ and $c(\cdot)$ correspond to different members of the
LEF, and $b(\cdot)$ is a normalizing constant. For the Poisson, $a(\mu) = -\mu$ and
$c(\mu) = \ln \mu$.


Given definitions of $a(\mu)$ and $c(\mu)$, the mean and variance are
necessarily $E(y) = \mu = -a'(\mu)/c'(\mu)$ and $\mathrm{Var}(y) = 1/c'(\mu)$. For the
Poisson, $a'(\mu) = -1$ and $c'(\mu) = 1/\mu$, so $E(y) = 1/(1/\mu) = \mu$ and
$\mathrm{Var}(y) = 1/c'(\mu) = 1/(1/\mu) = \mu$. This is the variance-mean equality
property of the Poisson.


GLM estimators have the important property that in basic cross-sectional
applications they are consistent provided only that the conditional mean
function is correctly specified. This result arises because the first-order
conditions $\partial Q(\boldsymbol{\theta})/\partial \boldsymbol{\theta} = \mathbf{0}$ can be written as

$$\sum_{i=1}^{N} c'(\mu_i)\,(y_i - \mu_i)\,\frac{\partial \mu_i}{\partial \boldsymbol{\beta}} = \mathbf{0}$$

where $\mu_i = m(\mathbf{x}_i, \boldsymbol{\beta})$. It follows that estimator consistency requires only
that $E(y_i - \mu_i) = 0$ or that $E(y_i|\mathbf{x}_i) = m(\mathbf{x}_i, \boldsymbol{\beta})$. However, unless the
variance is correctly specified [that is, $\mathrm{Var}(y) = 1/c'(\mu)$], we should obtain
a robust estimate of the VCE.


13.3.8 The glm command 


The GLM estimator can be computed by using the glm command. This
command is restricted to a conditional mean function that is of single-index
form, and, for historical reasons, models are defined in terms of the inverse
of the conditional mean function, called the link function. Thus, a GLM
specifies that $y_i$ has conditional mean

$$E(y_i|\mathbf{x}_i) = g^{-1}(\mathbf{x}_i'\boldsymbol{\beta})$$

where $g(\cdot)$ is called the link function. For example, for Poisson regression,
$E(y_i|\mathbf{x}_i) = \exp(\mathbf{x}_i'\boldsymbol{\beta})$, and the link function $g(\cdot)$ is $\ln(\cdot)$, the inverse of the
exponential function.


The glm command has the syntax


glm depvar [indepvars] [if] [in] [weight] [, options]


The key options are family() to define the particular member of the LEF to
be considered and link(), where the link function is the inverse of the
conditional mean function. The family() options are gaussian (normal),
igaussian (inverse Gaussian), binomial (Bernoulli and binomial), poisson
(Poisson), nbinomial (negative binomial), and gamma (gamma). Different
families permit different link functions.
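To illustrate how family() and link() pair up with more familiar commands, the following sketch (an assumed example, not taken from the text) fits a logit model for the binary variable private in two ways. The coefficient estimates from the two commands should coincide.

```stata
* Logit model fit directly and as a GLM (illustrative sketch)
* Assumes the doctor-visits data of section 13.2 are in memory
logit private chronic female income, vce(robust)
glm private chronic female income, family(binomial) link(logit) vce(robust)
```

Replacing link(logit) with link(probit) would correspond to the probit command in the same way.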


The Poisson estimator can be obtained by using the options
family(poisson) and link(log). The link function is the natural logarithm
because this is the inverse of the exponential function for the conditional
mean. We again use the vce(robust) option. We expect the same results as
those from poisson with the vce(robust) option.


. * Generalized linear models regression for poisson (command glm)
. glm docvis private chronic female income,
> family(poisson) link(log) vce(robust) nolog

Generalized linear models                         Number of obs   =      4,412
Optimization     : ML                             Residual df     =      4,407
                                                  Scale parameter =          1
Deviance         =  28131.11439                   (1/df) Deviance =    6.38328
Pearson          =  57126.23793                   (1/df) Pearson  =   12.96261

Variance function: V(u) = u                       [Poisson]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   8.390095
Log pseudolikelihood = -18503.54883               BIC             =  -8852.797

------------------------------------------------------------------------------
             |               Robust
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .7986653   .1090014     7.33   0.000     .5850264    1.012304
     chronic |   1.091865   .0559951    19.50   0.000     .9821167    1.201614
      female |   .4925481   .0585365     8.41   0.000     .3778187    .6072774
      income |    .003557   .0010825     3.29   0.001     .0014354    .0056787
       _cons |  -.2297263   .1108733    -2.07   0.038    -.4470339   -.0124187
------------------------------------------------------------------------------


The results are exactly the same as those given in section 13.3.2 for the
Poisson quasi-MLE, aside from additional diagnostic statistics (deviance,
Pearson) defined in section 13.8.3 that are used in the GLM literature. Robust
standard errors are used because they do not impose the Poisson density
restriction of variance-mean equality.


A standard statistics reference is McCullagh and Nelder (1989); Hardin
and Hilbe (2018) present Stata for GLM, and an econometrics reference that
covers GLM in some detail is Cameron and Trivedi (2013).


13.3.9 Generalized method of moments


GMM estimators minimize an objective function that is more complicated 
than (13.3) because it is a quadratic form in sums. 


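The GMM begins with the population moment conditions

$$E\{\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta})\} = \mathbf{0} \tag{13.4}$$

where $\boldsymbol{\theta}$ is a $q \times 1$ vector, $\mathbf{h}(\cdot)$ is an $r \times 1$ vector function with $r \geq q$, and
the vector $\mathbf{w}_i$ represents all observables, including the dependent variable,
regressors, and, where relevant, instrumental variables (IV). A leading
example is linear IV (see section 7.3), where $\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}) = \mathbf{z}_i(y_i - \mathbf{x}_i'\boldsymbol{\beta})$.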


If $r = q$, then the method-of-moments (MM) estimator $\widehat{\boldsymbol{\theta}}_{\rm MM}$ solves the
corresponding sample moment condition $N^{-1}\sum_{i} \mathbf{h}(\mathbf{w}_i, \widehat{\boldsymbol{\theta}}) = \mathbf{0}$. This is not
possible if $r > q$, such as for an overidentified linear IV model, because there
are then more equations than parameters.


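The GMM estimator $\widehat{\boldsymbol{\theta}}_{\rm GMM}$ minimizes a quadratic form in $\sum_{i} \mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta})$,
with the objective function

$$Q(\boldsymbol{\theta}) = \left\{ \frac{1}{N}\sum_{i=1}^{N} \mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}) \right\}' \mathbf{W} \left\{ \frac{1}{N}\sum_{i=1}^{N} \mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}) \right\} \tag{13.5}$$

where the $r \times r$ weighting matrix $\mathbf{W}$ is positive-definite symmetric and
possibly stochastic with a finite probability limit and does not depend on $\boldsymbol{\theta}$.
The MM estimator, the special case $r = q$, can be obtained most simply by
letting $\mathbf{W} = \mathbf{I}$, or any other value, and then $Q(\boldsymbol{\theta}) = 0$ at the optimum.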


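Provided that condition (13.4) holds, the GMM estimator is consistent for
$\boldsymbol{\theta}$ and is asymptotically normal with the robust estimate of the VCE

$$\widehat{V}(\widehat{\boldsymbol{\theta}}_{\rm GMM}) = (\widehat{\mathbf{G}}'\mathbf{W}\widehat{\mathbf{G}})^{-1}\, \widehat{\mathbf{G}}'\mathbf{W}\widehat{\mathbf{S}}\mathbf{W}\widehat{\mathbf{G}}\, (\widehat{\mathbf{G}}'\mathbf{W}\widehat{\mathbf{G}})^{-1}$$

where, assuming independence over i,

$$\widehat{\mathbf{G}} = \sum_{i=1}^{N} \left. \frac{\partial \mathbf{h}_i}{\partial \boldsymbol{\theta}'} \right|_{\widehat{\boldsymbol{\theta}}} , \qquad \widehat{\mathbf{S}} = \sum_{i=1}^{N} \mathbf{h}_i(\widehat{\boldsymbol{\theta}})\, \mathbf{h}_i(\widehat{\boldsymbol{\theta}})' \tag{13.6}$$

For MM, the variance simplifies to $(\widehat{\mathbf{G}}'\widehat{\mathbf{S}}^{-1}\widehat{\mathbf{G}})^{-1}$ regardless of the
choice of $\mathbf{W}$. For GMM, different choices of $\mathbf{W}$ lead to different estimators.
The best choice of $\mathbf{W}$ is $\mathbf{W} = \widehat{\mathbf{S}}^{-1}$, in which case again the variance
simplifies to $(\widehat{\mathbf{G}}'\widehat{\mathbf{S}}^{-1}\widehat{\mathbf{G}})^{-1}$. For linear IV, an explicit formula for the
estimator can be obtained; see section 7.3.


13.3.10 The gmm command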


GMM estimators minimize an objective function that is a quadratic form in
sums; see (13.5). Optimization is more complicated than the single sum for
m-estimators given in (13.3). For some linear models, there are official Stata
commands, notably, ivregress gmm for cross-sectional data and xtabond for
dynamic panel data. There are no official commands for specific nonlinear
models. Instead, we use the gmm command.


The simplest form of the gmm command directly defines the conditional 
mean and has the syntax 


gmm ([eqname1:]<mexp_1>) ([eqname2:]<mexp_2>) ... [if] [in] [weight]
    [, options]


where <mexp_j> is a substitutable expression for the jth moment equation.
Options include instruments() to define the instruments; one of the
onestep, twostep (the default), and igmm options for, respectively, one-step,
two-step, and iterated GMM estimation; and wmatrix() to define the
weighting matrix if estimation is by two-step or iterated GMM for an
overidentified model. For a just-identified model, these different estimation
methods lead to the same estimates. For models that are more complicated to
specify, we use a variant of the gmm command that references a separate
user-written program that defines the moment conditions.


Postestimation commands following gmm include estat overid for an
overidentified model and the margins command; the example below
illustrates computation of MEs.


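We apply the gmm command to the nonlinear IV estimator for the Poisson
model with an endogenous regressor. The Poisson regression model specifies
that $E\{y - \exp(\mathbf{x}'\boldsymbol{\beta})|\mathbf{x}\} = 0$ because $E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})$. Suppose instead
that $E\{y - \exp(\mathbf{x}'\boldsymbol{\beta})|\mathbf{x}\} \neq 0$, because of endogeneity of one or more
regressors, but that there are instruments $\mathbf{z}$ such that


$$E[\mathbf{z}_i\{y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\}] = \mathbf{0} \tag{13.7}$$


This defines the moment condition in (13.4), and the GMM estimator then
minimizes the quadratic form


$$Q(\boldsymbol{\beta}) = \left[\frac{1}{N}\sum_i \mathbf{z}_i\{y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\}\right]' \mathbf{W} \left[\frac{1}{N}\sum_i \mathbf{z}_i\{y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\}\right]$$


where different estimation methods and choices of the weighting matrix $\mathbf{W}$
lead to different estimates if the model is overidentified (here $\mathbf{z}_i$ has more
entries than $\mathbf{x}_i$).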


We use the dependent and independent variables to define a substitutable
expression $\{y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\}$, using a syntax that is similar to that for the nl
command. We use the instruments() option to define the variables to be
used in the instruments $\mathbf{z}_i$. We continue with the number of doctor-visits
regression example, except that we now treat the regressor private as
endogenous, with single instrument firmsize (measured in hundreds of
employees). We have


. * Command gmm for GMM estimation (nonlinear IV) for Poisson model 
. gmm (docvis - exp({xb:private chronic female income _cons})), 
> instruments(firmsize chronic female income) onestep nolog 


Final GMM criterion Q(b) =  1.29e-17
note: model is exactly identified.

GMM estimation

Number of parameters =   5
Number of moments    =   5
Initial weight matrix: Unadjusted                 Number of obs   =      4,412

------------------------------------------------------------------------------
             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   1.340292   1.559015     0.86   0.390    -1.715322    4.395905
     chronic |   1.072908   .0762684    14.07   0.000     .9234242    1.222391
      female |   .4778178   .0690393     6.92   0.000     .3425032    .6131323
      income |   .0027833    .002192     1.27   0.204    -.0015129    .0070795
       _cons |  -.6832462   1.349606    -0.51   0.613    -3.328424    1.961932
------------------------------------------------------------------------------
Instruments for equation 1: firmsize chronic female income _cons


By default, robust standard errors are computed. The biggest change
compared with the Poisson regression output given in section 13.3.2 is for
the endogenous regressor private. This regressor is now much less
precisely estimated, with a standard error of 1.559 compared with 0.109 and
a coefficient that has increased substantially from 0.798 to 1.340, though it is
now statistically insignificant. Similar efficiency loss in estimation of
endogenous regressors often occurs with linear IV estimation using
cross-sectional data; see the example in section 7.4.6.


The following command, explained in section 13.7.8, computes the
average marginal effects (AMEs) of the regressors on the expected number
of doctor visits.


. * Computation of AMEs following this gmm example
. margins, dydx(*) expression(exp(predict(xb)))


(output omitted ) 


The preceding example was a just-identified model with the same
number of instruments as regressors. Thus, the minimized value of the
objective function is zero (here $1.29 \times 10^{-17}$, reflecting numerical rounding
error). For application of the gmm command to an overidentified model, see
section 20.7.3. Also, [R] gmm provides many more examples.
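
The claim that one-step and two-step GMM coincide for a just-identified model can be checked directly. The following do-file fragment is a minimal sketch, assuming the book's dataset mus210mepsdocvisyoung is in the working directory and that the 2002 subsample is the one used above; the stored-estimate names GMM_one and GMM_two are chosen here for illustration.

```stata
* Sketch: one-step versus two-step GMM in the just-identified Poisson IV example
* Assumes mus210mepsdocvisyoung.dta is available in the current directory
use mus210mepsdocvisyoung, clear
keep if year02==1

* One-step GMM (same specification as in the text)
quietly gmm (docvis - exp({xb:private chronic female income _cons})), ///
    instruments(firmsize chronic female income) onestep
estimates store GMM_one

* Two-step GMM (the default); only the weighting matrix differs
quietly gmm (docvis - exp({xb:private chronic female income _cons})), ///
    instruments(firmsize chronic female income) twostep
estimates store GMM_two

* Coefficients and standard errors side by side
estimates table GMM_one GMM_two, b(%9.5f) se
```

The two columns should agree up to the optimizer's convergence tolerance, because with five moments and five parameters the sample moment conditions can be set to zero whatever the weighting matrix.
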


13.3.11 The gmm command for two-step estimators 


The preceding example entailed estimation based on the single moment 
orthogonality condition (13.7). The gmm command can also be applied to 
multiple moment conditions. 


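As an example, we consider the control function estimator for linear
regression with a single endogenous regressor (see section 7.4.7), which is
an example of a two-step sequential estimator. The structural model is
$y = \mathbf{x}'\boldsymbol{\beta} + u$, where $\mathbf{x}$ includes the endogenous variable $y_2$. The first-stage
model is $y_2 = \mathbf{z}'\boldsymbol{\pi} + v$, where $\mathbf{z}$ includes the exogenous regressors in the
structural model for $y$ and additional instruments necessary for identification.


The control function estimator first ordinary least-squares (OLS) regresses
$y_2$ on $\mathbf{z}$, yielding residual $\widehat{v} = y_2 - \mathbf{z}'\widehat{\boldsymbol{\pi}}$, and then OLS regresses $y$ on $\mathbf{x}$ and $\widehat{v}$
(with coefficient $\gamma$ on $\widehat{v}$), yielding estimates that can be shown to be equal to the two-stage
least-squares (2SLS) estimates. The corresponding sample moment conditions are


$$\sum_i \mathbf{z}_i(y_{2i} - \mathbf{z}_i'\boldsymbol{\pi}) = \mathbf{0} \tag{13.8}$$

$$\sum_i \begin{bmatrix} \mathbf{x}_i \\ y_{2i} - \mathbf{z}_i'\boldsymbol{\pi} \end{bmatrix} \{y_i - \mathbf{x}_i'\boldsymbol{\beta} - \gamma(y_{2i} - \mathbf{z}_i'\boldsymbol{\pi})\} = \mathbf{0}$$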


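The gmm command requires that the instruments provided in the
instruments() option depend only on observed variables, and not unknown
parameters. To enable this, we reexpress (13.8) as


$$\sum_i \mathbf{z}_i(y_{2i} - \mathbf{z}_i'\boldsymbol{\pi}) = \mathbf{0}$$

$$\sum_i \mathbf{x}_i\{y_i - \mathbf{x}_i'\boldsymbol{\beta} - \gamma(y_{2i} - \mathbf{z}_i'\boldsymbol{\pi})\} = \mathbf{0}$$

$$\sum_i 1 \times [(y_{2i} - \mathbf{z}_i'\boldsymbol{\pi})\{y_i - \mathbf{x}_i'\boldsymbol{\beta} - \gamma(y_{2i} - \mathbf{z}_i'\boldsymbol{\pi})\}] = 0$$


so the instruments are, respectively, $\mathbf{z}$, $\mathbf{x}$, and 1.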


We apply this method to the example of section 7.4.7. We first define 
globals that make the code easier to read. 


. * Data and globals for gmm estimation of chapter 7 control function example
. qui use mus207mepspresdrugs, clear

. global y1 ldrugexp // Dependent variable

. global y2 hi_empunion // Endogenous regressor

. global x2list totchr age female blhisp linc // Exogenous regressors

. global xlist ssiratio $x2list // First-stage regressors (instrument and exogenous regressors)

. global zlist $y2 $x2list // Structural equation regressors


We then obtain the estimates 


. * Command gmm for estimation of control function linear model
. gmm (eq1: ($y2 - {zpi:ssiratio $x2list _cons}))
>     (eq2: ($y1 - {xb:$zlist _cons} - {gamma}*($y2 - {zpi:})))
>     (eq3: (($y2 - {zpi:}) * ($y1 - {xb:} - {gamma}*($y2 - {zpi:})))),
>     instruments(eq1: $xlist) instruments(eq2: $zlist) instruments(eq3: )
>     onestep winitial(unadjusted, independent) nolog

Final GMM criterion Q(b) =  2.96e-32
note: model is exactly identified.

GMM estimation

Number of parameters =  15
Number of moments    =  15
Initial weight matrix: Unadjusted                 Number of obs   =     10,068

----------------------------------------------------------------------
             |               Robust
             | Coefficient  std. err.    P>|z|    [95% conf. interval]
-------------+--------------------------------------------------------
zpi          |
    ssiratio |  -.2205333   .0151123    0.000    -.2501529   -.1909137
      totchr |   .0133494   .0036373    0.000     .0062205    .0204783
         age |  -.0084926   .0007038    0.000    -.0098721   -.0071131
      female |  -.0721048   .0096345    0.000    -.0909881   -.0532215
      blhisp |  -.0609939    .012235    0.000    -.0849741   -.0370137
        linc |   .0454804   .0061999    0.000     .0333288     .057632
       _cons |   1.039144   .0574802    0.000     .9264845    1.151803
-------------+--------------------------------------------------------
xb           |
 hi_empunion |  -.8115015   .1884024    0.000    -1.180763   -.4422396
      totchr |   .4492135   .0100423    0.000      .429531     .468896
         age |  -.0122415   .0027711    0.000    -.0176727   -.0068102
      female |  -.0140507   .0310339    0.651    -.0748761    .0467747
      blhisp |  -.2089573   .0383529    0.000    -.2841276   -.1337869
        linc |   .0815472   .0207576    0.000     .0408631    .1222314
       _cons |   6.692548   .2449266    0.000     6.212501    7.172595
-------------+--------------------------------------------------------
      /gamma |   .9039114   .1900969    0.000     .5313283    1.276495
----------------------------------------------------------------------
Instruments for equation eq1: ssiratio totchr age female blhisp linc _cons
Instruments for equation eq2: hi_empunion totchr age female blhisp linc _cons
Instruments for equation eq3: _cons


The estimates are identical to those in section 7.4.7, except that the correct
heteroskedastic–robust standard errors that control for the first-stage
estimation are computed. For the linear model, the control function estimator
equals the 2SLS estimator, and, as expected, the estimates and standard errors
are identical to those for 2SLS given in section 7.4.4.
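
The equivalence with 2SLS noted above can be checked with a direct benchmark. The fragment below is a sketch that assumes the same dataset mus207mepspresdrugs is available; variable names are written out rather than using the globals so that the fragment runs on its own.

```stata
* Sketch: 2SLS benchmark for the control function gmm estimates
* Assumes mus207mepspresdrugs.dta is available in the current directory
use mus207mepspresdrugs, clear

* hi_empunion is endogenous and is instrumented by ssiratio;
* the remaining regressors are treated as exogenous
ivregress 2sls ldrugexp totchr age female blhisp linc ///
    (hi_empunion = ssiratio), vce(robust)
```

The coefficient and robust standard error on hi_empunion from this command can then be compared with the xb equation of the gmm output.
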


This example extends to control function estimators in nonlinear models 
with an endogenous regressor. More generally, the method can be applied to 
two-step estimators, such as Heckman’s two-step estimator for the sample- 
selection model in section 19.6.4. Drukker (2014) outlines this method using 
the gmm command for a treatment-effects example. The method is used for 
many of Stata’s treatment-effects estimators; see section 24.6.12. 


13.3.12 Other estimators 


The preceding part of this chapter covers most of the estimators used in 
microeconometrics analysis using cross-sectional data. We now consider 
some nonlinear estimators that are not covered. 


One approach is to specify a linear function for $E\{h(y)|\mathbf{x}\}$, so
$E\{h(y)|\mathbf{x}\} = \mathbf{x}'\boldsymbol{\beta}$, where $h(\cdot)$ is a nonlinear function. An example is the
Box–Cox transformation in section 3.6. A disadvantage of this alternative
approach is the transformation bias that arises if we then wish to predict $y$ or
$E(y|\mathbf{x})$.


Nonparametric and semiparametric estimators do not completely specify 
the functional forms of key model components such as $E(y|\mathbf{x})$. Several
methods for nonparametric regression of y on a scalar x, including the 
lpoly, npregress, and lowess commands, are introduced in sections 2.6.6 
and 14.6, and chapter 27 provides a more thorough treatment. Nonlinear 
methods for clustered data are summarized in section 13.9, and nonlinear 
methods for panel data are presented in chapter 22. 


13.4 Different estimates of the VCE 


Given an estimator, there are several different standard methods for 
computation of standard errors and subsequent test statistics and confidence 
intervals. The most commonly used methods yield default, robust, and 
cluster—robust standard errors. This section extends the results in section 3.4 
for the OLS estimator to nonlinear estimators. 


13.4.1 General framework 


We consider inference for the estimator $\widehat{\boldsymbol{\theta}}$ of a $q \times 1$ parameter vector $\boldsymbol{\theta}$ that
solves the $q$ equations


$$\sum_{i=1}^{N} \mathbf{g}_i(\widehat{\boldsymbol{\theta}}) = \mathbf{0} \tag{13.9}$$


where $\mathbf{g}_i(\cdot)$ is a $q \times 1$ vector. For m-estimators defined in section 13.3,
differentiation of objective function (13.3) leads to first-order conditions
with $\mathbf{g}_i(\boldsymbol{\theta}) = \partial q_i(y_i, \mathbf{x}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}$. It is assumed that


$$E\{\mathbf{g}_i(\boldsymbol{\theta})\} = \mathbf{0}$$


a condition that for standard estimators is necessary and sufficient for
consistency of $\widehat{\boldsymbol{\theta}}$.


This setup covers most models and estimators, with the notable
exception of the overidentified 2SLS and GMM estimators presented in
chapter 7 in the linear case, and section 13.3.9 in the nonlinear case.


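Under appropriate assumptions, it can be shown that


$$\widehat{\boldsymbol{\theta}} \overset{a}{\sim} N\{\boldsymbol{\theta}, \mathrm{Var}(\widehat{\boldsymbol{\theta}})\}$$


where $\mathrm{Var}(\widehat{\boldsymbol{\theta}})$ denotes the (asymptotic) VCE. Furthermore,


$$\mathrm{Var}(\widehat{\boldsymbol{\theta}}) = \left[E\left\{\sum_i \mathbf{H}_i(\boldsymbol{\theta})\right\}\right]^{-1} E\left\{\sum_i\sum_j \mathbf{g}_i(\boldsymbol{\theta})\mathbf{g}_j(\boldsymbol{\theta})'\right\} \left[E\left\{\sum_i \mathbf{H}_i(\boldsymbol{\theta})'\right\}\right]^{-1} \tag{13.10}$$


where $\mathbf{H}_i(\boldsymbol{\theta}) = \partial \mathbf{g}_i/\partial\boldsymbol{\theta}'$. This general expression for $\mathrm{Var}(\widehat{\boldsymbol{\theta}})$ is said to
be of “sandwich form” because it can be written as $\mathbf{A}^{-1}\mathbf{B}\mathbf{A}'^{-1}$, with $\mathbf{B}$
sandwiched between $\mathbf{A}^{-1}$ and $\mathbf{A}'^{-1}$. OLS is a special case with
$\mathbf{g}_i(\boldsymbol{\beta}) = \mathbf{x}_i u_i = \mathbf{x}_i(y_i - \mathbf{x}_i'\boldsymbol{\beta})$ and $\mathbf{H}_i(\boldsymbol{\beta}) = -\mathbf{x}_i\mathbf{x}_i'$, where the sign cancels in the sandwich.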


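We wish to obtain the estimated asymptotic variance matrix $\widehat{V}(\widehat{\boldsymbol{\theta}})$ and
the associated standard errors, which are the square roots of the diagonal
entries of $\widehat{V}(\widehat{\boldsymbol{\theta}})$. This obviously entails replacing $\boldsymbol{\theta}$ with $\widehat{\boldsymbol{\theta}}$. The first and
third matrices in (13.10) can be estimated using $\widehat{\mathbf{A}} = \sum_i \mathbf{H}_i(\widehat{\boldsymbol{\theta}})$. But
estimation of $E\{\sum_i\sum_j \mathbf{g}_i(\boldsymbol{\theta})\mathbf{g}_j(\boldsymbol{\theta})'\}$ requires additional distributional
assumptions, such as independence over $i$ and possibly a functional form for
$E\{\mathbf{g}_i(\boldsymbol{\theta})\mathbf{g}_i(\boldsymbol{\theta})'\}$. [Note that the obvious $\sum_i\sum_j \mathbf{g}_i(\widehat{\boldsymbol{\theta}})\mathbf{g}_j(\widehat{\boldsymbol{\theta}})' = \mathbf{0}$ because
from (13.9) $\sum_i \mathbf{g}_i(\widehat{\boldsymbol{\theta}}) = \mathbf{0}$.]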
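
As a worked instance of (13.10), the OLS case under independence over $i$ can be written out. The block below is a sketch that additionally assumes $E(u_i|\mathbf{x}_i) = 0$; it is meant only to show how the middle term collapses to a single sum.

```latex
% OLS as a special case of the sandwich form (13.10), assuming independence over i
% and E(u_i | x_i) = 0 (assumptions stated for this illustration)
\begin{align*}
g_i(\beta) &= x_i u_i = x_i (y_i - x_i'\beta), \qquad
H_i(\beta) = \partial g_i / \partial \beta' = -x_i x_i' \\
E\Big\{ \sum_i \sum_j g_i g_j' \Big\}
  &= \sum_i E(u_i^2 x_i x_i')
  \quad \text{because } E(u_i u_j x_i x_j') = 0 \text{ for } i \neq j \\
\widehat{V}(\widehat{\beta})
  &= \Big( \sum_i x_i x_i' \Big)^{-1}
     \Big( \sum_i \widehat{u}_i^{\,2} x_i x_i' \Big)
     \Big( \sum_i x_i x_i' \Big)^{-1}
\end{align*}
% The two minus signs from H_i cancel; a degrees-of-freedom factor such as N/(N-K)
% may multiply the middle term in practice.
```
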


13.4.2 The vce() option 


Different assumptions lead to different estimates of the VCE. They are
obtained by using the vce(vcetype) option for the estimation command
being used. The specific vcetypes available vary with the estimation
command. Their formulas are detailed in subsequent sections.


For the poisson command, many vcetypes are supported. 


The vce(oim) and vce(opg) options use the DGP assumptions to evaluate
the expectations in (13.10); see section 13.4.4. The vce(oim) option is the
default.


The vce(robust) and vce(cluster clustvar) options use sandwich
estimators that do not use the DGP assumptions to explicitly evaluate the
expectations in (13.10). The vce(robust) option assumes independence
over $i$. The vce(cluster clustvar) option permits a limited form of
correlation over $i$, within clusters, where the clusters are independent and
there are many clusters; see section 13.4.6. For commands that already
control for clustering, such as xtreg, the vce(robust) option provides a
cluster–robust estimate of the VCE.


The vce(bootstrap) and vce(jackknife) options use resampling
schemes that make limited assumptions on the DGP similar to those for the
option vce(robust) or vce(cluster clustvar); see section 13.4.8.


The various vce() options need to be used with considerable caution.
Estimates of the VCE other than the default estimate are used when some part
of the DGP is felt to be misspecified. Because this is likely to be the case, it is
standard to use robust standard errors, usually
heteroskedastic-robust for independent observations and cluster–robust for
clustered observations. At the same time, in many instances,
misspecification of the DGP leads to inconsistent parameter estimates; see
section 13.3.1.


13.4.3 Application of the vce() option 


For count data, the natural starting point is the MLE, assuming a Poisson
distribution. It can be shown that the default ML standard errors are based on
the Poisson distribution restriction of variance–mean equality. But in
practice, count data are often “overdispersed” with $\mathrm{Var}(y|\mathbf{x}) > \exp(\mathbf{x}'\boldsymbol{\beta})$, in
which case the default ML standard errors can be shown to be biased
downward. At the same time, the Poisson MLE can be shown to be consistent
provided only that $E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})$ is the correct specification of the
conditional mean.


These considerations make the Poisson MLE a leading candidate for using
the option vce(robust) rather than the default. The vce(cluster
clustvar) option assumes independence over clusters, however clusters are
defined, rather than independence over $i$. The vce(bootstrap) estimate is
asymptotically equivalent to the vce(robust) estimate.


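For the Poisson MLE with $K$ parameters, it can be shown that the default,
robust, and cluster–robust estimates of the VCE are given by, respectively,


$$\widehat{V}_{def}(\widehat{\boldsymbol{\beta}}) = \left(\sum_i e^{\mathbf{x}_i'\widehat{\boldsymbol{\beta}}}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}$$

$$\widehat{V}_{rob}(\widehat{\boldsymbol{\beta}}) = \left(\sum_i e^{\mathbf{x}_i'\widehat{\boldsymbol{\beta}}}\mathbf{x}_i\mathbf{x}_i'\right)^{-1} \left\{\frac{N}{N-K}\sum_i (y_i - e^{\mathbf{x}_i'\widehat{\boldsymbol{\beta}}})^2\mathbf{x}_i\mathbf{x}_i'\right\} \left(\sum_i e^{\mathbf{x}_i'\widehat{\boldsymbol{\beta}}}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}$$

$$\widehat{V}_{clu}(\widehat{\boldsymbol{\beta}}) = \left(\sum_i e^{\mathbf{x}_i'\widehat{\boldsymbol{\beta}}}\mathbf{x}_i\mathbf{x}_i'\right)^{-1} \left(\frac{C}{C-1}\sum_c \widehat{\mathbf{g}}_c\widehat{\mathbf{g}}_c'\right) \left(\sum_i e^{\mathbf{x}_i'\widehat{\boldsymbol{\beta}}}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}$$


where $\widehat{\mathbf{g}}_c = \sum_{i: i \in c}(y_i - e^{\mathbf{x}_i'\widehat{\boldsymbol{\beta}}})\mathbf{x}_i$, $c = 1, \ldots, C$ denotes the clusters, and
$N/(N-K)$ and $C/(C-1)$ are degrees-of-freedom adjustments used by
Stata.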


Implementation is straightforward, except that in this example, there is 
no natural reason for clustering. For illustrative purposes, we cluster on age, 
in which case we are assuming correlation across individuals of the same age 
and independence of individuals of different age. For the bootstrap, we first 
set the seed, for replicability, and set the number of replications at 400, 
considerably higher than the Stata default. We obtain 


. * Different VCE estimates after Poisson regression 
. qui use mus210mepsdocvisyoung, clear 


. keep if year02==1
(25,712 observations deleted) 


. qui poisson docvis private chronic female income 

. estimates store VCE_oim 

. qui poisson docvis private chronic female income, vce(opg) 

. estimates store VCE_opg 

. qui poisson docvis private chronic female income, vce(robust) 

. estimates store VCE_rob 

. qui poisson docvis private chronic female income, vce(cluster age) 

. estimates store VCE_clu 

. set seed 10101 

. qui poisson docvis private chronic female income, vce(bootstrap, reps(400))
. estimates store VCE_boot 

. estimates table VCE_oim VCE_opg VCE_rob VCE_clu VCE_boot, b(%8.4f) se 


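--------------------------------------------------------------------------
    Variable |  VCE_oim    VCE_opg    VCE_rob    VCE_clu    VCE_boot
-------------+------------------------------------------------------------
     private |   0.7987     0.7987     0.7987     0.7987     0.7987
             |   0.0277     0.0072     0.1090     0.1496     0.1039
     chronic |   1.0919     1.0919     1.0919     1.0919     1.0919
             |   0.0158     0.0046     0.0560     0.0603     0.0566
      female |   0.4925     0.4925     0.4925     0.4925     0.4925
             |   0.0160     0.0046     0.0585     0.0686     0.0594
      income |   0.0036     0.0036     0.0036     0.0036     0.0036
             |   0.0002     0.0001     0.0011     0.0012     0.0011
       _cons |  -0.2297    -0.2297    -0.2297    -0.2297    -0.2297
             |   0.0287     0.0075     0.1109     0.1454     0.1044
--------------------------------------------------------------------------
                                                              Legend: b/se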


The first two ML-based standard errors, explained in section 13.4.4, are very
different. This indicates a problem with the assumption of a Poisson density.
The third-column robust standard errors are roughly four times the
first-column default standard errors. This very large difference often happens
when fitting Poisson models. For other estimators, the difference is usually
not as great. In particular, for OLS, robust standard errors are often within
20% (higher or lower) of the default. The fourth-column cluster–robust
standard errors are 8–37% higher than the robust standard errors. In other
applications, the difference can be much larger; see section 3.4.6 for a rule of
thumb in the linear case. The fifth-column bootstrap standard errors are
within about 6% of the third-column robust standard errors, consistent with
the two being asymptotically equivalent.


In this example, it would be misleading to use the default standard errors.
We should at least use the robust standard errors. This requires relaxing the
assumption of a Poisson distribution so that the model should not be used to
predict conditional probabilities. But, at least for the Poisson MLE, $\widehat{\boldsymbol{\beta}}$ is a
consistent estimate provided that the conditional mean is indeed the
specified $\exp(\mathbf{x}'\boldsymbol{\beta})$.


13.4.4 Default estimate of the VCE 


If no option is used, then we obtain “default” standard errors. These make
the strongest assumptions, essentially that all relevant parts of the DGP are
specified and are specified correctly. This permits considerable
simplification, not given here, that leads to the sandwich form $\mathbf{A}^{-1}\mathbf{B}\mathbf{A}'^{-1}$
simplifying to a multiple of $\mathbf{A}^{-1}$.


For the MLE (with data independent over $i$), it is assumed that the density
is correctly specified. Then the information matrix equality leads to
simplification so that


$$\widehat{V}_{def}(\widehat{\boldsymbol{\theta}}) = -\left[\sum_i E\{\mathbf{H}_i(\boldsymbol{\theta})\}\Big|_{\widehat{\boldsymbol{\theta}}}\right]^{-1}$$


The default vce(oim) option, where oim is an acronym for the observed
information matrix, gives this estimate of the VCE for Stata ML commands.
This estimator is also the default for regress, in which case it goes under the
name vce(ols), yielding $\widehat{V}_{def}(\widehat{\boldsymbol{\beta}}) = s^2(\sum_i \mathbf{x}_i\mathbf{x}_i')^{-1}$ with
$s^2 = \sum_i \widehat{u}_i^2/(N-K)$.


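The vce(opg) option gives an alternative estimate, called the outer
product of the gradient estimate:


$$\widehat{V}_{opg}(\widehat{\boldsymbol{\theta}}) = \left\{\sum_i \mathbf{g}_i(\widehat{\boldsymbol{\theta}})\mathbf{g}_i(\widehat{\boldsymbol{\theta}})'\right\}^{-1}$$


This is asymptotically equivalent to the default estimate if the density is
correctly specified.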


13.4.5 Heteroskedastic-robust estimate of the VCE 


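The vce(robust) option to cross-sectional estimation commands calculates
the sandwich estimate under the assumption of independence. Then
$E\{\sum_i\sum_j \mathbf{g}_i(\boldsymbol{\theta})\mathbf{g}_j(\boldsymbol{\theta})'\} = E\{\sum_i \mathbf{g}_i(\boldsymbol{\theta})\mathbf{g}_i(\boldsymbol{\theta})'\}$, leading to the robust
estimate of the VCE


$$\widehat{V}_{rob}(\widehat{\boldsymbol{\theta}}) = \left(\sum_i \widehat{\mathbf{H}}_i\right)^{-1} \left(\frac{N}{N-K}\sum_i \widehat{\mathbf{g}}_i\widehat{\mathbf{g}}_i'\right) \left(\sum_i \widehat{\mathbf{H}}_i'\right)^{-1}$$


where $\widehat{\mathbf{H}}_i = \mathbf{H}_i(\widehat{\boldsymbol{\theta}})$ and $\widehat{\mathbf{g}}_i = \mathbf{g}_i(\widehat{\boldsymbol{\theta}})$. In some special cases, such as NLS, $\widehat{\mathbf{H}}_i$
is replaced by the expected Hessian $E(\mathbf{H}_i)$ evaluated at $\widehat{\boldsymbol{\theta}}$. The factor
$N/(N-K)$ in the middle term is an ad hoc degrees-of-freedom adjustment
analogous to that for the linear regression model with independent and
identically distributed normal errors. This estimator is a generalization of
similar results of Huber (1967) for the MLE and the heteroskedasticity-consistent
estimate of White (1980) for the OLS estimator. It is often called
heteroskedasticity robust rather than robust.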


The vce(robust) option should be used with caution. It is robust in the
sense that, unlike default standard errors, no assumption is made about the
functional form for $E\{\mathbf{g}_i(\boldsymbol{\theta})\mathbf{g}_i(\boldsymbol{\theta})'\}$. But if $E\{\mathbf{g}_i(\boldsymbol{\theta})\mathbf{g}_i(\boldsymbol{\theta})'\}$ is misspecified,
warranting use of robust standard errors, then it may also be the case that
$E\{\mathbf{g}_i(\boldsymbol{\theta})\} \neq \mathbf{0}$. Then we have the much more serious problem of $\widehat{\boldsymbol{\theta}}$ being
inconsistent for $\boldsymbol{\theta}$. For example, the tobit MLE and the MLE for any other
parametric model with selection or truncation become inconsistent as soon
as any distributional assumptions are relaxed. The only advantage then of
using the robust estimate of the VCE is that it does give a consistent estimate
of the VCE. However, it is the VCE of an inconsistent estimator.


Essentially, the vce(robust) option means that the standard errors are
robust to model misspecification, provided observations are independent, but
does not mean in general that the estimator itself is robust to
misspecification.


There are, however, some commonly used estimators that maintain
consistency under relatively weak assumptions. ML and GLM estimators based
on the LEF (see section 13.3.1) require only that the conditional mean
function be correctly specified for parameter estimates to be consistent. And
IV estimators are consistent provided only that a valid instrument is used so
that the model error $u_i$ and instrument vector $\mathbf{z}_i$ satisfy $E(u_i|\mathbf{z}_i) = 0$.


The preceding discussion applies to cross-sectional estimation
commands. For panel data or clustered data, the vce(robust) option for xt
commands such as xtreg produces a cluster–robust estimate of the VCE that
we now present.


13.4.6 Cluster—robust estimate of the VCE 


A common alternative to independent observations is that observations fall 
into clusters, where observations in different clusters are independent but 
observations within the same cluster are no longer independent. For 
example, individuals may be grouped into villages, with correlation within 
villages but not across villages. Such regional groupings are especially 
important to control for if the regressor of interest, such as a policy variable, 
is invariant within the region. Then, for cross-sectional estimators, the 
heteroskedastic—robust estimate of the VCE is incorrect and can be 
substantially downward biased; see section 6.4.2 for an empirical example of 
this consequence of clustered errors. 


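Instead, we use a cluster–robust estimate of the VCE. The first-order
conditions can be summed within cluster and reexpressed as


$$\sum_{c=1}^{C} \mathbf{g}_c(\widehat{\boldsymbol{\theta}}) = \mathbf{0}$$


where $c$ denotes the $c$th cluster, there are $C$ clusters, and
$\mathbf{g}_c(\boldsymbol{\theta}) = \sum_{i: i \in c} \mathbf{g}_i(\boldsymbol{\theta})$. The key assumption is that $E\{\mathbf{g}_i(\boldsymbol{\theta})\mathbf{g}_j(\boldsymbol{\theta})'\} = \mathbf{0}$ if $i$
and $j$ are in different clusters. Only minor adaptation of the previous algebra
is needed, and we obtain


$$\widehat{V}_{clu}(\widehat{\boldsymbol{\theta}}) = \left(\sum_c \widehat{\mathbf{H}}_c\right)^{-1} \left(\frac{C}{C-1}\sum_c \widehat{\mathbf{g}}_c\widehat{\mathbf{g}}_c'\right) \left(\sum_c \widehat{\mathbf{H}}_c'\right)^{-1}$$


where $\mathbf{H}_c(\boldsymbol{\theta}) = \partial\mathbf{g}_c(\boldsymbol{\theta})/\partial\boldsymbol{\theta}'$. This estimator was proposed by Liang and
Zeger (1986), and the scaling $C/(C-1)$ is an ad hoc degrees-of-freedom
correction. The estimator assumes that the number of clusters $C \to \infty$.
When each cluster has only one observation,
$\widehat{V}_{clu}(\widehat{\boldsymbol{\theta}}) = \{(N-K)/(N-1)\}\widehat{V}_{rob}(\widehat{\boldsymbol{\theta}})$; the cluster–robust and robust
standard errors then differ only by a small degrees-of-freedom correction.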


This estimator is obtained by using the vce(cluster clustvar) option,
where clustvar is the name of the variable that defines the cluster, such as a
village identifier. For panel data, or clustered data, xt commands such as
xtreg already explicitly allow for clustering in estimation, and the
cluster–robust estimate of the VCE is obtained by using the vce(robust) option
rather than the vce(cluster clustvar) option.


The same caveat as in the heteroskedastic–robust case applies. It is still
necessary that $E\{\mathbf{g}_c(\boldsymbol{\theta})\} = \mathbf{0}$ to ensure consistent estimation of $\boldsymbol{\theta}$.
Essentially, the joint distribution of the $\mathbf{g}_i(\boldsymbol{\theta})$ within cluster can be
misspecified because of assuming independence when there is in fact
dependence, but the marginal distribution of $\mathbf{g}_i(\boldsymbol{\theta})$ must be correctly
specified in the sense that $E\{\mathbf{g}_i(\boldsymbol{\theta})\} = \mathbf{0}$ for each component of the cluster.


An additional caveat is that the validity of $\widehat{V}_{clu}(\widehat{\boldsymbol{\theta}})$ is based on
asymptotic theory in the number of clusters. When there are few clusters,
say, fewer than 30, the asymptotic theory provides a poor approximation,
leading to test overrejection. At a minimum, one should base inference on
the $t(C-1)$ distribution, rather than use Stata’s default for nonlinear
commands that uses the standard normal. Even this adjustment leads to
overrejection in practice, and this complication is an area of ongoing
research. The community-contributed boottest command can be used to
implement a wild cluster bootstrap following estimation of many models;
see section 12.6.
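
The $t(C-1)$ adjustment mentioned above can be computed by hand after estimation. The fragment below is a sketch for the coefficient of private in the Poisson example clustered on age; it assumes the dataset mus210mepsdocvisyoung is available, and the scalar names (n_clus, t_priv, and so on) are chosen here for illustration.

```stata
* Sketch: inference based on t(C-1) rather than the standard normal
* Assumes mus210mepsdocvisyoung.dta is available in the current directory
use mus210mepsdocvisyoung, clear
keep if year02==1
quietly poisson docvis private chronic female income, vce(cluster age)

scalar n_clus = e(N_clust)                      // number of clusters C
scalar t_priv = _b[private]/_se[private]        // test statistic for H0: coefficient = 0
scalar t_crit = invttail(scalar(n_clus) - 1, 0.025)
scalar ci_lo  = _b[private] - scalar(t_crit)*_se[private]
scalar ci_hi  = _b[private] + scalar(t_crit)*_se[private]

display "Number of clusters C:  " scalar(n_clus)
display "p-value using N(0,1):  " 2*normal(-abs(scalar(t_priv)))
display "p-value using t(C-1):  " 2*ttail(scalar(n_clus) - 1, abs(scalar(t_priv)))
display "95% CI using t(C-1):   " scalar(ci_lo) " to " scalar(ci_hi)
```

The $t(C-1)$ p-value is never smaller than the normal one, and the gap widens as the number of clusters falls.
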


13.4.7 Heteroskedasticity- and autocorrelation-consistent estimate of the 
VCE 


Heteroskedasticity- and autocorrelation-consistent (HAC) estimates of the
VCE, such as the Newey–West (1987) estimator, are a generalization of the
robust estimate to time-series data. This permits some correlation of adjacent
observations, up to, say, $m$ periods apart.


HAC estimates are implemented in Stata for some linear time-series
estimators, such as newey for OLS and ivregress for IV regression. For
nonlinear estimators, HAC estimates are available for nl, glm, and gmm by
specifying the vce(hac kernel) option, in which case you must tsset your
data.


Spatial HAC estimates that control for error correlation that dampens with 
distance between observations, rather than difference in time, are presented 
in section 26.6.1. 


In microeconometrics analysis, panel data have a time-series component.
For short panels covering few time periods, there is no need to use HAC
estimates. For long panels spanning many time periods, there is more reason
to use HAC estimates of the VCE. An example using the
community-contributed xtscc command is given in section 9.5.5.


13.4.8 Bootstrap standard errors 


Stata estimation commands with the vce(bootstrap) option provide
standard errors using the bootstrap, specifically, a paired bootstrap. The
default bootstrap assumes independent observations and is asymptotically
equivalent to computing robust standard errors, provided that the number of
bootstraps is large. Similarly, a cluster bootstrap that assumes independence
across clusters but not within clusters is equivalent to computing
cluster–robust standard errors.
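
The cluster bootstrap described above can be compared with cluster–robust standard errors. The fragment below is a sketch using the Poisson example clustered on age; it assumes the dataset mus210mepsdocvisyoung is available, and the stored-estimate names are illustrative.

```stata
* Sketch: cluster-robust versus cluster-bootstrap standard errors
* Assumes mus210mepsdocvisyoung.dta is available in the current directory
use mus210mepsdocvisyoung, clear
keep if year02==1

quietly poisson docvis private chronic female income, vce(cluster age)
estimates store CLU_rob

* Resample whole clusters (ages) rather than individual observations
quietly poisson docvis private chronic female income, ///
    vce(bootstrap, cluster(age) reps(400) seed(10101))
estimates store CLU_boot

estimates table CLU_rob CLU_boot, b(%8.4f) se
```

Setting seed() inside vce() makes the resampling reproducible without a separate set seed command.
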


A related option is vce(jackknife). This can be computationally
demanding because it involves recomputing the estimator $N$ times, where $N$
is the sample size.


These methods were detailed in chapter 12. In that chapter, we also
presented a different use of the bootstrap to implement a more refined
asymptotic theory that can lead to $t$ statistics with better size properties, and
confidence intervals with better coverage, in finite samples. In particular,
with few clusters, the wild cluster bootstrap can be implemented using the
community-contributed boottest command; see section 12.6.


13.4.9 Statistical inference 


Given a method to estimate the VCE, we can compute standard errors, $t$
statistics, confidence intervals, and Wald hypothesis tests. These are
automatically provided by estimation commands such as poisson. Some
tests, notably likelihood-ratio tests, are no longer appropriate once DGP
assumptions are relaxed to allow, for example, a robust estimate of the VCE.


More complicated statistical inference can be performed by using the
test, testnl, lincom, and nlcom commands, which are detailed in
section 11.3.
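
As a brief illustration of these postestimation commands, the fragment below applies test, lincom, and nlcom after the Poisson regression with robust standard errors. It is a sketch that assumes the dataset mus210mepsdocvisyoung is available; the particular hypotheses are chosen only for illustration.

```stata
* Sketch: Wald inference after poisson with a robust estimate of the VCE
* Assumes mus210mepsdocvisyoung.dta is available in the current directory
use mus210mepsdocvisyoung, clear
keep if year02==1
quietly poisson docvis private chronic female income, vce(robust)

* Joint Wald test that the coefficients of female and income are both zero
test female income

* Linear combination: difference between two coefficients
lincom private - chronic

* Nonlinear function: proportionate change in the conditional mean when private = 1
nlcom exp(_b[private]) - 1
```

All three commands use the VCE stored by the preceding estimation command, so the results here reflect the robust estimate.
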


13.5 Prediction 


Prediction of the conditional mean and forecast of the actual value of y 
following linear regression was presented in sections 4.2 and 4.3. 


In this section, we consider prediction following nonlinear estimation. 
Most often, the prediction is one of the conditional mean E(y|x). This can 
be much more precisely predicted than can the actual value of y given x. 


13.5.1 The predict and predictnl commands


A new variable that contains the prediction for each observation can be 
obtained by using the predict postestimation command. After single- 
equation commands, this command has the syntax 


predict [type] newvar [if] [in] [, options]


The prediction is stored as the variable newvar and is of the data type type, 
the default being single precision. The type of prediction desired is defined 
with options, and several different types of prediction are usually available. 
The possibilities vary with the preceding estimation command. 


After poisson, the key option for the predict command is the default n
option. This computes $\exp(\mathbf{x}_i'\widehat{\boldsymbol{\beta}})$, the predicted expected number of events.
The xb option calculates the linear prediction $\mathbf{x}_i'\widehat{\boldsymbol{\beta}}$, and stdp calculates
$\{\mathbf{x}_i'\widehat{V}(\widehat{\boldsymbol{\beta}})\mathbf{x}_i\}^{1/2}$, the standard error of $\mathbf{x}_i'\widehat{\boldsymbol{\beta}}$. The score option calculates the
derivative of the log likelihood with respect to the linear prediction. For the
Poisson MLE, this is $y_i - \exp(\mathbf{x}_i'\widehat{\boldsymbol{\beta}})$ and can be viewed as a Poisson residual.
The pr(a) option calculates $\Pr(y = a)$, and the pr(a,b) option calculates
$\Pr(a \le y \le b)$. These last two options are useful only if the Poisson is the
correct distribution.
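
The predict options just listed can be tried in turn. The fragment below is a sketch that assumes the dataset mus210mepsdocvisyoung is available; the new variable names are chosen here for illustration.

```stata
* Sketch: the main predict options after poisson
* Assumes mus210mepsdocvisyoung.dta is available in the current directory
use mus210mepsdocvisyoung, clear
keep if year02==1
quietly poisson docvis private chronic female income, vce(robust)

predict double mu_n  if e(sample), n         // exp(x'b): predicted number of events
predict double xbhat if e(sample), xb        // linear prediction x'b
predict double se_xb if e(sample), stdp      // standard error of x'b
predict double resid if e(sample), score     // y - exp(x'b)
predict double p0    if e(sample), pr(0)     // Pr(y = 0) under the Poisson
predict double p1to3 if e(sample), pr(1,3)   // Pr(1 <= y <= 3) under the Poisson

summarize mu_n xbhat se_xb resid p0 p1to3
```

The mean of resid should be essentially zero, because the Poisson first-order condition for the intercept sets the sum of these residuals to zero.
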


The predictnl command enables the user to provide a formula for the
prediction. The syntax is


predictnl [type] newvar = pnl_exp [if] [in] [, options]


where pnl_exp is an expression that is illustrated in the next section. The
options provide quantities not provided by predict that enable Wald
statistical inference on the predictions. In particular, the se(newvar2) option
creates a new variable containing standard errors for the prediction newvar
for each observation. These standard errors are computed using the delta
method detailed in section 11.3.11. Other options include variance(),
wald(), p(), and ci().


The predict and predictnl commands act on the currently defined
sample, with the if and in qualifiers used if desired to predict for a
subsample. One can inadvertently predict using a sample different from the
estimation sample. The if e(sample) qualifier ensures that the estimation
sample is used in prediction. At other times, it is desired to deliberately use
estimates from one sample to predict using a different sample. This can be
done by estimating with one sample, reading a new sample into memory, and
then predicting using this new sample.


13.5.2 Application of predict and predictnl 


The predicted mean number of doctor visits for each individual in the sample
can be computed by using the predict command with the default option. We
use the if e(sample) qualifier to ensure that prediction is for the same
sample as the estimation sample. This precaution is not necessary here but is
good practice to avoid inadvertent error. We also obtain the same prediction
by using predictnl; the se() option additionally obtains the standard error
of the prediction. We obtain


. * Predicted mean number of doctor visits using predict and predictnl 
. qui poisson docvis private chronic female income, vce(robust) 


. predict muhat if e(sample), n 


. predictnl muhat2 = exp(_b[private]*private + _b[chronic]*chronic 
> + _b[female]*female + _b[income]*income + _b[_cons]), se(semuhat2) 


. summarize docvis muhat muhat2 semuhat2


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      docvis |      4,412    3.957389    7.947601          0        134
       muhat |      4,412    3.957389    2.985057   .7947512   15.48004
      muhat2 |      4,412    3.957389    2.985057   .7947512   15.48004
    semuhat2 |      4,412    .2431483    .1980062   .0881166   3.944615


Here the average of the predictions of $E(y|\mathbf{x})$ is 3.957, equal to the average
of the $y$ values. This special property holds only for some estimators (OLS,
just-identified linear IV, Poisson, logit, and exponential with exponential
conditional mean), provided that these models include an intercept. The
standard deviation of the predictions is 2.985, less than that of $y$. The
predicted values range from 0.8 to 15.5 compared with a sample range of
0–134.


The model estimates \(E(y|x)\) quite precisely because, from the last row, 
the standard error of \(\exp(x_i'\hat\beta)\) as an estimate of 
\(\exp(x_i'\beta)\) is relatively small. This is not surprising because 
asymptotically \(\hat\beta \overset{p}{\to} \beta\), so 
\(\exp(x_i'\hat\beta) \overset{p}{\to} \exp(x_i'\beta)\). 


Much more difficult is using \(\exp(x_i'\hat\beta)\) to predict \(y_i|x_i\) 
rather than \(E(y_i|x_i)\) because there is always intrinsic randomness in 
\(y_i\). In our example, \(y_i\) without any regression has a standard 
deviation of 7.95. Even if Poisson regression explains the data well enough 
to reduce the standard deviation of \(y_i|x_i\) to, say, 4, any prediction of 
\(y_i|x_i\) will have a standard error of prediction of at least 4. 


More generally, with microeconometric data and a large sample, we can 
predict the conditional mean \(E(y_i|x_i)\) well but not \(y_i|x_i\). For 
example, we may predict well the mean earnings of a white female with 12 
years of schooling but will predict relatively poorly the earnings of a 
randomly chosen white female with 12 years of schooling. 


When the goal of prediction is to obtain a sample average predicted 
value, the sample average prediction should be a weighted average. To 
obtain a weighted average, specify weights with summarize or with mean; 
see section 3.8. This is especially important if one wants to make statements 
about the population and sampling is not simple random sampling. For 
average predictions, a simpler method is to use the margins command; see 
section 13.6. 
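

As a minimal sketch of the weighted average described above, the following 
commands compute a weighted mean of the predictions. The variable wt is a 
made-up weight generated purely for illustration; it is not part of the 
dataset and should be replaced by the survey's actual sampling weight. 

```stata
* Sketch: weighted sample-average prediction
* ASSUMPTION: wt is a hypothetical weight created only for illustration
use mus210mepsdocvisyoung, clear
keep if year02 == 1
quietly poisson docvis private chronic female income, vce(robust)
predict muhat if e(sample), n
generate double wt = cond(private == 1, 1, 2)   // hypothetical weights
summarize muhat [aweight = wt]                  // weighted mean of predictions
mean muhat [pweight = wt]                       // weighted mean via mean
```

The standard error reported by mean treats the predictions as data, so it 
does not reflect estimation error in the coefficients. 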


13.5.3 Out-of-sample prediction 


Out-of-sample prediction is possible. For example, we may want to make 
predictions for the 2001 sample using parameter estimates from the 2002 
sample. 


The dataset has data for both 2001 and 2002. We estimate using only 
2002 data, then restrict the data in memory to 2001 data and predict for 2001 
using the most recent coefficient estimates that were obtained using the 2002 
data. We have 


. * Out-of-sample prediction for year01 data using year02 estimates 
. qui use mus210mepsdocvisyoung, clear 


. qui poisson docvis private chronic female income if year02==1, vce(robust) 


. keep if year01 == 1 
(23,940 observations deleted) 


. predict muhatyear01, n 


. summarize docvis muhatyear01 


    Variable        Obs        Mean   Std. dev.        Min        Max 
      docvis      6,184    3.896345    7.873603          0        152 
 muhatyear01      6,184    4.086984    2.963843   .7947512   15.02366 


Note that the average of the predictions of E(y|x), 4.09, no longer equals 
the average of the y values. 


13.5.4 Prediction at a specified value of one of the regressors 


Suppose we want to calculate the sample average number of doctor visits if 
all individuals had private insurance, whereas all other regressors are 
unchanged. 


This can be done by setting private = 1 and using predict. To return 
to the original data after doing so, we use the commands preserve to 
preserve the current dataset and restore to return to the preserved dataset. 
We have 


. * Prediction at a particular value of one of the regressors 
. qui use mus210mepsdocvisyoung, clear 


. keep if year02 == 1 
(25,712 observations deleted) 


. qui poisson docvis private chronic female income, vce(robust) 
. preserve 


. replace private = 1 
(947 real changes made) 


. predict muhatpeq1, n 


. summarize muhatpeq1 


    Variable        Obs        Mean   Std. dev.        Min        Max 
   muhatpeq1      4,412    4.371656    2.927381   1.766392   15.48004 
. restore 


The conditional mean is predicted to be 4.37 visits when all have private 
insurance, compared with 3.96 in the sample where only 78% had private 
insurance. 


The preceding code calculates and stores the prediction for each 
individual. Usually, only the average of these predictions is required. Then it 
is much simpler to use the margins command, which also gives a 95% 
confidence interval for the average prediction; see section 13.6. 


13.5.5 Prediction at a specified value of all the regressors 


We may also want to estimate the conditional mean at a given value of all 
the regressors. For example, consider the number of doctor visits for a 
privately insured woman with no chronic conditions and an income of 
$10,000. 


To do so, we can use the lincom and nlcom commands. These commands compute 
point estimates for linear combinations (lincom) and nonlinear combinations 
(nlcom) of parameters, along with associated standard errors, z statistics, 
p-values, and confidence intervals. They are primarily intended to produce 
confidence intervals for parameter combinations such as 
\(\beta_3 - \beta_4\) and are presented in detail in sections 11.3.10 and 
11.3.11. They can also be used for prediction because a prediction is a 
function of a linear combination of the parameters. 


We need to predict the number of doctor visits when private = 1, 
chronic = 0, female = 1, and income = 10. The nlcom command has the 
form 


. * Predict at a specified value of all the regressors using nlcom 
. nlcom exp(_b[_cons]+_b[private]*1+_b[chronic]*0+_b[female]*1+_b[income]*10) 


       _nl_1: exp(_b[_cons]+_b[private]*1+_b[chronic]*0+_b[female]*1+_b[income]*10) 


      docvis   Coefficient   Std. err.       z    P>|z|    [95% conf. interval] 
       _nl_1      2.995338    .1837054   16.31    0.000    2.635282    3.355394 


A simpler command for our example uses lincom with the eform option to 
display the exponential. Coefficients are then more simply referred to as 
private, for example, rather than _b[private]. We have 


. * Predict at a specified value of all the regressors using lincom 
. lincom _cons + private*1 + chronic*0 + female*1 + income*10, eform 


 ( 1)  [docvis]private + [docvis]female + 10*[docvis]income + [docvis]_cons = 0 


      docvis      exp(b)   Std. err.       z    P>|z|    [95% conf. interval] 
         (1)    2.995338    .1837054   17.89    0.000    2.656081    3.377929 


The predicted conditional mean number of doctor visits is 3.00. The standard 
error of the prediction is 0.18, and a 95% confidence interval is [2.66, 3.38]. 
The standard error is computed with the delta method, and the bounds of the 
confidence interval depend on the standard error; see section 11.3.11. The 
test against a value of 0 is not relevant here but is relevant when lincom is 
used to test linear combinations of parameters. 


The relatively tight confidence interval is for \(\exp(x'\hat\beta)\) as an 
estimate of \(E(y|x) = \exp(x'\beta)\). If instead we want to predict the 
actual value of y given x, then the confidence interval will be much wider 
because we also need to add in variation in y around its conditional mean. 
There is considerably more noise in the prediction of the actual value than 
in estimating the conditional mean. 


An even simpler method, using the margins command, is presented in 
section 13.6. 


13.5.6 Prediction of other quantities 


We have focused on prediction of the conditional mean. Options of the 
predict command provide prediction of other quantities of interest, where 
these quantities vary with the estimation command. Usually, one or more 
residuals are available. Following poisson, the predict option score 
computes the residual \(y_i - \exp(x_i'\hat\beta)\). Examples of more 
command-specific predictions are those following the survival data command 
streg, which produce not only mean survival time but also median survival 
time, the hazard, and the relative hazard. 


For a discrete dependent variable, it can be of interest to obtain the 
predicted probability of each of the discrete values; that is, 
\(\Pr(y_i = 0)\), \(\Pr(y_i = 1)\), \(\Pr(y_i = 2)\), .... For binary logit 
and probit, the default pr option of predict gives \(\Pr(y_i = 1)\). The 
pr() option is available for many fully parametric models and gives the 
probability of a range of values as well as, for discrete outcomes, the 
probability of a particular value. For the use of 
predict in multinomial models, see section 18.4. 
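

As an illustration of the pr() option after poisson, the following sketch 
predicts the probability of zero visits and the probability of between one 
and three visits for each individual. The new variable names p0 and p1to3 
are our own choices for this example. 

```stata
* Sketch: predicted probabilities following poisson
use mus210mepsdocvisyoung, clear
keep if year02 == 1
quietly poisson docvis private chronic female income, vce(robust)
predict p0, pr(0)         // Pr(y = 0 | x) for each observation
predict p1to3, pr(1,3)    // Pr(1 <= y <= 3 | x) for each observation
summarize p0 p1to3
```

Comparing the mean of p0 with the sample fraction of zeros in docvis is one 
informal check of how well the Poisson model fits the lower tail. 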


13.6 Predictive margins 


Predictive margins for the conditional mean \(x'\beta\) following OLS regression 
using the regress command were presented in some detail in section 4.4. 
That section distinguished between sample-average margins and margins at 
particular values of regressor variables, plots using marginsplot, pairwise 
comparisons of predictive margins using margins, pwcompare, and multiple 
comparisons using margins, contrast. 


That material is all relevant for nonlinear models. We give a much 
shorter presentation here that focuses on the different quantities that are 
predicted in a nonlinear model. For example, following Poisson regression, 
interest lies in the conditional mean \(\exp(x'\beta)\). 


13.6.1 The margins command 


The margins command simplifies prediction at specified values of the 
regressors. It predicts 


\[ \mathrm{PM} = \frac{1}{N}\sum_{i=1}^{N} g(x_i^*, \hat\beta) \]


Different options of the command use different functions \(g(\cdot)\) and 
evaluate at different values \(x_i^*\), \(i = 1,\ldots,N\). 


The syntax of the command is 

margins [marginlist] [if] [in] [weight] [, response_options options] 


where marginlist is a list of variable names or of factor variables that appear 
in the current estimation results that are being analyzed, response_options 
specify the particular quantity to be computed, and options include particular 
values of regressors at which computation occurs. 


A key response_option is predict(), which defines the quantity 
predicted. For example, following the poisson command, the margins 
command can be used to compute the number of events (default option 
predict(n)), the incidence rate (option predict(ir)), the linear prediction 
(option predict(xb)), and probabilities (option predict(pr())). 


The response_option expression() allows the user to define the formula 
of interest. For example, margins, expression(exp(predict(xb))) 
following the poisson command provides predictive margins for 
\(\exp(x'\hat\beta)\). 


The response_options dydx (), eyex(), dyex(), and eydx() calculate MEs 
presented in section 13.7. Key options of the margins command are the 
at() and atmeans options illustrated below. The default computes the 
sample average value of the default margins prediction. 
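

The response options above can be tried in sequence. This is a minimal 
sketch using the same model as elsewhere in this chapter; each call reports 
a sample average of a different predicted quantity. 

```stata
* Sketch: the same margins call with different predicted quantities
use mus210mepsdocvisyoung, clear
keep if year02 == 1
quietly poisson docvis private chronic female income, vce(robust)
margins                                  // default: predicted number of events
margins, predict(xb)                     // average linear prediction x'b
margins, predict(pr(0))                  // average predicted Pr(y = 0)
margins, expression(exp(predict(xb)))    // user-defined formula, same as default
```

The first and last calls should report the same margin because the Poisson 
conditional mean is the exponential of the linear prediction. 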


13.6.2 Predictive margins: Average, at specified values, and at mean 


Following the poisson command, therefore, the margins command yields 
the sample average of the predicted number of events. We have 


. * margins: Sample average of predicted number of events 
. qui poisson docvis private chronic female income, vce(robust) 


. margins 


Predictive margins                                   Number of obs = 4,412 
Model VCE: Robust 


Expression: Predicted number of events, predict() 


                       Delta-method 
                Margin   std. err.       z    P>|z|    [95% conf. interval] 
       _cons  3.957389   .1115373    35.48    0.000     3.73878    4.175998 


A 95% confidence interval for the sample average predicted number of 
doctor visits is [3.74, 4.18]. 


To predict at specified values of one or more regressors, we use the at() 
option. 


For example, the average number of doctor visits if all individuals had 
private insurance, with other regressors equal to sample values, is calculated 
using the following command: 


. * margins: Sample average prediction at a particular value of a regressor 
. margins, at(private=1) 


Predictive margins                                   Number of obs = 4,412 
Model VCE: Robust 


Expression: Predicted number of events, predict() 
At: private = 1 


                       Delta-method 
                Margin   std. err.       z    P>|z|    [95% conf. interval] 
       _cons  4.371656   .1273427    34.33    0.000    4.122069    4.621243 


The average equals that given in section 13.5.4, with 95% confidence 
interval [4.12, 4.62]. 


The at() option can be used to compute the predicted number of doctor 
visits at specified values of all the regressors. For example, 


. * margins: Prediction at a specified value of all regressors 
. margins, at(private=1 chronic=0 female=1 income=10) 


Adjusted predictions                                 Number of obs = 4,412 
Model VCE: Robust 


Expression: Predicted number of events, predict() 
At: private = 1 
    chronic = 0 
    female  = 1 
    income  = 10 


                       Delta-method 
                Margin   std. err.       z    P>|z|    [95% conf. interval] 
       _cons  2.995338   .1837054    16.31    0.000    2.635282    3.355394 


The results coincide with those given in section 13.5.5. 


The atmeans option computes the predicted number of doctor visits 
when regressors are evaluated at the sample mean of regressors. From output 
not given, the margins, atmeans command yields \(\exp(\bar{x}'\hat\beta)\) 
equal to 3.02968. This is different from 
\((1/N)\sum_{i=1}^{N}\exp(x_i'\hat\beta)\), the average of the predicted 
number of doctor visits, which from previous output equals 3.957389. The 
average of a nonlinear function does not equal the nonlinear function 
evaluated at the average. 
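

The two quantities just compared can be obtained back to back. This is a 
short sketch using the same model; the only difference between the two 
calls is the atmeans option. 

```stata
* Sketch: average of predictions versus prediction at the average
use mus210mepsdocvisyoung, clear
keep if year02 == 1
quietly poisson docvis private chronic female income, vce(robust)
margins             // average over i of exp(x_i'b)
margins, atmeans    // exp(xbar'b), evaluated at the regressor means
```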


13.6.3 Predictive margins for a categorical factor variable 


When the estimation command variable list includes a categorical factor 
variable, we can easily obtain average predictions for each value of the 
factor variable. For example, for private insurance, we obtain 


. * margins: Sample avg. prediction at different values of an indicator variable 
. qui poisson docvis i.private chronic female income, vce(robust) 


. margins private 


Predictive margins                                   Number of obs = 4,412 
Model VCE: Robust 


Expression: Predicted number of events, predict() 


                       Delta-method 
                Margin   std. err.       z    P>|z|    [95% conf. interval] 
     private 
          0   1.966935   .2054776     9.57    0.000    1.564207    2.369664 
          1   4.371656   .1273427    34.33    0.000    4.122069    4.621243 


The i. operator is used in the estimation command to signal that variable 
private is a categorical variable. If the categorical variable instead took four 
values, say, then estimation would include three indicator variables (with a 
base category omitted) and the margins command would compute the 
average prediction for each of the four values of the categorical variable. 
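

The difference between the two predictive margins can also be requested 
directly. The following sketch uses the pwcompare option and the r. 
contrast operator, both of which were introduced for linear models in 
section 4.4. 

```stata
* Sketch: difference in predictive margins across levels of private
use mus210mepsdocvisyoung, clear
keep if year02 == 1
quietly poisson docvis i.private chronic female income, vce(robust)
margins private, pwcompare    // pairwise difference with confidence interval
margins r.private             // contrast of level 1 against the base level 0
```

From the output above, the difference in margins is 
4.371656 - 1.966935 = 2.404721, which is the finite-difference AME of 
private reported in section 13.7.5. 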


13.7 Marginal effects 


An ME, or partial effect, most often measures the effect on the conditional 
mean of y of a change in one of the regressors, say, \(x_j\). In the linear 
regression model, the ME equals the relevant slope coefficient, greatly 
simplifying analysis. For nonlinear models, this is no longer the case, leading 
to remarkably many different methods for calculating MEs. Also, other MEs 
may be desired, such as elasticities and effects on conditional probabilities 
rather than the conditional mean. A different type of ME, the marginal 
treatment effect (MTE), is presented at the end of this section. 


The presentation here extends the methods presented in detail in 
section 4.5 for linear regression to nonlinear models and is more detailed than 
the introductory treatment for nonlinear models in section 10.4. 


13.7.1 Calculus and finite-difference methods 


Calculus methods can be applied for a continuous regressor, and the ME of the 
jth regressor is then 


\[ \mathrm{ME}_j = \frac{\partial E(y|x = x^*)}{\partial x_j} \]


For the Poisson model with \(E(y|x) = \exp(x'\beta)\), we obtain 
\(\mathrm{ME}_j = \exp(x^{*\prime}\beta)\beta_j\). This ME is not simply the 
relevant parameter \(\beta_j\), and it varies with the point of evaluation 
\(x^*\). 


Calculus methods are not always appropriate. In particular, for an 
indicator variable, say, d, the relevant ME is the change in the conditional 
mean when d changes from 0 to 1. Let \(x = (z\ d)\), where z denotes all 
regressors other than the jth, which is an indicator variable d. Then the 
finite-difference method yields the ME 


\[ \mathrm{ME}_j = E(y|z = z^*, d = 1) - E(y|z = z^*, d = 0) \]


Even for continuous regressors, we may want to consider discrete changes, 
such as the impact of age increasing from 40 to 60. Letting \(x = (z\ w)\), 
the finite-difference method uses 


\[ \mathrm{ME}_j = E(y|z = z^*, w = 60) - E(y|z = z^*, w = 40) \]


The contrast and margins, contrast commands use the finite-difference 
method. 


For the linear regression model, with all regressors entering linearly 
rather than interactively, calculus and finite-difference methods give the same 
result. For nonlinear models, this is no longer the case. If interest lies in 
a unit change in the regressor, then in nonlinear models calculus methods 
provide only an approximation. 
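

A discrete change in a continuous regressor can be examined with the at() 
and contrast() options of margins. The sketch below compares average 
predictions with income set to 20 versus 10 for everyone; these two values 
are arbitrary choices made only for illustration. 

```stata
* Sketch: finite-difference effect of a discrete change in a continuous regressor
use mus210mepsdocvisyoung, clear
keep if year02 == 1
quietly poisson docvis i.private i.chronic i.female income, vce(robust)
* Average prediction at income = 10 and income = 20, then their difference
margins, at(income=(10 20)) contrast(atcontrast(r))
* Calculus-based AME of income for comparison (effect per one-unit change)
margins, dydx(income)
```

Multiplying the calculus-based AME by 10 gives only an approximation to the 
contrast reported by the first margins command because the conditional mean 
is nonlinear in income. 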


Finally, as for linear models, interactions in regressors lead to additional 
complications. 


13.7.2 Average marginal effect, ME at mean, and ME at a representative 
value 


For nonlinear models, the ME varies with the point of evaluation. Three 
common choices of evaluation are 1) at sample values and then average, 2) at 
the sample mean of the regressors, and 3) at representative values of the 
regressors. We use the following acronyms, where the first two follow 
Bartus (2005): 


AME   Average marginal effect                     Average of ME at each \(x = x_i\) 
MEM   Marginal effect at mean                     ME at \(x = \bar{x}\) 
MER   Marginal effect at a representative value   ME at \(x = x^*\) 


In a nonlinear model, these provide different ME estimates. The in-sample 
AME is often used because it provides a simple interpretation of results 
obtained from a nonlinear model when direct interpretation of the coefficients 
can be difficult. The MEM is used if interest lies in the ME for the average 
individual in the sample, while the MER is used if interest lies in the ME for an 
individual with a specific set of characteristics. 


The three quantities can be computed using the margins postestimation 
command with the dydx() option. Combinations of these methods are also 
possible. For example, several regressors may be set to a representative 
value, while the remaining regressors are evaluated at sample values and then 
the average is taken. 
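

A combination of this kind might look as follows. In this sketch, private 
is set to 1 for every observation, the remaining regressors are left at 
their sample values, and the ME of income is then averaged. 

```stata
* Sketch: ME of income with one regressor fixed and the rest at sample values
use mus210mepsdocvisyoung, clear
keep if year02 == 1
quietly poisson docvis i.private i.chronic i.female income, vce(robust)
margins, dydx(income) at(private=1)
```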


13.7.3 Simple interpretations of coefficients in single-index models 


In nonlinear models, coefficients are more difficult to interpret because now 
\(\beta_j \neq \partial E(y|x)/\partial x_j\). Nonetheless, some direct 
interpretation is possible if the conditional mean is of the single-index form 


\[ E(y|x) = m(x'\beta) \]


This single-index form implies that the ME is 


\[ \mathrm{ME}_j = m'(x'\beta) \times \beta_j \]


where \(m'(x'\beta)\) denotes the derivative of \(m(x'\beta)\) with respect 
to \(x'\beta\). 


Two important properties follow. First, if \(m(x'\beta)\) is monotonically 
increasing, so \(m'(x'\beta) > 0\) always, then the sign of \(\beta_j\) gives 
the sign of the ME (and if \(m(x'\beta)\) is monotonically decreasing, then 
the sign of \(\beta_j\) is the opposite of that of the ME). Second, for any 
function \(m(\cdot)\) and at any value of x, we have 


\[ \frac{\mathrm{ME}_j}{\mathrm{ME}_k} = \frac{\beta_j}{\beta_k} \]


Therefore, if one coefficient is twice as big as another, then so too is the ME. 
These two properties apply to most commonly used nonlinear regression 
models, aside from multinomial models. 


For example, from section 13.3.2, the regressor private has a coefficient 
of 0.80, and the regressor chronic has a coefficient of 1.09. It follows that 
having a chronic condition is associated with a bigger change in doctor visits 
than having private insurance because 1.09 > 0.80. The effects for both 
regressors are positive because the coefficients are positive and the 
conditional mean \(\exp(x'\beta)\) is a monotonically increasing function. 


Additional interpretation of coefficients can be possible for specific 
single-index models. In particular, for the exponential conditional mean 
\(\exp(x'\beta)\), \(\mathrm{ME}_j = E(y|x) \times \beta_j\). So 
\(\beta_j = \mathrm{ME}_j / E(y|x)\), and from (13.12) 
the regression coefficients can be interpreted as semielasticities. From 
section 13.3.2, the regressor income has a coefficient of 0.0036. It follows 
that a $1,000 increase in income (a one-unit increase in the rescaled regressor 
income) is associated with a 0.0036 proportionate increase, or a 0.36% 
increase, in doctor visits. 


Using instead the finite-difference method, a one-unit change in \(x_j\) 
implies that \(x'\beta\) changes to \(x'\beta + \beta_j\), so 
\(\mathrm{ME}_j = \exp(x'\beta + \beta_j) - \exp(x'\beta) 
= (e^{\beta_j} - 1)\exp(x'\beta)\). This is a proportionate increase of 
\((e^{\beta_j} - 1)\), or a percentage change of 
\(100 \times (e^{\beta_j} - 1)\). 
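

The two semielasticity interpretations can be computed from the stored 
coefficient. The sketch below displays both percentage changes for income 
and, as a cross-check, uses the eydx() option of margins and the irr 
replay option of poisson. 

```stata
* Sketch: semielasticity of doctor visits with respect to income
use mus210mepsdocvisyoung, clear
keep if year02 == 1
quietly poisson docvis private chronic female income, vce(robust)
display "Calculus method (%):          " 100*_b[income]
display "Finite-difference method (%): " 100*(exp(_b[income]) - 1)
margins, eydx(income)    // average of d ln E(y|x)/d income, equals _b[income]
poisson, irr             // replay results as exp(b); exp(b) - 1 is the
                         // proportionate change for a one-unit increase
```

For a coefficient as small as that of income, the two methods give nearly 
identical answers; they diverge for larger coefficients such as those of 
the binary regressors. 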


13.7.4 The dydx() option of the margins command for MEs 


The margins postestimation command is detailed in section 13.6.1. The 
dydx() option computes MEs. The variant dydx(*) computes the ME for all 
regressors. Alternatively, a subset of regressors can be explicitly listed in the 
parentheses. 


The default is to compute the AME. Alternatively, the MEM can be 
computed using the atmeans option, and the MER can be computed using the 
at() option. 


All of these MEs are, by default, calculated using calculus methods. To 
instead use the finite-difference method for binary regressors, one must use 
the i. operator in the preceding estimation command to signal that the 
regressor is a binary variable. This additional step is strongly preferred 
because the change of interest is from one value of the binary variable to the 
other, whereas calculus methods consider an infinitesimally small change 
whose effect is then scaled up to correspond to a large change. The two 
estimates differ in a nonlinear model. 


The default standard errors treat the regressors as fixed. If a population 
ME is desired and sample values of regressors are used in computing the ME, 
then the option vce(unconditional) additionally controls for variation in the 
regressors because of sampling; see section 13.7.9. 
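

The two sets of standard errors can be compared directly. This sketch 
assumes the model was fit with vce(robust), as it is throughout this 
chapter. 

```stata
* Sketch: AME standard errors with regressors fixed versus sampled
use mus210mepsdocvisyoung, clear
keep if year02 == 1
quietly poisson docvis i.private i.chronic i.female income, vce(robust)
margins, dydx(*)                        // default: regressors treated as fixed
margins, dydx(*) vce(unconditional)     // also allows for sampling of regressors
```

The point estimates are the same in both calls; only the standard errors 
change. 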


The default MEs are computed for the default prediction for the predict 
command. For many nonlinear models, such as the Poisson count model, the 
default MEs are therefore computed for \(E(y|x)\). For multinomial models, 
such as multinomial logit, the default MEs are instead computed for 
\(\Pr(y = j|x)\), \(j = 1,\ldots,m\), that is, the probability for each of 
the m outcomes. For models with censoring or selection, such as the tobit or 
heckman command, the default MEs are computed for the index \(x'\beta\) so 
that the default ME is simply the estimated regression coefficient. 


MEs for some other quantities can be computed using the 
predict(predict_option) option of the margins command, where the 
specific predict_option varies with the preceding estimation command. And 
the expression() option (see section 13.6.1) can be used to define a quantity 
of interest. Section 13.7.8 provides an ME example. 


13.7.5 Average marginal effect 


The margins command with just the dydx() option yields the AMEs for the 
default option of the predict command. After poisson, the default 
prediction is one for \(E(y|x)\). 


The default is to use calculus methods to compute the ME. For the three 
binary regressors, we instead use the finite-difference method by using the 
i. operator in the poisson estimation command before applying the margins 
command. 


For the doctor-visits data, with the i. operator used in estimation so that 
the AMEs for binary regressors are computed using the finite-difference 
method, we obtain the following: 


. * AMEs using margins command and finite differences 
. qui poisson docvis i.private i.chronic i.female income, vce(robust) 


. margins, dydx(*) 


Average marginal effects                             Number of obs = 4,412 
Model VCE: Robust 


Expression: Predicted number of events, predict() 
dy/dx wrt:  1.private 1.chronic 1.female income 


                       Delta-method 
                 dy/dx   std. err.       z    P>|z|    [95% conf. interval] 
   1.private  2.404721   .2438573     9.86    0.000    1.926769    2.882672 
   1.chronic  4.599174   .2886176    15.94    0.000    4.033494    5.164854 
    1.female  1.900212   .2156694     8.81    0.000    1.477508    2.322917 
      income  .0140765   .0043457     3.24    0.001    .0055591    .0225939 


Note: dy/dx for factor levels is the discrete change from the base level. 


In this example, the AMEs are about three times the estimated Poisson 
coefficients. 


When we average across individuals, people with private insurance have 
on average 2.40 more doctor visits than those without private insurance after 
controlling for income, gender, and chronic conditions. 


The AMEs for single-index models, such as the Poisson model, are in 
practice quite similar to the coefficients obtained by OLS regression of y on x. 
This is indeed the case here because the OLS coefficients, from output not 
given, are 1.92, 4.82, 1.89, and 0.016. 


Comparison of calculus and finite-difference methods 


To show that calculus and finite-difference methods can differ considerably, 
we repeat the command following the poisson command without using the 
i. operator to signal which regressors are binary variables. Then calculus 
methods are used for all variables. We obtain 


. * AMEs using margins command and only calculus method 
. qui poisson docvis private chronic female income, vce(robust) 


. margins, dydx(*) 


Average marginal effects                             Number of obs = 4,412 
Model VCE: Robust 


Expression: Predicted number of events, predict() 
dy/dx wrt:  private chronic female income 


                       Delta-method 
                 dy/dx   std. err.       z    P>|z|    [95% conf. interval] 
     private  3.160629   .4352572     7.26    0.000    2.307541    4.013717 
     chronic  4.320935   .2757872    15.67    0.000    3.780402    4.861468 
      female  1.949204   .2313726     8.42    0.000    1.495722    2.402686 
      income  .0140765   .0043457     3.24    0.001    .0055591    .0225939 


The AMEs for the three binary regressors change, respectively, from 2.40 to 
3.16, 4.60 to 4.32, and 1.90 to 1.95, while the AME for the continuous 
regressor income is unchanged. For binary regressors, the first method, using 
finite differences, is conceptually better. 


13.7.6 Marginal effect at the mean 


The margins command with the dydx() and atmeans options yields the ME 
evaluated at the mean for the default option of the predict command. After 
poisson, the default prediction is one for E(y|x), so the MEs give the change 
in the expected number of doctor visits when the regressor changes, 
evaluated at \(x = \bar{x}\). 


We have 


. * MEMs using margins command and finite differences 
. qui poisson docvis i.private i.chronic i.female income, vce(robust) 


. margins, dydx(*) atmeans 


Conditional marginal effects                         Number of obs = 4,412 
Model VCE: Robust 


Expression: Predicted number of events, predict() 
dy/dx wrt:  1.private 1.chronic 1.female income 


At: 0.private = .2146419 (mean) 
    1.private = .7853581 (mean) 
    0.chronic = .6736174 (mean) 
    1.chronic = .3263826 (mean) 
    0.female  = .5281052 (mean) 
    1.female  = .4718948 (mean) 
    income    = 34.34018 (mean) 


                       Delta-method 
                 dy/dx   std. err.       z    P>|z|    [95% conf. interval] 
   1.private  1.978178    .204407     9.68    0.000    1.577547    2.378808 
   1.chronic  4.200068   .2794096    15.03    0.000    3.652435      4.7477 
    1.female  1.528406   .1775762     8.61    0.000    1.180363    1.876449 
      income  .0107766   .0033149     3.25    0.001    .0042796    .0172737 


Note: dy/dx for factor levels is the discrete change from the base level. 


The output includes the regressor values at which the MEs are calculated. In 
subsequent examples, we use the noatlegend option to suppress this part of 
the output. 


For the average individual, where average means evaluation at the sample 
means of the regressors, the number of doctor visits increases by 1.98 for 
those with private insurance and by 0.011 with a $1,000 increase in annual 
income. 


For nonlinear models, average behavior of individuals differs from 
behavior of the average individual. The direction of the difference is 
generally indeterminate, but for models with an exponential conditional 
mean, it can be shown that the atmeans option (giving the MEM) will produce 
smaller MEs than the default (giving the AME), a consequence of the 
exponential function being globally convex. 
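

The convexity argument can be written out for the calculus-method MEs. This 
is a brief sketch of the reasoning, using Jensen's inequality for the convex 
function exp; the comparison is stated in absolute value so that it also 
covers a negative coefficient. 

```latex
\[
\mathrm{AME}_j = \hat\beta_j \,\frac{1}{N}\sum_{i=1}^{N}\exp(x_i'\hat\beta),
\qquad
\mathrm{MEM}_j = \hat\beta_j \exp(\bar{x}'\hat\beta)
\]
\[
\frac{1}{N}\sum_{i=1}^{N}\exp(x_i'\hat\beta)
\;\ge\; \exp\Big(\frac{1}{N}\sum_{i=1}^{N}x_i'\hat\beta\Big)
= \exp(\bar{x}'\hat\beta)
\quad\Longrightarrow\quad
|\mathrm{AME}_j| \ge |\mathrm{MEM}_j|
\]
```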


Thus, in this example, the MEMs are, respectively, 1.98, 4.20, 1.53, and 
0.011, while the AMEs are, respectively, 2.40, 4.60, 1.90, and 0.014. 


13.7.7 Marginal effect at a representative value 


The margins command with the dydx() and at() options yields the ME 
evaluated at a particular value of the regressors. 


As an example, we consider computing the MEs for a privately insured 
woman with no chronic conditions and income of $10,000. We use the 
i. operator in estimation so that the MEs for binary regressors are computed 
using the finite-difference method. We have 


. * MERs using margins command 
. qui poisson docvis i.private i.chronic i.female income, vce(robust) 


. margins, dydx(*) at(private=1 chronic=0 female=1 income=10) noatlegend 


Conditional marginal effects                         Number of obs = 4,412 
Model VCE: Robust 


Expression: Predicted number of events, predict() 
dy/dx wrt:  1.private 1.chronic 1.female income 


                       Delta-method 
                 dy/dx   std. err.       z    P>|z|    [95% conf. interval] 
   1.private  1.647648   .2072813     7.95    0.000    1.241385    2.053912 
   1.chronic  5.930251   .4017655    14.76    0.000    5.142805    6.717697 
    1.female  1.164985   .1546063     7.54    0.000    .8619621    1.468008 
      income  .0106545   .0028688     3.71    0.000    .0050317    .0162772 


Note: dy/dx for factor levels is the discrete change from the base level. 


For example, having private insurance is estimated to increase the number of 
doctor visits by 1.65, with 95% confidence interval [1.24, 2.05]. Note that the 
confidence interval is for the change in the expected number of visits E(y|x). 
A confidence interval for the change in the actual number of visits (y|x) will 
be much wider; see the analogous discussion for prediction in section 13.5.2. 


Related at() options allow evaluation at a mix of user-provided values 
and sample means of regressors. 
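

One way to obtain such a mix is to combine at() with atmeans. In this 
sketch, private is fixed at 1 and the remaining regressors are evaluated at 
their sample means. 

```stata
* Sketch: ME of income at private = 1 with other regressors at their means
use mus210mepsdocvisyoung, clear
keep if year02 == 1
quietly poisson docvis i.private i.chronic i.female income, vce(robust)
margins, dydx(income) at(private=1) atmeans
```

The legend printed by margins lists the value used for each regressor, which 
is a convenient check that the intended mix was applied. 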


13.7.8 MEs with respect to a user-defined quantity 


The expression() option of the margins command can be used to obtain 
MEs with respect to a user-defined quantity. For example, following Poisson 
regression, the preceding default AMEs for the conditional mean can 
equivalently be obtained using the command 


. * AMEs for the conditional mean using the expression() option 
. margins, dydx(*) expression(exp(predict(xb))) 


(output omitted) 


Here predict(xb) following Poisson regression gives \(x'\hat\beta\), and 
exponentiating then gives the conditional mean. 
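

Other quantities can be handled in the same way. The following sketch 
computes AMEs for the probability of zero visits and for the log of the 
conditional mean; both quantities are our own choices for illustration. 

```stata
* Sketch: AMEs for quantities other than the conditional mean
use mus210mepsdocvisyoung, clear
keep if year02 == 1
quietly poisson docvis i.private i.chronic i.female income, vce(robust)
margins, dydx(*) predict(pr(0))                     // AMEs on Pr(y = 0 | x)
margins, dydx(income) expression(ln(predict(n)))    // AME on ln E(y|x)
```

Because ln E(y|x) is linear in the regressors for the Poisson model, the 
second command should return the coefficient of income. 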


13.7.9 Which ME to use? 


The three MEs (AME, MEM, and MER) can differ appreciably in nonlinear 
models. Which ME should be used? It is common to use the AME in reporting 
regression results. For policy analysis, one may want to instead obtain the 
MER for targeted values of the regressors by using the at() option of the 
margins command or the MEM for the average individual by using the 
atmeans option. 


The AME is most often reported as an in-sample AME. If instead this AME is 
to be interpreted as a population AME, then standard errors for the AME should 
be computed using the vce(unconditional) option of the margins command 
because it additionally controls for variation in the regressors due to 
sampling. This is appropriate if interest lies in extrapolating a policy 
impact measured in one setting, such as a particular school district, to 
another school district drawn from a population of similar school districts. 
There are then two sources of randomness in computing MEs: randomness in 
the parameter estimates and randomness in the regressors. The first of these is 
usually much larger; in the preceding AME example, adding 
vce(unconditional) leads to an increase in AME standard errors of less than 
1%. 


Microeconometric studies often use survey data that are stratified on 
exogenous regressors, in which case the estimators are consistent but the 
regressors x may be unrepresentative of the population. For example, a 
nonlinear regression of earnings (y) on schooling (x) may use a dataset for 
which individuals with low levels of schooling are oversampled, so that the 
sample mean \(\bar{x}\) is less than the population mean. Then the MER can 
be used with no modification. But the MEM and AME should be evaluated by 
using sampling weights, introduced in section 3.8. In particular, the AME 
should be computed as a weighted average, using sampling weights, of the ME 
for each individual. Estimation itself is most often unweighted, though if 
weights are used in estimation, then calculations in margins automatically 
adjust for any weights used during estimation; this can be changed by using 
the noweights option of the margins command. 


13.7.10 AME computed manually 


The AME can always be computed manually using the following method. 
Predict at the current sample values for all observations, change one regressor 
by a small amount, predict at the new values, subtract the two predictions, 
and divide by the amount of the change. The AME is the average of this 
quantity. By choosing a very small change, we replicate the calculus method. 
A finite-difference estimate is obtained by considering a large change such as 
a one-unit change (whether this is large depends on the scaling of the 
regressors). In either case, we use the preserve and restore commands to 
return to the original data after computing the AME. 


We consider the effect of a small change in income on doctor visits. This 
yields a crude approximation to the derivative. For more precise numerical 
estimates of the derivative, see section 5.7 of Press et al. (2007). We have 


. * AME computed manually for a single regressor 
. qui use mus210mepsdocvisyoung, clear 

. keep if year02 == 1 
(25,712 observations deleted) 

. qui poisson docvis private chronic female income, vce(robust) 

. preserve 

. predict mu0, n 

. qui replace income = income + 0.01 

. predict mu1, n 

. generate memanual = (mu1-mu0)/0.01 

. summarize memanual 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
    memanual |      4,412    .0140761    .0106173   .0028253    .055027 

. restore 


The AME estimate is 0.0140761, essentially the same as the 0.0140765 
obtained by using margins in section 13.7.5. This method gives no standard 
error for the AME estimate. Instead, it computes the standard deviation of the 
AME for the 4,412 observations. 


A better procedure chooses a change that varies with the scaling of each 
regressor. We use a change equal to the standard deviation of the regressor 
divided by 1,000. We also use the looping command foreach to obtain the 
AME for each variable. We have the following: 


. * AME computed manually for all regressors 
. global xlist private chronic female income 


. preserve 
. predict mu0, n 


. foreach var of varlist $xlist { 
  2.     qui summarize `var' 
  3.     generate delta = r(sd)/1000 
  4.     qui generate orig = `var' 
  5.     qui replace `var' = `var' + delta 
  6.     predict mu1, n 
  7.     qui generate me_`var' = (mu1 - mu0)/delta 
  8.     qui replace `var' = orig 
  9.     drop mu1 delta orig 
 10. } 

. summarize me_* 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
  me_private |      4,412     3.16153    2.384785   .6349193   12.36743 
  me_chronic |      4,412    4.322181    3.260333   .8679963   16.90794 
   me_female |      4,412    1.949399    1.470413   .3915812   7.625329 
   me_income |      4,412    .0140772    .0106184   .0028284    .055073 

. restore 


The AME estimate for income is the average 0.0140772, essentially the same 
as the 0.0140765 produced by using the margins, dydx() command. The 
other AMEs differ from those given in section 13.7.5 because here we used 
calculus methods, whereas in section 13.7.5, the AMEs for binary regressors 
were computed using the finite-difference method. 


The code can clearly be adapted to use finite differences in those cases. In 
nonstandard models, such as those fit by using the ml command, it will be 
necessary to also provide code to replace the predict command. 


Note that the reported standard deviation, such as 0.0106184 for 
me_income, is the standard deviation across individuals of the estimated ME. 
To compute the standard error of the AME, one then needs to embed the 
preceding code in a bootstrap loop. 
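
One way to set up such a bootstrap is sketched below. This is a minimal sketch under stated assumptions, not the book's own code: the program name ame_income, the 400 replications, and the seed are hypothetical choices made for illustration, and the estimation sample (year02 == 1) is assumed to be in memory. The model is refit inside the program so that each bootstrap replication reflects the randomness in the parameter estimates as well as in the regressors.

```stata
* Hypothetical wrapper: returns the manually computed AME of income in r(ame)
capture program drop ame_income
program define ame_income, rclass
    version 17
    tempvar mu0 mu1 orig me
    * Refit the model on whatever sample is currently in memory
    quietly poisson docvis private chronic female income
    quietly predict double `mu0', n
    * Perturb income by a small amount, predict again, then undo the change
    quietly generate double `orig' = income
    quietly replace income = income + 0.01
    quietly predict double `mu1', n
    quietly replace income = `orig'
    * Individual MEs and their average
    quietly generate double `me' = (`mu1' - `mu0')/0.01
    quietly summarize `me', meanonly
    return scalar ame = r(mean)
end

* Resample observations, rerun the program, and collect r(ame) each time
bootstrap ame = r(ame), reps(400) seed(10101): ame_income
```

The bootstrap standard error reported for ame can then be compared with the delta-method standard error from margins, dydx(income).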


13.7.11 Polynomial regressors 


Regressors may appear as polynomials. Then computing MEs becomes 
considerably more complicated. 


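First, consider a linear model that includes a cubic function in regressor $z$. 
Then, 

$$
\begin{aligned}
E(y|\mathbf{x},z) &= \mathbf{x}'\boldsymbol{\beta} + \alpha_1 z + \alpha_2 z^2 + \alpha_3 z^3 \\
\Rightarrow \mathrm{ME}_z &= \alpha_1 + 2\alpha_2 z + 3\alpha_3 z^2
\end{aligned}
$$

The AME can be computed by calculating $\widehat{\alpha}_1 + 2\widehat{\alpha}_2 z + 3\widehat{\alpha}_3 z^2$ for each 
observation and averaging. 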
We do so for a slightly more difficult example, the Poisson model. Then, 


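$$
\begin{aligned}
E(y|\mathbf{x},z) &= \exp(\mathbf{x}'\boldsymbol{\beta} + \alpha_1 z + \alpha_2 z^2 + \alpha_3 z^3) \\
\Rightarrow \mathrm{ME}_z &= \exp(\mathbf{x}'\boldsymbol{\beta} + \alpha_1 z + \alpha_2 z^2 + \alpha_3 z^3) \times (\alpha_1 + 2\alpha_2 z + 3\alpha_3 z^2)
\end{aligned}
$$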


A cubic function in income is part of our model for doctor visits. Below, 
we estimate the parameters of the model by ML. 


. * AME for a polynomial regressor: Manual computation 
. generate inc2 = income^2 

. generate inc3 = income^3 

. qui poisson docvis private chronic female income inc2 inc3, vce(robust) 

. predict muhat, n 

. generate me_income = muhat*(_b[income]+2*_b[inc2]*income+3*_b[inc3]*inc2) 

. summarize me_income 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
   me_income |      4,412    .0178233    .0137618  -.0534614   .0483436 


The code uses the simplification that 
$\mathrm{ME}_z = E(y|\mathbf{x},z) \times (\alpha_1 + 2\alpha_2 z + 3\alpha_3 z^2)$. The average of the individual MEs of 
a change in income is 0.0178 in the cubic model, compared with 0.0141 
when income enters only linearly. 


This AME can be more simply computed using factor variables. From 
section 1.3.4, a polynomial in variable income can be specified using the 
c. operator (for a continuous variable) and the # operator for interaction. The 
ME of income can then be computed using the usual margins command. We 
have 


. * AME for a polynomial regressor: Computation using factor variables 
. qui poisson docvis private chronic female c.income c.income#c.income 
>     c.income#c.income#c.income, vce(robust) 

. margins, dydx(income) 

Average marginal effects                        Number of obs = 4,412 
Model VCE: Robust 

Expression: Predicted number of events, predict() 
dy/dx wrt:  income 

             |            Delta-method 
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
      income |   .0178233   .0062332     2.86   0.004     .0056064    .0300402 


The AME equals that computed manually, but additionally, we have a 95% 
confidence interval of [0.0056, 0.0300]. The MEM and MER can also be 
computed with the atmeans and at() options. 


13.7.12 Interacted regressors 


Similar issues arise with regressors interacted with indicator variables. For 
example, 


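$$
\begin{aligned}
E(y|\mathbf{x},z,d) &= \exp(\mathbf{x}'\boldsymbol{\beta} + \alpha_1 z + \alpha_2 d + \alpha_3 d \times z) \\
\Rightarrow \mathrm{ME}_z &= \exp(\mathbf{x}'\boldsymbol{\beta} + \alpha_1 z + \alpha_2 d + \alpha_3 d \times z) \times (\alpha_1 + \alpha_3 d)
\end{aligned}
$$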


Long and Freese (2014) give an extensive discussion of MEs with 
interactive regressors, performed by using the community-contributed 
prvalue command presented in section 17.5.7. These are oriented toward 
calculation of the MEM or MER rather than the AME. 


Factor variables automate the computation of MEs in models with 
interactions. As an example, we modify the doctor-visits model to include an 
interaction between the binary regressor female and the continuous regressor 
income. 


We first need to specify the model fit with factor variables, using the 
relevant c., i., and # operators. We have 


. * Specify model with interacted regressors using factor variables 
. poisson docvis private chronic i.female c.income i.female#c.income, 
>     vce(robust) nolog 

Poisson regression                              Number of obs =  4,412 
                                                Wald chi2(5)  = 606.43 
                                                Prob > chi2   = 0.0000 
Log pseudolikelihood = -18475.536               Pseudo R2     = 0.1943 

                 |               Robust 
          docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-----------------+---------------------------------------------------------------- 
         private |    .802035   .1084187     7.40   0.000     .5895383    1.014532 
         chronic |   1.094331   .0563504    19.42   0.000     .9838865    1.204776 
        1.female |   .6328542   .0927712     6.82   0.000      .451026    .8146823 
          income |   .0051734   .0015708     3.29   0.001     .0020946    .0082522 
                 | 
 female#c.income | 
              1  |  -.0035176   .0019089    -1.84   0.065    -.0072589    .0002237 
                 | 
           _cons |  -.3082426   .1242329    -2.48   0.013    -.5517346   -.0647506 


The coefficient of female is labeled 1.female, and the coefficient of 
female $\times$ income is labeled 1.female#c.income. 


We then compute the AME of variables female and income. We have 


. * AME with interacted regressors given model specified using factor variables 
. margins, dydx(female income) 

Average marginal effects                        Number of obs = 4,412 
Model VCE: Robust 

Expression: Predicted number of events, predict() 
dy/dx wrt:  1.female income 

             |            Delta-method 
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
    1.female |   1.884021   .2159008     8.73   0.000     1.460864    2.307179 
      income |   .0116832   .0037989     3.08   0.002     .0042375     .019129 

Note: dy/dx for factor levels is the discrete change from the base level. 


The average of the individual MEs of a change in income is 0.0117 compared 
with 0.0178 in the cubic model and 0.0141 when income enters only linearly. 
The AME of moving from male (female = 0) to female (female = 1) is 
1.884 office visits, compared with 1.900 from section 13.7.5 where there was 
no interaction. The output includes 95% confidence intervals for the AMEs. 
The MEMs and MERs can alternatively be computed, using the atmeans and 
at() options of the margins command. 
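
The AME of income in the interacted model can also be checked against the analytical expression $\mathrm{ME}_z = \exp(\cdot) \times (\alpha_1 + \alpha_3 d)$ given at the start of this section. The following is a minimal sketch that assumes the factor-variable Poisson model with i.female#c.income is still the active estimation result; the variable names mu_int and me_int are hypothetical and chosen only to avoid clashing with variables created earlier.

```stata
* Predicted conditional mean from the interacted model
predict double mu_int, n

* Individual ME of income: mu x (coefficient of income + interaction coefficient x female)
generate double me_int = mu_int*(_b[income] + _b[1.female#c.income]*female)

* The mean of me_int is the AME of income
summarize me_int
```

The mean of me_int should be close to the AME of income reported by margins, dydx(female income), because margins differentiates the same prediction function observation by observation.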


13.7.13 Complex interactions and nonlinearities 


MEs in models with interactions can become very difficult to interpret and 
calculate. 


For complex interactions, a simple procedure is to compute the ME by 
manually changing the relevant variables and interactions, recomputing the 
predicted conditional mean, and subtracting. We change by an amount $\Delta$ a 
single variable that, because of interactions or polynomials, or both, appears 
several times as a regressor. Let the original values of $\mathbf{x}_i$ be denoted by $\mathbf{x}_{i0}$, 
and obtain the prediction $\widehat{\mu}_{i0} = \exp(\mathbf{x}_{i0}'\widehat{\boldsymbol{\beta}})$. Then, change the variable by $\Delta$ 
to give new values of $\mathbf{x}_i$ denoted by $\mathbf{x}_{i1}$, and obtain the prediction 
$\widehat{\mu}_{i1} = \exp(\mathbf{x}_{i1}'\widehat{\boldsymbol{\beta}})$. Then, the ME of changing the variable is $(\widehat{\mu}_{i1} - \widehat{\mu}_{i0})/\Delta$. 


We illustrate this for the cubic in income example. 


. * AME computed manually for a complex model 
. preserve 

. predict mu0, n 

. qui summarize income 

. generate delta = r(sd)/100 

. qui replace income = income + delta 

. qui replace inc2 = income^2 

. qui replace inc3 = income^3 

. predict mu1, n 

. generate me_inc = (mu1 - mu0)/delta 

. summarize me_inc 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
      me_inc |      4,412    .0116899    .0089482   .0022914   .1083069 

. restore 


This reproduces the calculus result because it considers a small change in 
income, here one-hundredth of the standard deviation of income. If instead 
we had used delta=1, then this program would have given the ME of a one- 
unit change in the regressor income (here a $1,000 change) computed by 
using the finite-difference method. 


With complex interactions, one can instead use factor variables, as shown in 
section 13.7.12, although care is needed to correctly specify the factor 
variables and their interactions in the initial model estimation command. 


13.7.14 Elasticities and semielasticities 


The impact of changes in a regressor on the dependent variable can also be 
measured by using elasticities and semielasticities. 


For simplicity, we consider a scalar regressor x, and the effect of a change 
in x on E(y|x), which we write more simply as y. Then the ME using the 
finite-difference method is given by 


$$
\mathrm{ME} = \frac{\Delta y}{\Delta x}
$$


This measures the change in y associated with a one-unit change in x. We 
present elasticities based on finite differences. If instead calculus methods are 
used, we replace $\Delta y/\Delta x$ in the equations below with the derivative $\partial y/\partial x$. 


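An elasticity instead measures the proportionate change in $y$ associated 
with a given proportionate change in $x$. More formally, the elasticity $\varepsilon$ is 
given by 

$$
\varepsilon = \frac{\Delta y/y}{\Delta x/x} = \frac{\Delta y}{\Delta x} \times \frac{x}{y} = \mathrm{ME} \times \frac{x}{y} \qquad (13.11)
$$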


For example, if $y = 1 + 2x$, then $\Delta y/\Delta x = 2$ and the elasticity at $x = 3$, 
where $y = 7$, equals $2 \times 3/7 = 6/7 = 0.86$. This can be interpreted as follows: a 1% 
increase in $x$ is associated with a 0.86% increase in $y$. 


Elasticities can be more useful than MEs because they are scale-free 
measures. For example, suppose we estimate that a $1,000 increase in annual 
income is associated with 0.1 more doctor visits per year. Whether this is a 
large or small effect depends on whether these changes in income and doctor 
visits are large or small. Given knowledge that the sample means of income 
and doctor visits are, respectively, $34,000 and 4, the elasticity 
$\varepsilon = 0.1 \times 34/4 = 0.85$. This is a large effect. For example, a 10% increase in 
income is associated with an 8.5% increase in doctor visits. 


A semielasticity is a hybrid of an ME and an elasticity that measures the 
proportionate change in y associated with a one-unit change in x. The 
semielasticity is given by 


$$
\frac{\Delta y/y}{\Delta x} = \frac{\Delta y}{\Delta x} \times \frac{1}{y} = \mathrm{ME} \times \frac{1}{y} \qquad (13.12)
$$


For the preceding example, the semielasticity is 0.1/4 = 0.025, so a $1,000 
increase in income (a one-unit change given that income is measured in 
thousands of dollars) is associated with a 0.025 proportionate increase, or a 
2.5% increase, in doctor visits. 


Less used is the unit change in y associated with a proportionate change 
in x, given by 


$$
\frac{\Delta y}{\Delta x/x} = \frac{\Delta y}{\Delta x} \times x = \mathrm{ME} \times x
$$


These four quantities can be computed by using various options of the 
margins command, given in section 13.7.4. An illustration is given in 
section 13.7.15. 


13.7.15 Elasticities and semielasticities example 


The elasticities and semielasticities defined in section 13.7.14 can be 
computed by using the margins options eyex(), eydx(), and dyex(). 
Evaluation can be at each sample value and then averaged (the default 
option), at the mean value of regressors (the atmeans option), or at a 
representative value of the regressors (the at() option). 


We continue with the same Poisson regression example, with four 
regressors, but we focus on the impact of just the regressor income. For 
illustration, we evaluate elasticities and semielasticities at the mean value of 
regressors by using the atmeans option. 


We first obtain the ME with the dydx() option. 


. * Usual ME evaluated at means of regressors 
. qui poisson docvis private chronic female income, vce(robust) 

. margins, dydx(income) atmeans noatlegend 

Conditional marginal effects                    Number of obs = 4,412 
Model VCE: Robust 

Expression: Predicted number of events, predict() 
dy/dx wrt:  income 

             |            Delta-method 
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
      income |   .0107766   .0033149     3.25   0.001     .0042796    .0172737 


For the average individual, the number of doctor visits increases by 0.0108 
with a $1,000 increase in annual income. This repeats the result given in 
section 13.7.6. 


We next compute the elasticity. The eyex() option yields 


. * Elasticity evaluated at means of regressors 
. margins, eyex(income) atmeans noatlegend 

Conditional marginal effects                    Number of obs = 4,412 
Model VCE: Robust 

Expression: Predicted number of events, predict() 
ey/ex wrt:  income 

             |            Delta-method 
             |      ey/ex   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
      income |   .1221485   .0371729     3.29   0.001      .049291     .195006 


The elasticity is 0.122, so for the average individual, a 1% increase in income 
is associated with a 0.122% increase in doctor visits, or a 10% increase in 
income is associated with a 1.22% increase in doctor visits. The elasticity 
equals $\mathrm{ME} \times x/y$ from (13.11), where evaluation here is at $x = 34.34$ (the 
sample mean of income) and $y = 3.03$ (the predicted number of doctor visits 
for $\mathbf{x} = \bar{\mathbf{x}}$ computed using the margins, atmeans command). This yields 
$0.0108 \times 34.34/3.03 = 0.122$ as given in the above output. 


The semielasticity is obtained with the eydx() option: 


. * Semielasticity evaluated at means of regressors 
. margins, eydx(income) atmeans noatlegend 

Conditional marginal effects                    Number of obs = 4,412 
Model VCE: Robust 

Expression: Predicted number of events, predict() 
ey/dx wrt:  income 

             |            Delta-method 
             |      ey/dx   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
      income |    .003557   .0010825     3.29   0.001     .0014354    .0056787 


For the average individual, a $1,000 increase in annual income (a one-unit 
change in income) is associated with a 0.003557 proportionate rise, or a 
0.3557% increase in the number of doctor visits. This exactly equals the 
coefficient of income in the original Poisson regression (see section 13.3.2), 
confirming that if the conditional mean is of exponential form, then the 
coefficient $\beta_j$ is already a semielasticity, as explained in section 13.7.3. 


Finally, the dyex() option yields 


. * Other semielasticity evaluated at means of regressors 
. margins, dyex(income) atmeans noatlegend 

Conditional marginal effects                    Number of obs = 4,412 
Model VCE: Robust 

Expression: Predicted number of events, predict() 
dy/ex wrt:  income 

             |            Delta-method 
             |      dy/ex   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
      income |   .3700708   .1138338     3.25   0.001     .1469607    .5931809 


A proportionate increase of one in income (a doubling of income) is 
associated with 0.37 more doctor visits. Equivalently, a 1% increase in 
income is associated with 0.0037 more doctor visits. 
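
The relationships in section 13.7.14 can be verified directly from stored results. The sketch below is illustrative only: it assumes the doctor-visits sample (year02 == 1) is in memory, and the matrix and scalar names M, yhat_mean, me_mean, and xbar are hypothetical. It rebuilds the elasticity and both semielasticities from the ME, the mean of income, and the prediction at the regressor means.

```stata
quietly poisson docvis private chronic female income, vce(robust)

* Predicted number of doctor visits at the regressor means
quietly margins, atmeans
matrix M = r(b)
scalar yhat_mean = M[1,1]

* ME of income at the regressor means
quietly margins, dydx(income) atmeans
matrix M = r(b)
scalar me_mean = M[1,1]

* Sample mean of income over the estimation sample
quietly summarize income if e(sample)
scalar xbar = r(mean)

display "elasticity (eyex)     = ME x x/y = " me_mean*xbar/yhat_mean
display "semielasticity (eydx) = ME / y   = " me_mean/yhat_mean
display "dy/ex                 = ME x x   = " me_mean*xbar
```

The three displayed values should agree, up to rounding, with the eyex(), eydx(), and dyex() estimates reported above.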


13.7.16 Marginal treatment effects 


The treatment-effects literature measures the effect on an outcome variable of 
a discrete change in a treatment variable, often a binary variable, in a setting 
where responses to the treatment variable differ across individuals. Key 
measures of treatment effects covered in this book include average treatment 
effects, average treatment effect on the treated, and local average treatment 
effect. As Heckman and Vytlacil (2007) have shown, many popular measures 
of treatment effects can be interpreted as weighted sums of marginal treatment 
effects (MTEs), usually estimated in a regression framework. In that sense, the 
MTE is a building block for constructing alternative measures of treatment 
effects. Chapters 24 and 25 provide several examples. 


13.8 Model diagnostics 


As for the linear model, the modeling process follows a cycle of estimation, 
diagnostic checks, and model respecification. Here we briefly summarize 
diagnostic checks, with model-specific checks deferred to the models 
chapters. 


13.8.1 Goodness-of-fit measures 


The $R^2$ in the linear model does not extend easily to nonlinear models. 
When fitting by NLS a nonlinear model with additive errors, $y = m(\mathbf{x}'\boldsymbol{\beta}) + u$, 
the residual sum of squares (RSS) plus the model sum of squares (MSS) do 
not sum to the total sum of squares (TSS). So the three measures MSS/TSS, 
$1 - \mathrm{RSS}/\mathrm{TSS}$, and $\widehat{\rho}^2_{y\widehat{y}}$ [the squared correlation between $y$ and $m(\mathbf{x}'\widehat{\boldsymbol{\beta}})$] 
differ. By contrast, they all coincide for OLS estimation of the linear model 
with an intercept. Furthermore, many nonlinear models are based on the 
distribution of y and do not have a natural interpretation as a model with an 
additive error. 


A fairly universal measure of association in nonlinear models is $\widehat{\rho}^2_{y\widehat{y}}$, the 
squared correlation between $y$ and $\widehat{y}$. This has a tangible interpretation and 
can be used provided that the model yields a fitted value $\widehat{y}$. This is the case 
for most commonly used models except multinomial models. 
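
For the doctor-visits Poisson model, this squared correlation can be computed in a few lines. This is a minimal sketch; the variable name yhat_chk is hypothetical, and the estimation sample is assumed to be in memory.

```stata
quietly poisson docvis private chronic female income, vce(robust)

* Fitted conditional mean, used as the fitted value of y
predict double yhat_chk, n

* Squared correlation between the outcome and its fitted value
quietly correlate docvis yhat_chk
display "Squared correlation of y and yhat = " r(rho)^2
```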


For ML estimators, Stata reports a pseudo-$R^2$ defined as 

$$
\widetilde{R}^2 = 1 - \frac{\ln L_{\mathrm{fit}}}{\ln L_0} \qquad (13.13)
$$

where $\ln L_0$ is the log likelihood of an intercept-only model and $\ln L_{\mathrm{fit}}$ is the 
log likelihood of the fitted model. 


For doctor visits, we have 


. * Compute pseudo-R-squared after Poisson regression 
. qui poisson docvis private chronic female income, vce(robust) 


. display "Pseudo-R^2 = " 1 - e(ll)/e(ll_0) 
Pseudo-R^2 = .19303857 


This equals the statistic Pseudo R2 that is provided as part of poisson output; 
see section 13.3.2. 


For discrete dependent variables, $\widetilde{R}^2$ has the desirable properties that 
$\widetilde{R}^2 \geq 0$, provided that an intercept is included in the model, and $\widetilde{R}^2$ increases 
as regressors are added for models fit by ML. For binary and multinomial 
models, the upper bound for $\widetilde{R}^2$ is 1, whereas for other discrete data models 
such as Poisson, the upper bound for $\widetilde{R}^2$ is less than 1. For continuous data, 
these desirable properties disappear, and it is possible that $\widetilde{R}^2 > 1$ or $\widetilde{R}^2 < 0$ 
and that $\widetilde{R}^2$ does not increase as regressors are added. 


To understand the properties of $\widetilde{R}^2$, let $\ln L_{\max}$ denote the largest 
possible value of $\ln L(\boldsymbol{\theta})$. Then we can compare the actual gain in the 
objective function due to inclusion of regressors compared with the 
maximum possible gain, giving the relative gain measure 

$$
R^2_{\mathrm{RG}} = \frac{\ln L_{\mathrm{fit}} - \ln L_0}{\ln L_{\max} - \ln L_0} = 1 - \frac{\ln L_{\max} - \ln L_{\mathrm{fit}}}{\ln L_{\max} - \ln L_0}
$$

In general, $\ln L_{\max}$ is not known, however, making it difficult to implement 
this measure (see Cameron and Windmeijer [1997]). For binary and 
multinomial models, it can be shown that $\ln L_{\max} = 0$ because perfect 
ability to model the multinomial outcome gives a probability mass function 
with a value of 1 and a natural logarithm of 0. Then $R^2_{\mathrm{RG}}$ simplifies to $\widetilde{R}^2$, 
given in (13.13). For other discrete models, such as Poisson, $\ln L_{\max} < 0$ 
because the probability mass function takes a maximum value of less than 1, 
so the maximum $\widetilde{R}^2 < 1$. For continuous density, the log density can exceed 
0, so it is possible that $\ln L_{\mathrm{fit}} > \ln L_0 > 0$ and $\widetilde{R}^2 < 0$. An example is given 
as an end-of-chapter exercise. 
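
For the Poisson model specifically, the relative gain measure can be illustrated if one is willing to take the largest possible log likelihood to be that of the saturated model, in which the mean of each observation is set equal to its observed count. That choice is an assumption of this sketch rather than something stated in the text, and the names ll_sat and lnLmax are hypothetical.

```stata
quietly poisson docvis private chronic female income, vce(robust)

* Saturated Poisson log density with mu_i = y_i: -y + y*ln(y) - ln(y!),
* where y*ln(y) is taken to be 0 when y = 0
generate double ll_sat = -docvis + cond(docvis > 0, docvis*ln(docvis), 0) - lnfactorial(docvis)
quietly summarize ll_sat
scalar lnLmax = r(sum)

display "ln L_max          = " lnLmax
display "Relative gain R2  = " (e(ll) - e(ll_0))/(lnLmax - e(ll_0))
display "Pseudo-R2 (13.13) = " 1 - e(ll)/e(ll_0)
```

Because the saturated log likelihood is negative for count data, the relative gain measure computed this way is at least as large as the pseudo-$R^2$ whenever the fitted model improves on the intercept-only model.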


13.8.2 Information criteria for model comparison 


For ML models that are nested in each other, we can discriminate between 
models on the basis of a likelihood-ratio (LR) test of the restrictions that 
reduce one model to the other; see section 11.4. 


For ML models that are nonnested, a standard procedure is to use 
information criteria. Two standard measures are Akaike’s information 
criterion (AIC) and Schwarz’s Bayesian information criterion (BIC). Different 
references use different scalings of these measures. Stata uses 


$$
\begin{aligned}
\mathrm{AIC} &= -2\ln L + 2K \\
\mathrm{BIC} &= -2\ln L + K\ln N
\end{aligned}
$$


Smaller AIC and BIC are preferred because higher log likelihood is preferred. 
The quantities 2K and K ln N are penalties for model size. 


If the models are actually nested, then an LR test statistic [equal to 
$\Delta(2\ln L)$] could be used. Then the larger model is favored at a level of 0.05 
if $\Delta(2\ln L)$ exceeds $\chi^2_{0.05}(\Delta K)$. By comparison, the AIC favors the 
larger model if $\Delta(2\ln L)$ exceeds $2\Delta K$, which is a smaller amount [for 
example, if $\Delta K = 1$, then $2 < \chi^2_{0.05}(1) = 3.84$]. The AIC penalty is too 
small. The BIC gives a larger model-size penalty and is generally better, 
especially if smaller models are desired. 


These quantities are easily displayed using the estat ic command. For 
the Poisson regression with the five regressors, including the intercept, we 
have 


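. * Report information criteria 
. estat ic 

Akaike's information criterion and Bayesian information criterion 

----------------------------------------------------------------------------- 
       Model |          N   ll(null)  ll(model)      df        AIC        BIC 
-------------+--------------------------------------------------------------- 
           . |      4,412   -22929.9  -18503.55       5    37017.1   37049.06 
----------------------------------------------------------------------------- 
Note: BIC uses N = number of observations. See [R] BIC note. 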
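
The reported values can be reproduced from the definitions of AIC and BIC by using stored estimation results. This is a short sketch; it relies on e(rank) holding the number of estimated parameters, which is 5 here.

```stata
quietly poisson docvis private chronic female income, vce(robust)

* AIC = -2 ln L + 2K and BIC = -2 ln L + K ln N, with K = e(rank) and N = e(N)
display "AIC = " -2*e(ll) + 2*e(rank)
display "BIC = " -2*e(ll) + e(rank)*ln(e(N))
```

With $\ln L = -18503.55$, $K = 5$, and $N = 4{,}412$, these expressions give approximately 37017.1 and 37049.06, matching the estat ic output.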


The information criteria and LR test require correct specification of the 
density, so they should instead be used after the nbreg command for 
negative binomial estimation because the Poisson density is inappropriate 
for these data. 
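
A side-by-side comparison of the Poisson and negative binomial fits can be sketched as follows. The stored-estimate names m_pois and m_nb are hypothetical, and default standard errors are used because only the log likelihoods matter for the information criteria.

```stata
* Fit both models and store the results
quietly poisson docvis private chronic female income
estimates store m_pois
quietly nbreg docvis private chronic female income
estimates store m_nb

* Tabulate log likelihood, degrees of freedom, AIC, and BIC for both models
estimates stats m_pois m_nb
```

The model with the smaller AIC and BIC is preferred on these criteria.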


It is possible to test one nonnested likelihood-based model against 
another, using the LR test of Vuong (1989). A general discussion of Vuong’s 
test is given in Greene (2018, 751) and Cameron and Trivedi (2005, 280-283). 
Subsequent research by Shi (2015) shows that the Vuong test can have 
size distortion and proposes a better test. The journal website includes Stata 
code to implement this better test. 


13.8.3 Residuals 


Analysis of residuals can be a useful diagnostic tool, as demonstrated in 
sections 3.6 and 6.3 for the linear model. 


In nonlinear models, several different residuals can be computed. The 
use of residuals and methods for computing residuals vary with the model 
and estimator. A natural starting point is the raw residual, $y_i - \widehat{y}_i$. In 
nonlinear models, this is likely to be heteroskedastic, leading to use of the 
Pearson residual $(y_i - \widehat{y}_i)/\widehat{\sigma}_i$, where $\widehat{\sigma}_i^2$ is an estimate of $\mathrm{Var}(y_i|\mathbf{x}_i)$. The 
Pearson statistic is the sum of squared Pearson residuals. 


For GLMs, defined in section 13.3.7, it is common to use the deviance 
residual $d_i = \mathrm{sign}(y_i - \widehat{\mu}_i)\sqrt{2\{l(y_i) - l(\widehat{\mu}_i)\}}$, where $l(y)$ is the log density 
evaluated at $\mu = y$ and $l(\widehat{\mu})$ is the log density evaluated at $\mu = \widehat{\mu}$. For 
homoskedastic normal errors, this reduces to the raw residual. The deviance 
statistic is the sum of the squared deviance residuals and is the 
GLM generalization of the sum of squared residuals. 


For the Poisson model fit with the glm command, various options of the 
predict postestimation command yield various residuals, and the mu option 
gives the predicted conditional mean. We have 


. * Various residuals after command glm 
. qui glm docvis private chronic female income, family(poisson) vce(robust) 

. predict mu, mu 

. generate uraw = docvis - mu 

. predict upearson, pearson 

. predict udeviance, deviance 

. predict uanscombe, anscombe 

. summarize uraw upearson udeviance uanscombe 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
        uraw |      4,412   -3.34e-08    7.408631  -13.31806   125.3078 
    upearson |      4,412   -.0102445    3.598716  -3.519644   91.16232 
   udeviance |      4,412   -.5619514    2.462038  -4.742847   24.78259 
   uanscombe |      4,412   -.5995917    2.545318   -5.03055   28.39791 


The Pearson residual has a standard deviation much greater than the 
expected value of 1 because it uses $\widehat{\sigma}_i^2 = \widehat{\mu}_i$, when in fact there is 
overdispersion and $\sigma_i^2$ is several times this. The Anscombe residual uses the 
transformation of y closest to normality and then standardizes to mean 0 and 
variance 1. The deviance and Anscombe residuals are quite similar. The 
various residuals differ mainly in their scaling; for these data, the pairwise 
correlations between the residuals exceed 0.92. 
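
These statements can be checked with a few additional commands. The sketch below assumes that the variables mu, uraw, upearson, udeviance, and uanscombe created in the preceding listing are still in memory; upearson_chk and upearson_sq are hypothetical names. It rebuilds the Pearson residual from its definition with the Poisson variance $\widehat{\mu}_i$, computes the Pearson statistic divided by $N - K$ as a rough gauge of overdispersion, and displays the pairwise correlations.

```stata
* Pearson residual by hand: (y - mu)/sqrt(mu) under the Poisson variance function
generate double upearson_chk = (docvis - mu)/sqrt(mu)
summarize upearson upearson_chk

* Pearson statistic divided by (N - K), with K = 5 estimated parameters;
* a value well above 1 points to overdispersion
generate double upearson_sq = upearson_chk^2
quietly summarize upearson_sq
display "Pearson statistic/(N-K) = " r(sum)/(r(N) - 5)

* Pairwise correlations among the four residuals
correlate uraw upearson udeviance uanscombe
```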


Other options of predict after glm allow for adjustment of deviance 
residuals and the standardizing and studentizing of the various residuals. The 
cooksd and hat options aid in finding outlying and influential observations, 
as for the linear model. For definitions of all of these quantities, see [R] glm 
postestimation or a reference book on GLMs. 


Another class of models where residuals are often used as a diagnostic 
are survival data models; these specialized residuals are presented in 
section 21.4.4. 


13.8.4 Model-specification tests 


Most estimation commands include a test of overall significance in the 
header output above the table of estimated coefficients. This is a test of joint 
significance of all the regressors. Some estimation commands provide 
further tests in the output. For example, the mixed command includes an LR 
test against the linear regression model; see sections 6.7 and 22.6. 


More tests may be requested as postestimation commands. Some 
commands such as linktest, to test model specification, are available after 
most commands. More model-specific commands begin with estat. For 
example, the poisson postestimation command estat gof provides a 
goodness-of-fit test. 
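
As an illustration, both commands can be run after the doctor-visits Poisson model. This is a minimal sketch that uses default standard errors; interpretation of the output is deferred to chapter 11 and the model-specific chapters.

```stata
quietly poisson docvis private chronic female income

* Goodness-of-fit tests for the Poisson model
estat gof

* Link test: refits the model on the fitted index (_hat) and its square (_hatsq);
* a significant _hatsq suggests misspecification of the conditional mean
linktest
```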


Discussion of model-specification tests is given in chapter 11 and in the 
model-specific chapters. 


13.9 Clustered data 


In many microeconometrics applications, observations can be grouped into 
clusters, with observations in the same cluster correlated, while observations 
in different clusters are uncorrelated. For linear models, the various 
estimation methods and methods for computing standard errors when there is 
within-cluster correlation were presented in detail in sections 6.4—6.7. We 
now briefly present extension of these methods to nonlinear models, with 
emphasis on methods for parametric models such as logit, probit and 
Poisson. These methods are presented further in many of the subsequent 
models chapters, and similar methods are presented in greater detail for 
panel data where clustering is on the individual; see chapter 22. 


The Poisson model is used as illustration. The analysis here generally 
extends to other nonlinear models, with two notable exceptions. First, for the 
Poisson, the population-averaged and random-effects (RE) estimators have 
the same probability limits. Second, for the Poisson, it is possible to obtain 
consistent estimates of the fixed-effects (FE) model when there are few 
observations per cluster. 


13.9.1 Clustered dataset 


We use the same Vietnamese dataset and variables as used in chapter 6; see 
section 6.4.1. The dependent variable, the number of direct pharmacy visits 
(pharvis), is a count variable, so Poisson regression is most appropriate. The 
regressors are the logarithm of household medical expenditures (lnhhexp) 
and the number of illnesses (illness). Clustering is on the commune variable 
that identifies the 194 separate villages. 


. * Read in Vietnam clustered data and delete one household in two communes 
. qui use mus206vlss, clear 

. drop if lnhhexp > 2.579681 & lnhhexp < 2.579683 
(12 observations deleted) 

. drop if missing(lnhhexp) 
(1 observation deleted) 

. summarize pharvis lnhhexp illness commune 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
     pharvis |     27,753    .5110439    1.312606          0         30 
     lnhhexp |     27,753     2.60262    .6245493   .0467014   5.405502 
     illness |     27,753    .6218427    .8995018          0          9 
     commune |     27,753     101.514    56.27264          1        194 


13.9.2 Pooled or population-averaged models 


For individual $i$ in cluster $g$ (of $G$ clusters), a parametric model for the 
conditional density of $y_{ig}$ given regressors is typically of the single-index 
form 

$$
f(y_{ig}|\mathbf{x}_{ig}) = f(\alpha + \mathbf{x}_{ig}'\boldsymbol{\beta}, \gamma), \quad i = 1,\ldots,N_g, \quad g = 1,\ldots,G
$$

where we have separated out the intercept from other regressors and $\gamma$ 
denotes additional parameters such as variance parameters. 


The pooled or population-averaged approach continues to assume that 
this conditional density for a single observation is correct, despite within- 
cluster correlation, but corrects standard errors for within-cluster correlation. 


Pooled Poisson 


The pooled Poisson estimator is the usual Poisson estimator, with 
cluster-robust standard errors used rather than default or 
heteroskedastic-robust standard errors. We have 


. * Poisson estimation with cluster-robust standard errors clustered on commune 
. poisson pharvis lnhhexp illness, nolog vce(cluster commune) 

Poisson regression                              Number of obs =  27,753 
                                                Wald chi2(2)  =  591.84 
                                                Prob > chi2   =  0.0000 
Log pseudolikelihood = -26217.797               Pseudo R2     =  0.1769 

                              (Std. err. adjusted for 194 clusters in commune) 
             |               Robust 
     pharvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     lnhhexp |   .0175333   .0472699     0.37   0.711     -.075114    .1101806 
     illness |   .6806127   .0286861    23.73   0.000      .624389    .7368365 
       _cons |  -1.412108   .1451221    -9.73   0.000    -1.696542   -1.127674 


Controlling for within-commune correlation leads to larger standard errors, 
smaller $t$ statistics, and larger $p$-values. From output not given, the default 
standard errors for variables lnhhexp and illness are, respectively, 0.0135 
and 0.0053, whereas the heteroskedastic-robust standard errors are 0.0268 
and 0.0264. 
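
The three sets of standard errors can be placed side by side with stored estimates. This is a minimal sketch; the names se_default, se_robust, and se_cluster are hypothetical, and the Vietnam dataset is assumed to be in memory.

```stata
* Same point estimates, three different variance estimators
quietly poisson pharvis lnhhexp illness
estimates store se_default
quietly poisson pharvis lnhhexp illness, vce(robust)
estimates store se_robust
quietly poisson pharvis lnhhexp illness, vce(cluster commune)
estimates store se_cluster

* Coefficients with standard errors beneath them, one column per estimator
estimates table se_default se_robust se_cluster, b(%9.4f) se(%9.4f)
```

The coefficient rows are identical across columns; only the standard errors change.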


Generalized estimating equations 


As detailed in section 13.3.1, consistency of the Poisson quasi-MLE requires 
only that the conditional mean be correctly specified; the distribution need 
not be the Poisson. This robustness property is shared by a range of 
commonly used models, including logit and probit, that the statistics 
literature calls GLMs. 


For clustered data, the GLM framework is extended to the generalized 
estimating equations framework. This provides feasible generalized least- 
squares estimation of nonlinear models given specification of a model for the 
within-cluster correlation to obtain potentially more efficient parameter 
estimates. One can again obtain cluster-robust standard-error estimates that 
guard against misspecification of the model for within-cluster correlation, 
provided there are many clusters. 


A natural model for within-cluster correlation is a model of 
equicorrelation with $\mathrm{Cor}(y_{ig}, y_{jg}) = \rho$ for individuals $i$ and $j$ in the same 
cluster. This assumption is also called one of exchangeability. 


The model can be fit using the xtgee command detailed in 
section 22.4.4. We have 


. * Generalized estimating equation estimation with cluster-robust st. errors 
> clustered on commune 
. xtset commune 

Panel variable: commune (unbalanced) 

. xtgee pharvis lnhhexp illness, nolog family(poisson) 
>     corr(exchangeable) vce(robust) 

GEE population-averaged model                   Number of obs    = 27,753 
Group variable: commune                         Number of groups =    194 
Family: Poisson                                 Obs per group: 
Link:   Log                                                  min =     51 
Correlation: exchangeable                                    avg =  143.1 
                                                             max =    206 
                                                Wald chi2(2)     = 515.86 
Scale parameter = 1                             Prob > chi2      = 0.0000 

                              (Std. err. adjusted for clustering on commune) 
             |               Robust 
     pharvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     lnhhexp |  -.0305989   .0078615    -3.89   0.000    -.0460072   -.0151906 
     illness |   .2697244    .011886    22.69   0.000     .2464282    .2930207 
       _cons |   .9510729   .0793079    11.99   0.000     .7956323    1.106513 


The parameter estimates in this example with a very simple model are much 
more precisely fit and considerably different from those obtained using the 
poisson command. This difference is due to the very high estimate of 
within-cluster correlation: the postestimation command estat wcorrelation 
yields $\widehat{\rho} = 0.879$. The use of the vce(robust) option following an xt 
command leads to cluster-robust standard errors being reported. In this 
example, there is a substantial efficiency gain. 
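
The estimated within-cluster correlation can be displayed as follows. This is a minimal sketch assuming the data have been xtset on commune; the compact option is used so that only the estimated correlation parameter is shown rather than the full working correlation matrix, which would be very large with up to 206 observations per commune.

```stata
quietly xtgee pharvis lnhhexp illness, family(poisson) corr(exchangeable) vce(robust)

* Display only the parameter of the estimated within-cluster correlation matrix
estat wcorrelation, compact
```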


13.9.3 RE and mixed models 


An RE model allows the intercept to vary across clusters, so the conditional 
density for observations in the $g$th cluster becomes 


$$f(y_{ig}|x_{ig}, \alpha_g) = f(\alpha_g + x_{ig}'\beta, \gamma), \quad i = 1, \ldots, N_g, \quad g = 1, \ldots, G \tag{13.14}$$


The standard assumption is that $\alpha_g$ is normally distributed with mean 0 and 
variance $\sigma_\alpha^2$. 


It is assumed that observations within cluster are independent given 
addition of the cluster-specific intercept $\alpha_g$. It follows that the joint density 
for all observations in cluster $g$ is the product of the individual densities. 
Integrating out $\alpha_g$ yields 


$$f(y_{1g}, \ldots, y_{N_g g}|x_{1g}, \ldots, x_{N_g g}, \beta, \gamma, \sigma_\alpha^2) = \int \left\{ \prod_{i=1}^{N_g} f(y_{ig}|x_{ig}, \alpha_g, \beta, \gamma) \right\} \phi(\alpha_g|\sigma_\alpha^2)\, d\alpha_g \tag{13.15}$$


where $\phi(\alpha_g|\sigma_\alpha^2)$ is the $N(0, \sigma_\alpha^2)$ density. This one-dimensional integral is 
computed using Gauss-Hermite quadrature. 


For the Poisson RE model, $y_{ig}$ has Poisson distribution with mean 
$\exp(x_{ig}'\beta + \alpha_g)$, where $\alpha_g$ is $N(0, \sigma_\alpha^2)$ distributed. We use the xtpoisson 
command, a command with syntax similar to the xtreg command, with the 
re and normal options to obtain 


. * RE estimation with cluster--robust st. errors clustered on commune 
. xtpoisson pharvis lnhhexp illness, re normal nolog vce(robust) 


Calculating robust standard errors ... 


Random-effects Poisson regression Number of obs = 27,753 
Group variable: commune Number of groups = 194 

Random effects u_i ~ Gaussian Obs per group: 
min = 51 
avg = 143.1 
max = 206 
Integration method: mvaghermite Integration pts. = 12 
Wald chi2(2) = 1141.31 
Log pseudolikelihood = -24324.99 Prob > chi2 = 0.0000 
(Std. err. adjusted for 194 clusters in commune) 

             |               Robust
     pharvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
     lnhhexp |  -.1422882   .0404454    -3.52   0.000    -.2215596   -.0630168
     illness |   .7151174   .0213484    33.50   0.000     .6732753    .7569596
       _cons |  -1.234728   .1177822   -10.48   0.000    -1.465577   -1.003879
    /lnsig2u |  -.8734784   .1612943                     -1.189609   -.5573474
     sigma_u |   .6461399   .0521093                      .5516703    .7567868


Compared with results obtained using the poisson command, the coefficient 
estimates have smaller standard errors, and the coefficient of lnhhexp is 
negative and statistically significant. 


The same results, aside from small numerical differences, are obtained 
using the mepoisson command, with only the intercept varying. 


. * Mixed estimation with cluster--robust st. errors clustered on commune 


. mepoisson pharvis lnhhexp illness || commune: , nolog vce(robust) 
Mixed-effects Poisson regression Number of obs = 27,753 
Group variable: commune Number of groups = 194 
Obs per group: 
min = 51 
avg = 143.1 
max = 206 
Integration method: mvaghermite Integration pts. = 7 
Wald chi2(2) = 1141.34 
Log pseudolikelihood = -24324.99 Prob > chi2 = 0.0000 
(Std. err. adjusted for 194 clusters in commune) 

             |               Robust
     pharvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
     lnhhexp |   -.142289   .0404443    -3.52   0.000    -.2215583   -.0630197
     illness |   .7151175   .0213481    33.50   0.000     .6732759     .756959
       _cons |  -1.234721   .1177576   -10.49   0.000    -1.465522   -1.003921
commune      |
  var(_cons) |   .4175234   .0673298                        .30438    .5727242
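

As a quick consistency check on the two sets of output, note that xtpoisson reports the log of the random-effect variance (/lnsig2u) and the standard deviation (sigma_u), whereas mepoisson reports the variance directly as var(_cons). Using only the estimates printed above, 

$$\hat{\sigma}_u^2 = \exp(-0.8734784) \approx 0.4175, \qquad 0.6461399^2 \approx 0.4175,$$

which agrees with the mepoisson estimate var(_cons) = 0.4175234 to four decimal places. The remaining difference reflects the different numbers of integration points (12 versus 7) used by the two commands. 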


Mixed-effects model commands such as mepoisson additionally allow 
slope coefficients to vary across clusters; see section 23.4. 


RE models that allow for endogeneity or sample selection, or both, can be 
fit using the extended regression model commands xteregress, xteprobit, 
xteoprobit, and xtintreg; see section 23.7. 


In the special case of the Poisson, an alternative RE model specifies $y_{ig}$ to 
be Poisson distributed with mean $\exp(x_{ig}'\beta + \alpha_g)$, where $\exp(\alpha_g)$ is gamma 
distributed. In that case, there is a closed-form solution to the integral 
(13.15). This estimator, presented in section 22.6.5, is obtained using the re 
option of the xtpoisson command. 


13.9.4 FE model 


The FE model treats $\alpha_g$ in (13.14) as a cluster-specific fixed effect that, 
unlike in the RE model, may be correlated with regressors. This then 
introduces $G$ additional parameters $\alpha_1, \ldots, \alpha_G$. 


In general, consistent estimation of the parameters $\beta$ requires consistent 
estimation of each parameter $\alpha_g$, which in turn requires $N_g \to \infty$. So in 
general, consistent estimation of the FE model requires many observations 
per cluster, a limitation known as the incidental parameters problem. There 
are three notable exceptions to this limitation, namely, the linear regression 
model, the Poisson model, and the logit model. For those models, one can 
consistently estimate the parameters $\beta$ in an FE model even if there are few 
observations per cluster. 


Even when the parameters $\beta$ can be consistently estimated, consistent 
estimates of MEs and predictions cannot be obtained if the fixed effects $\alpha_g$ 
enter nonlinearly and are not consistently estimated. Thus, if 
$E(y_{ig}|x_{ig}, \alpha_g) = g(\alpha_g + x_{ig}'\beta)$, for example, we need a consistent estimate 
of $\alpha_g$. At the same time, if regressors enter in this index form, then we at 
least know the relative effects of regressors, so if $\beta_j = 2\beta_k$, then the ME of an 
increase in the $j$th regressor is twice that of the $k$th regressor. And the sign 
of the effect can be determined if $g(\cdot)$ is monotonic. 
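

The relative-effects claim follows from the chain rule. As a short worked step, writing $x_{j,ig}$ for the $j$th regressor (a subscript convention assumed here for illustration) and treating the regressors as continuous, 

$$\frac{\partial E(y_{ig}|x_{ig}, \alpha_g)}{\partial x_{j,ig}} = g'(\alpha_g + x_{ig}'\beta)\,\beta_j, \qquad \text{so that} \qquad \frac{\partial E(y_{ig}|x_{ig}, \alpha_g)/\partial x_{j,ig}}{\partial E(y_{ig}|x_{ig}, \alpha_g)/\partial x_{k,ig}} = \frac{\beta_j}{\beta_k}.$$

The factor $g'(\alpha_g + x_{ig}'\beta)$, which depends on the unknown $\alpha_g$, cancels from the ratio, so relative MEs are identified even though the level of each ME is not. If $g(\cdot)$ is monotonic increasing, then $g'(\cdot) > 0$ and the sign of each ME is the sign of the corresponding coefficient. 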


For the Poisson FE model, we use the xtpoisson command with the fe 
option to obtain 


. * FE estimation with cluster--robust st. errors clustering on commune 
. xtpoisson pharvis lnhhexp illness, fe nolog vce(robust) 
note: 1 group (94 obs) dropped because of all zero outcomes 


Conditional fixed-effects Poisson regression Number of obs = 27,659 

Group variable: commune Number of groups = 193 
Obs per group: 

min = 51 

avg = 143.3 

max = 206 

Wald chi2(2) = 1125.50 


Log pseudolikelihood = -23340.71 Prob > chi2 = 0.0000 


(Std. err. adjusted for clustering on commune) 


             |               Robust
     pharvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
     lnhhexp |  -.1546582   .0411747    -3.76   0.000    -.2353591   -.0739573
     illness |    .716148   .0215378    33.25   0.000     .6739347    .7583614


In this example, the results are fairly similar to those for the Poisson RE 
model. 


An alternative brute-force estimation procedure includes indicator or 
dummy variables for each cluster as regressors. In the current example, this 
leads to the inclusion of 192 additional regressors, so to suppress output, we 
use the quietly prefix in estimation and the estimates store and estimates 
table commands. We obtain 


. * Dummy variables estimation with robust st. errors clustered on commune 
. qui poisson pharvis lnhhexp illness i.commune, vce(cluster commune) 


. estimates store POISSONDV 
. estimates table POISSONDV, keep(lnhhexp illness) b(%10.6f) se stats(N) 


    Variable |  POISSONDV
     lnhhexp |  -0.154657
             |   0.041281
     illness |   0.716149
             |   0.021594
           N |      27753

                Legend: b/se


We obtain the same results, aside from rounding error in computation, as 
those for the xtpoisson, fe command. 


Note that while this latter method can always be implemented, subject to 
computational limits if there are too many clusters, for all estimators aside 
from OLS and Poisson, consistent estimation requires many observations, say, 
at least 30, per cluster. Recent research has proposed methods that correct for 
this bias, or more precisely, the inconsistency, that arises when there are few 
observations per cluster. The community-contributed commands xtprobitfe 
and xtlogitfe (Cruz-Gonzalez, Fernandez-Val, and Weidner 2017) do so for 
probit and logit models; see section 22.4.8. 


13.9.5 Correlated RE model 


An alternative approach when fixed effects are present is to use the 
correlated RE model, introduced for linear panel models in section 8.7.4 and 
presented for nonlinear panel models in section 22.2.4. Then 
$\alpha_g = \bar{x}_{1g}'\gamma + \eta_g$, where $\eta_g$ is a random effect and $x_{1g}$ denotes just those 
regressors that vary within cluster. 
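

To see what this implies for estimation, consider the Poisson case. As a worked step, write the auxiliary equation for the cluster effect as $\alpha_g = \bar{x}_{1g}'\gamma + \eta_g$, with $\gamma$ denoting the coefficients on the cluster means (the symbol is an assumption of this sketch). Substituting into the Poisson conditional mean gives 

$$E(y_{ig}|x_{ig}, \bar{x}_{1g}, \eta_g) = \exp(x_{ig}'\beta + \alpha_g) = \exp(x_{ig}'\beta + \bar{x}_{1g}'\gamma + \eta_g).$$

The cluster means of the within-varying regressors therefore enter as additional regressors, and the remaining effect $\eta_g$ is treated as a random effect, so a standard RE command can be applied to the augmented regressor list. 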


For the current example, using the xtpoisson, re command, we obtain 


. * Correlated RE estimation with cluster-specific effects 
. bysort commune: egen avelnhhexp = mean(lnhhexp) 


. by commune: egen aveillness = mean(illness) 


. xtpoisson pharvis lnhhexp illness avelnhhexp aveillness, re nolog vce(robust) 


Random-effects Poisson regression Number of obs = 27,753 

Group variable: commune Number of groups = 194 
Random effects u_i ~ Gamma Obs per group: 

min = 51 

avg = 143.1 

max = 206 

Wald chi2(4) = 1691.40 

Log pseudolikelihood = -24304.828 Prob > chi2 = 0.0000 


(Std. err. adjusted for clustering on commune) 


             |               Robust
     pharvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
     lnhhexp |  -.1550693    .041469    -3.74   0.000    -.2363472   -.0737915
     illness |   .7139736   .0394978    18.08   0.000     .6365593     .791388
  avelnhhexp |   .4284912   .1016099     4.22   0.000     .2293395    .6276428
  aveillness |   .3847358   .1810005     2.13   0.034     .0299814    .7394902
       _cons |  -2.392214   .3382878    -7.07   0.000    -3.055246   -1.729182
    /lnalpha |   -1.12778   15.67529                     -31.85078    29.59522
       alpha |   .3237511   5.074892                      1.47e-14    7.13e+12

LR test of alpha=0: chibar2(01) = 3494.51 Prob >= chibar2 = 0.000 


The coefficients of -0.1551 and 0.7140 are very close to the Poisson FE 
estimates of -0.1547 and 0.7161, though, unlike the linear case, they are 
not identical. Using normally distributed random effects (the re and normal 
options of xtpoisson) yields very similar results. 


Some practitioners advocate this as a simple way to control for cluster- 
specific fixed effects in models for which, unlike the linear and Poisson 
models, FE estimation is not possible when there are few observations per 
cluster. 


13.10 Additional resources 


A complete listing of estimation commands can be obtained by typing help 
estimation commands. For poisson, for example, see the entries 
[R] poisson and [R] poisson postestimation and the corresponding online 
help. Useful Stata commands include predict, predictnl, margins, 
lincom, and nlcom. 


Graduate econometrics texts give considerable detail on estimation of 
nonlinear models and less on prediction and computation of MEs. See 
[R] margins, and for details on unconditional standard errors of MEs, see 
Obtaining margins with survey data and representative samples and 
Methods and formulas. 


13.11 Exercises 


1. Fit the Poisson regression model of section 13.3 by using the poisson, 
nl, and glm commands. In each case, report default standard errors and 
robust standard errors. Use the estimates store and estimates table 
commands to produce a table with the six sets of output, and discuss. 


2. In this exercise, we use the medical expenditure data of section 3.5 
with the dependent variable y = totexp/1000 and regressors the same 
as those in section 3.5. We suppose that $E(y|x) = \exp(x'\beta)$, which 
ensures that $E(y|x) > 0$. The obvious estimator is NLS, but the Poisson 
MLE is also consistent if $E(y|x) = \exp(x'\beta)$ and does not require that 
$y$ be integer values. Repeat the analysis of question 1 with these data. 


3. Use the same medical expenditure data as in exercise 2. Compare the 
different standard errors obtained with poisson's vce() option with 
the vcetypes oim, opg, robust, cluster clustvar, and bootstrap. For 
clustered standard errors, cluster on age. Comment on your results. 


4. Consider Poisson regression of docvis on an intercept, private, 
chronic, and income. This is the model of section 13.5, except female 
is dropped. Find the following: the sample average prediction of 
docvis; the average prediction if we use the Poisson estimates to 
predict docvis for males only; the prediction for someone who is 
privately insured, has a chronic condition, and has an income of 
$20,000 (so income=20); and the sample average of the residuals. 


5. Continue with the same data and regression model as in exercise 4. 
Provide a direct interpretation of the estimated Poisson coefficients. 
Find the ME of changing regressors on the conditional mean in the 
following ways: the MEM using calculus methods for continuous 
regressors and finite-difference methods for discrete regressors; the 
MEM using calculus methods for all regressors; the AME using calculus 
methods for continuous regressors and finite-difference methods for 
discrete regressors; the AME using calculus methods for all regressors; 
and the MER for someone who is privately insured, has a chronic 
condition, and has an income of $20,000 (so income=20). 


6. Consider the following simulated data example. Generate 100 
observations of $y \sim N(0, 0.001^2)$, first setting the seed to 10101. Let 
$x = \sqrt{y}$. Regress $y$ on an intercept and $x$. Calculate the pseudo-$R^2$ 
defined in (13.13). Is this necessarily a good measure when data are 
continuous? 

7. Consider the negative binomial regression of docvis on an intercept, 
private, chronic, female, and income, using the nbreg command 
(replace poisson by nbreg). Compare this model with one with 
income excluded. Which model do you prefer on the grounds of 1) AIC, 
2) BIC, and 3) an LR test using the lrtest command? 


Chapter 14 
Flexible regression: Finite mixtures and 
nonparametric 


14.1 Introduction 


In this chapter, we consider regression models more flexible than the linear 
regression model for the conditional mean or the fully parametric linear 
regression with normally distributed errors. 


We begin with flexible fully parametric models for the conditional 
density of y given x based on finite mixtures of the normal distribution; see 
Ahamada and Flachaire (2010). This same approach, one that has a long 
history in statistics, can be applied to other types of data. Finite mixtures for 
a Poisson model for count data are presented in section 20.5, and many 
examples of finite mixture models (FMM) are presented in section 23.3. 


We then consider flexible methods to model the conditional mean in the 
linear model. One way to do so is to use polynomials in underlying 
variables. The most obvious way to do so is to use global polynomials that 
fit a single polynomial model throughout the range of the regressors. For 
example, a cubic model specifies $E(y|x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$. 
Even more flexibility can be obtained by using higher-order polynomials, 
but these can be overly responsive to outlying observations. 


An alternative approach, widely used in applied statistics, is to flexibly 
model the conditional mean using regression splines. For example, a cubic 
spline fits separate cubic polynomials over subranges of x in such a way 
that these separate polynomials are connected (or splined) and lead to a 
continuous function for E(y|x). 
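

To make the idea concrete, here is one common way to write a cubic spline with a single knot at $x = k$, using the truncated power basis. This is a sketch for illustration; the knot $k$ and the coefficient labels are assumptions here, and software typically uses a different but equivalent basis. 

$$E(y|x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - k)_+^3, \qquad (u)_+ = \max(u, 0).$$

For $x \le k$, the last term is zero and the model is an ordinary cubic. For $x > k$, the cubic coefficient changes by $\beta_4$. Because $(x - k)_+^3$ and its first two derivatives all equal zero at $x = k$, the two cubic pieces join smoothly at the knot. Each additional knot adds one more truncated cubic term. 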


While we present global polynomials and splines for models with 
conditional mean of the form $z'\gamma$, these methods generalize to other models 
for the way that regressors enter, such as logit or probit or the exponential 
conditional mean function $\exp(z'\gamma)$. 


Finally, we introduce nonparametric modeling of the conditional mean 
in the case of a single regressor. Then no functional form is specified for 
E(y|x). Instead, estimation is by kernel-weighted methods such as local 
polynomial regression or by series methods such as polynomials or 
regression splines where the polynomial degree or the number of knots is 
data determined. The chapter also introduces semiparametric models where 
the conditional mean is partially parameterized. Nonparametric regression 
methods for single and multiple regressors, and semiparametric regression 
methods, are presented in much greater detail in chapter 27. 


14.2 Models based on finite mixtures 


The essential motivation behind a mixture model is that it departs from the 
assumption of a parametric data-generating process (DGP) based on a single 
distribution. By contrast, a mixture of distributions is generally more flexible 
and likely to provide a better approximation to the true unknown 
distribution. Adding components to the model may improve the statistical fit 
of the specified model. A second motivation is that a mixture model can 
capture discrete population heterogeneity by allowing the different 
components of the mixture to have different moments or moment functions. 
Finally, when mixture models are fit by maximum likelihood (ML), the 
estimators have certain optimality properties, and inference can proceed 
along well-established lines. 


14.2.1 Univariate case 


The main focus here will be on regression mixtures. But to aid intuition, we 
shall also briefly cover univariate mixtures. 


An example of a univariate two-component mixture of normals is 
$\pi N(\mu_1, \sigma_1^2) + (1 - \pi) N(\mu_2, \sigma_2^2)$, which mixes draws from subpopulations 
$N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ in proportions $\pi$ and $1 - \pi$, respectively. By 
allowing the location and scale parameters to differ across subpopulations, 
the mixture model captures discrete population heterogeneity. 


Note that the mixture model is a weighted sum of densities; it is not a 
weighted sum of random variables. 


The following example contrasts an FMM, a mixture of $N(4, 1)$ with 
probability 0.75 and $N(8, 1)$ with probability 0.25, with a weighted sum of 
normally distributed random variables, the random variable 
$Z = 0.75X + 0.25Y$, where $X \sim N(4, 1)$ and $Y \sim N(8, 1)$. 


. * Compare mixture of normal densities with r.v. that is weighted sum of normals 
. set obs 10000 
Number of observations (_N) was 0, now 10,000. 


. generate y_finite_mixture = rnormal(4,1) 


. replace y_finite_mixture = rnormal(8,1) if runiform()>0.75 
(2,516 real changes made) 


. kdensity y_finite_mixture, title("Finite mixture of normals") note(" ") 
. generate y_weighted_sum = 0.75*rnormal(4,1)+0.25*rnormal(8,1) 
. kdensity y_weighted_sum, title("Weighted sum of normals") note(" ") 


. sum y_finite_mixture y_weighted_sum 


    Variable |        Obs        Mean    Std. dev.       Min        Max
y_finite_m~e |     10,000    5.011067    2.008076   .2235037   11.00953
y_weighted~m |     10,000    5.004159    .7893892   2.164636   8.140759


From the summary statistics, the two random variables appear to have the 
same mean but different standard deviations. We leave it as an exercise to 
derive the theoretical means and variances of the two random variables. 


From the first panel of figure 14.1, we see the finite mixture distribution 
is bimodal, with peaks around 4 and 8. The second panel of the figure shows 
that the weighted sum of independent normal random variables is normally 
distributed with mean of around 5. 


[Figure: two kernel density panels, "Finite mixture of normals" (horizontal axis y_finite_mixture) and "Weighted sum of normals" (horizontal axis y_weighted_sum)] 


Figure 14.1. Finite mixture compared with weighted sum of 
normal random variables 


Formally, a finite mixture of $C$ distributions is written as 


$$f(y_i|\theta_1, \ldots, \theta_C) = \sum_{j=1}^{C} \pi_j f_j(y_i|\theta_j), \qquad \sum_{j=1}^{C} \pi_j = 1, \qquad 0 < \pi_j < 1$$


Typically, the mixture components $f_j(\cdot)$ of a mixture are in the same 
parametric family, although formally there is no requirement that they 
should be. A special case with $C = 2$ is 
$f(y_i|\theta_1, \theta_2) = \pi_1 f_1(y_i|\theta_1) + (1 - \pi_1) f_2(y_i|\theta_2)$. A mixture model with a 
finite number of components is called an FMM. Because the mixture 
components are hidden, not directly observed, the model is also known as 
the latent class model. The mixture model is often interpreted as a 
representation of different "types" in the population, each type 
corresponding to a different component. 


In practice, the number of components, denoted by C, is another 
unknown parameter. Rather than directly estimating C, one fits the model 
conditionally on an assumed value of C. This leads to the problem of 
choosing the best value of C according to some criterion. 


We may think of each distribution as specific to one class. The 
probability weight of each class is a prior probability. As the sample size 
grows, we expect to get more observations from each component 
distribution, so it becomes easier to identify additional mixture components. 


14.2.2 Regression case 


In the regression case, a finite mixture of $C$ distributions is written as 


$$f(y_i|x_i, \theta_1, \ldots, \theta_C) = \sum_{j=1}^{C} \pi_j f_j(y_i|x_i, \theta_j), \qquad \sum_{j=1}^{C} \pi_j = 1, \qquad 0 < \pi_j < 1$$


where in the simplest case the $\pi_j$ are scalar parameters to be estimated. 


Further flexibility of this specification results from a generalization that 
treats the $\pi_j$ as a function of observable variables, say, $z$. To ensure that the 
weights lie between zero and one, and sum to one, we use a multinomial 
logit model. We replace $\pi_j$ by $\pi_{ji}$, where 


$$\pi_{ji} = \frac{\exp(z_i'\gamma_j)}{1 + \exp(z_i'\gamma_2) + \cdots + \exp(z_i'\gamma_C)}$$


and we use the normalization $\gamma_1 = 0$. The conditional mean for the $i$th 
individual is a weighted sum, 


$$E(y_i|x_i) = \sum_{j=1}^{C} \pi_{ji}\,\mu_{ji}$$


where $\mu_{ji} = E_j(y_i|x_i)$ is the conditional mean of the $j$th group. More 
generally, the $k$th raw (noncentral) moment $E(y_i^k|x_i)$ is a weighted sum of the $k$th 
raw moment of each component density. 


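For the simpler case where $\pi_{ji} = \pi_j$ for all $i$, the overall marginal effect 
is a weighted sum of the marginal effects for the $C$ types: 


$$\frac{\partial E(y_i|x_i)}{\partial x_i} = \sum_{j=1}^{C} \pi_j \frac{\partial E_j(y_i|x_i)}{\partial x_i}$$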

FMMs are closely related to intrinsic classification models and models 
with clustering. Endogenous switching models in macroeconomics also have 
a finite mixture interpretation; see Frühwirth-Schnatter (2006). The mixture 
model is easier to interpret when the number of components is small and the 
components are well separated, that is, do not overlap much. However, such 
ease of interpretation will usually diminish as more components are added to 
obtain a better fit to the data. 


14.2.3 Regression example 


As an example, we generate data for a mixture of three distributions that 
depend on a single regressor x. We have 


. * DGP for regression 3-component mixture of normals: Well-separated components 
. clear all 


. set obs 10000 
Number of observations (_N) was 0, now 10,000. 


. set seed 10101 

. generate x = runiform() 

. generate y = 1 + 1*x + rnormal() 
. generate class = runiform() 


. replace y = 4 + 4*x + .8*rnormal() if class > 0.5 
(4,956 real changes made) 


. replace y = 8 + 8*x + .4*rnormal() if class > 0.8 
(2,049 real changes made) 


This leads to the following summary statistics and density. 


. * Regression mixture of normals: Summary statistics and density 
. summarize y x 


    Variable |        Obs        Mean    Std. dev.       Min        Max
           y |     10,000    4.965995    4.359061   -2.31192   16.81672
           x |     10,000    .4997462     .288546   .0000276   .9999758


. kdensity y, kernel(gaussian) lwidth(medthick) note(" ") scale(1.2) 


From figure 14.2, the three components are still observable, despite the 
presence of the regressor x. More generally, however, the components are 
not as well separated in the presence of regressors. 


[Figure: kernel density estimate of y] 


Figure 14.2. Finite mixture density of three normal distributions 


If one fits a one-component normal regression by ordinary least-squares 
(OLS) to these data generated by a mixture of normals, the estimated 
conditional mean is a weighted sum of the component means; that is, 
$x'\beta = \sum_{j=1}^{C} \pi_j x'\beta_j$. Hence, the one-component model identifies $\beta$, which 
may not be the target parameter. The fitted one-component model is 
interpreted as the closest one-component distribution (in the Kullback-Leibler 
metric) that minimizes the expected distance from the underlying 
mixture model. 
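

As a worked illustration using the DGP of section 14.2.3, the two replace commands imply class probabilities of 0.5, 0.3, and 0.2 (the uniform draw class falls below 0.5, between 0.5 and 0.8, and above 0.8, respectively). The three components have intercepts and slopes of 1, 4, and 8, so the weighted sums are 

$$\sum_{j=1}^{3} \pi_j \beta_j = 0.5(1) + 0.3(4) + 0.2(8) = 3.3$$

for both the intercept and the slope. A one-component OLS regression of y on x is therefore expected to recover approximately $E(y|x) = 3.3 + 3.3x$, which matches none of the three component regressions. Evaluated at $E(x) = 0.5$, this gives an overall mean of $3.3 + 3.3(0.5) = 4.95$, close to the sample mean of y of 4.966 reported in the summary statistics. 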


14.2.4 Modeling considerations 


The mixture representation is invariant with respect to the labeling of the 
components. But a rule for labeling is nevertheless beneficial because it 
enables comparison of the components from specifications with different 
values of C. The criterion we will use will be the conditional mean function 
evaluated at each set of component-specific parameter values. 


As a part of a descriptive overview, it is standard to begin by plotting a 
histogram or a kernel density of the outcome of interest. A multimodal shape 
may provide a hint of the presence of mixture. However, such an eyeball 
test, while potentially useful when considering univariate mixtures, is not 
conclusive in the context of a regression model where the mixture 
hypothesis pertains not to y per se but to y|x, that is, residuals from the 
regression of y on x. Potentially, multimodality of the distribution of 
residuals can be investigated. However, a failure to detect multimodality 
does not necessarily invalidate the mixture hypothesis if the mixture 
components are not well separated. 


The typical way for a practitioner to proceed is to 1) select the mixing 
distribution; 2) fix an initial value of C; 3) fit the model using ML; 
4) increase C; 5) fit the new model; and 6) use a model-selection criterion to 
select the preferred model. 
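

The steps above can be sketched as follows. This is a minimal illustration, assuming the generated variables y and x of section 14.2.3 are in memory and that normal components are the chosen mixing distribution; the stored-estimate names C1, C2, and C3 are hypothetical. Models with more components can be slow to fit or may fail to converge, so this is a template rather than a guaranteed recipe. 


. * Sketch: fit models with C = 1, 2, and 3 normal components 
. fmm 1: regress y x 
. estimates store C1 
. fmm 2: regress y x 
. estimates store C2 
. fmm 3: regress y x 
. estimates store C3 
. * Compare the fitted models using AIC and BIC 
. estimates stats C1 C2 C3 


Information criteria are used here because the usual likelihood-ratio test is problematic for comparing models with different numbers of components: the null hypothesis places a mixing probability on the boundary of the parameter space. 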


14.2.5 The fmm prefix 


The Stata fmm prefix, whose earlier community-contributed version was due 
to Deb (2007), supports ML estimation of finite mixtures of both continuous 
and discrete distributions. The current version includes more than 15 
families of distributions. Further applications of this command are given in 
sections 23.2 and 23.3. 


In this chapter, we consider only mixtures of normal regression models. 
In that case, the applicable model command is regress, and we give 
command 


fmm # [if] [in] [weight] [, fmmopts]: regress depvar [indepvars] [, options] 


where # refers to the number of normal components in the specification. 


For example, for a two-component normal regression model with 
dependent variable y, model regressors in global macro xlist, and constant 
class probability $\pi$, we give command 


. * Example of command fmm for two-component normal model 
. fmm 2: regress y $xlist 


Specifying lcprob() in fmmopts allows the latent class probabilities $\pi_j$ 
to be parameterized as a function of observable variables. For example, if 
these are in global macro lcprobzlist, we give command 


. * Example of command fmm for two-component normal with varying class 
> probabilities 
. fmm 2, lcprob($lcprobzlist): regress y $xlist 


Additional flexibility in specification can be obtained by allowing the 
specification of the latent class or regression function, or both, to depend on 
different sets of regressors for different components of the model. This 
family of models is also known as a “mixture-of-experts” model and has a 
well-established place in machine learning and artificial intelligence 
literature; see Jacobs et al. (1991) and Jordan and Jacobs (1994). 


Interpretation of estimates of FMMs can be difficult. Useful 
postestimation commands include predict, estat lcmean, margins, and 
estat lcprob. 
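

A minimal sketch of how these postestimation commands might be used after a two-component fit is given below. It assumes a dependent variable y and regressors in global macro xlist, as in the example commands above; the stub name postpr is hypothetical. 


. * Sketch: postestimation after a two-component normal mixture 
. quietly fmm 2: regress y $xlist 
. * Estimated latent class probabilities, with confidence intervals 
. estat lcprob 
. * Predicted mean of y within each latent class 
. estat lcmean 
. * Posterior probability of membership in each class, by observation 
. predict postpr*, classposteriorpr 
. summarize postpr1 postpr2 


Reporting results on the probability and mean scales in this way is usually easier to interpret than the raw logit and component-specific coefficients. 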


14.2.6 Computational considerations 


The expectation-maximization algorithm (see Cameron and Trivedi [2005, 
345-347]) is used to find starting values. Then, gradient-based optimization 
methods such as Gauss-Newton are used; see chapter 16 on numerical 
optimization. 


The startvalues() option provides several methods for specifying 
starting values. The default assigns each observation to an initial latent class 
that is determined by running a factor analysis on the specified regressors. 
Alternatively, starting values can be specified using the from() option. 


Gradient-based algorithms work well when the objective function is well 
approximated by a quadratic function in the parameters. For mixture models, 
convergence cannot be guaranteed, because the likelihood function is not 
locally concave, and a warning message is given when this condition arises. 


A failure to fit a specified mixture model may occur for several reasons. 
First, the log likelihood may have several local maximums such that the 
starting values play an important role. Second, the mixture components may 
not be well separated, which would make reliable identification of a 
component difficult. Third, the specified mixture model may have one or 
more components with small associated mixture weights $\pi_j$; this would 
imply that the model is overparameterized and the parameters of at least one 
of the components cannot be identified. The possibility of 
overparameterization is greater when the model has more components and 
also when the mixture-of-experts specification is used. 
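

When a fit fails or appears to stop at a local maximum, one practical response is to vary the starting values. The following is a sketch only, assuming variables y and x are in memory; the suboption values shown are illustrative, and the exact startvalues() and emopts() suboptions should be confirmed in help fmm for the installed version of Stata. 


. * Sketch: random initial class assignments, keeping the best of 5 draws 
. fmm 3, startvalues(randomid, draws(5) seed(10101)): regress y x 
. * Sketch: allow more EM iterations when refining the starting values 
. fmm 3, emopts(iterate(40)): regress y x 


If different starting values lead to different converged log likelihoods, the fit with the highest log likelihood is the natural candidate, though a very small estimated class probability in that fit suggests the model has too many components. 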


When the main motive behind the mixture model is to improve an 
approximation to the underlying distribution, a model with a small number 
of components, such as 2 or 3, has the attraction of being a parsimonious 
specification. However, when the underlying motivation is to finely define 
the structure in terms of "latent types," a more highly parameterized model 
may be preferred. But the empirical success of such an exercise depends 
upon the available sample size and the extent of separation between the 
types. 


14.3 FMM example: Earnings of doctors 


We illustrate empirical application of the fmm prefix using data on the natural 
logarithm of annual earnings of doctors derived from three waves of the 
Australian longitudinal survey, Medicine in Australia: Balancing 
Employment and Life. 


14.3.1 Data and log-linear regression 


The regression model takes log of earnings (in Australian dollars) as the 
dependent variable (logyearn). The regressors are log of annual hours 
worked (logyhrs), a gender indicator (female), the presence of a child under 
5 years of age (childu5), years of work experience (expr), square of expr 
(exprsq), the number of other postgraduate qualifications (pgradoth), and 
the size of the medical practice where the doctor works (pracsize). 


. * Log earnings of doctors example: Read in data and give summary statistics 
. qui use mus21i4mabelfmm, clear 


. global xlist logyhrs female childu5 expr exprsq pgradoth pracsize 


. summarize logyearn $xlist, sep(0) 


    Variable |        Obs        Mean    Std. dev.       Min        Max
    logyearn |      5,384    11.92434    .6931311   7.824046   13.81751
     logyhrs |      5,384    7.514087    .4925595   3.951244   8.556414
      female |      5,384    .4686107    .4990601          0          1
     childu5 |      5,384    .1586181    .3653535          0          1
        expr |      5,384    23.26337    10.58021          0         58
      exprsq |      5,384    653.1045    535.3212          0       3364
    pgradoth |      5,384    .5815379     .788196          0          4
    pracsize |      5,384    3.317979    1.189747          1          5

. tabstat logyearn, stat(mean p50 sd skewness kurtosis) 

    Variable |      Mean       p50        SD  Skewness  Kurtosis
    logyearn |  11.92434  11.95761  .6931311 -.5159499  4.030809


The detailed statistics for logyearn indicate that the normal distribution is a 
reasonable starting point for these data because the distribution is reasonably 
symmetric and the kurtosis statistic of 4.03 is not too far from 3.0. A kernel 
density estimate, not given, is close to the normal, albeit with a longer left 
tail. 


A standard log-linear regression provides a benchmark for comparison 
with FMMs with more than one component. 


. * Standard OLS log-earnings regression (= 1 component FMM) with robust 
> standard errors 
. regress logyearn $xlist, vce(robust) 


Linear regression                               Number of obs     =      5,384
                                                F(7, 5376)        =     431.63
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4456
                                                Root MSE          =     .51643

             |               Robust
    logyearn | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
     logyhrs |   .8035603   .0267656    30.02   0.000     .7510888    .8560318
      female |  -.2607014   .0187799   -13.88   0.000    -.2975177   -.2238852
     childu5 |   .1174038   .0199764     5.88   0.000     .0782419    .1565656
        expr |     .01334   .0026442     5.04   0.000     .0081563    .0185237
      exprsq |  -.0002714   .0000541    -5.01   0.000    -.0003775   -.0001653
    pgradoth |   .0291406   .0088829     3.28   0.001     .0117265    .0465547
    pracsize |   .0248134   .0066149     3.75   0.000     .0118455    .0377812
       _cons |    5.75751   .2124388    27.10   0.000     5.341044    6.173976


. estimates store fmm1 


The fit is quite good for cross-sectional earnings data, with $R^2 = 0.45$. The 
regressors are all statistically significant at level 0.05 and have the expected 
signs. Earnings increase with experience up to a turning point at about 24.6 
years of experience, calculated from the reported coefficients as 
$0.01334/(2 \times 0.0002714)$. 


A graph of the kernel density of residuals from this regression (not 
reported here) is almost symmetric but with fatter tails. In contrast to the 
generated data example given previously, this residual density plot gives no 
hint of multimodality and suggests weak separation of mixture components 
in any mixture model. 


14.3.2 Two-component model estimates 


We next fit the corresponding two-component FMM. 


. * Two-component mixture of (log) normals: ML estimates 
. fmm 2, nolog vce(robust): regress logyearn $xlist 


Finite mixture model                                  Number of obs = 5,384
Log pseudolikelihood = -3820.4547

                |               Robust
                | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
1.Class         |  (base outcome)
2.Class         |
          _cons |   1.145835   .4280244     2.68   0.007     .3069222    1.984747

Class: 1
Response: logyearn
Model: regress

                |               Robust
                | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
logyearn        |
        logyhrs |   .3654256   .0952684     3.84   0.000      .178703    .5521482
         female |  -.3316877   .0725441    -4.57   0.000    -.4738714   -.1895039
        childu5 |   .0892002   .0620325     1.44   0.150    -.0323812    .2107816
           expr |   .0218933   .0088244     2.48   0.013     .0045979    .0391888
         exprsq |  -.0006673   .0001756    -3.80   0.000    -.0010114   -.0003231
       pgradoth |    .104355   .0338084     3.09   0.002     .0380917    .1706184
       pracsize |   .1144076   .0282952     4.04   0.000     .0589499    .1698652
          _cons |   8.630774   .7334559    11.77   0.000     7.193227    10.06832
var(e.logyearn) |   .4314492   .1020731                      .2713631    .6859754

Class: 2
Response: logyearn
Model: regress

                |               Robust
                | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
logyearn        |
        logyhrs |   1.018329   .0611357    16.66   0.000     .8985049    1.138152
         female |  -.2171311   .0254682    -8.53   0.000    -.2670478   -.1672144
        childu5 |   .1230955   .0211052     5.83   0.000       .08173     .164461
           expr |   .0112089   .0028784     3.89   0.000     .0055674    .0168504
         exprsq |  -.0001405    .000055    -2.55   0.011    -.0002483   -.0000327
       pgradoth |  -.0008038   .0125525    -0.06   0.949    -.0254062    .0237986
       pracsize |  -.0084466   .0122798    -0.69   0.492    -.0325145    .0156214
          _cons |   4.273103   .4533413     9.43   0.000      3.38457    5.161636
var(e.logyearn) |   .1555543   .0129337                      .1321625    .1830862

. estimates store fmm2 


The iteration log, suppressed for brevity, shows that optimization uses the 
expectation-maximization method for the first 20 iterations before switching 
to a gradient-based algorithm for 4 iterations until convergence is achieved. 


The first set of computer output gives estimates of the logit model used 
to estimate the class probabilities. With just two components, the model is a 
binary logit model, one that is an intercept-only model because the model 
does not include individual characteristics. The logit coefficient of 1.1458 
implies that the probability of being in the second latent class is
$e^{1.1458}/(1 + e^{1.1458}) = 0.759$.
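
The conversion from the logit intercept to class probabilities can be checked directly. The following is a minimal sketch that plugs the reported coefficient into Stata's invlogit() function; the scalar names are illustrative and are not part of the original analysis.

```stata
* Class probabilities implied by the intercept-only logit
* (coefficient copied from the fmm 2 output above)
scalar b_class2 = 1.145835
scalar pr2 = invlogit(b_class2)      // exp(b)/(1 + exp(b))
scalar pr1 = 1 - pr2
display "Pr(class 2) = " %6.4f pr2
display "Pr(class 1) = " %6.4f pr1
```

Because the logit here contains only an intercept, the same two probabilities apply to every observation.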


The second and third sets of output give the log-linear model estimates 
for the two latent classes. The coefficients can be interpreted as 
semielasticities or, for variables that enter in logs, as elasticities. For 
example, the elasticity with respect to hours worked is much greater in the 
second class (1.02) than in the first class (0.37). A formal significance test of 
equality of class-specific coefficients is given later in this section. 


More detailed interpretation of results follows. 
14.3.3 Predicted conditional means in each component 


The fmm postestimation predict command provides predictions for each 
observation of several quantities. We focus on prediction of the conditional 
mean (option mu), the latent class probabilities (option classpr), and the 
posterior latent probabilities (option classposteriorpr). Other options 
yield predictions of the linear predictor ($x'\widehat{\beta}$), density, distribution, survival 
function, and score. 


. * Two-component mixture: Predicted conditional means for each observation 
. predict lyearnhat*, mu 


. summarize lyearnhat* 


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
  lyearnhat1 |      5,384    11.74914    .3429049   9.718239   12.68126
  lyearnhat2 |      5,384     11.9832    .5590969   8.277566   13.21821


The suffix * leads to prediction for each component. The command predict
lyhat1, mu class(1), for example, predicts for just the first component.


The average predicted means for the two components are 11.75 and 11.98. 
So the second component has a mean that is 0.23 higher on a log scale. The 
second component also has considerably more variability, with standard 
deviation 0.56 versus 0.34. 


The first panel of figure 14.3 provides a plot of the distribution of the
conditional means in the two components. As already stated, the second
component has a higher mean and variance than the first component.

Figure 14.3. Component-wise density of fitted values


. * Distribution of conditional means for the two components
. twoway (kdensity lyearnhat1, lwidth(medthick) clstyle(p1) note(" "))
>     (kdensity lyearnhat2, lwidth(medthick) clstyle(p2) note(" ")),
>     legend(pos(11) ring(0) col(1)) legend(size(small))
>     xtitle("Conditional mean of log earnings")
>     title("Conditional means")


The postestimation command estat lcmean reports for each component
the sample average predicted conditional mean, along with a standard error
and associated 95% confidence interval that controls for estimation error. We
have


. * Two-component mixture: estat lcmean gives average predicted conditional means 
. estat lcmean 


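Latent class marginal means                             Number of obs = 5,384

             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
1            |
    logyearn |   11.74914   .0347218   338.38   0.000     11.68108    11.81719
-------------+----------------------------------------------------------------
2            |
    logyearn |    11.9832   .0192156   623.62   0.000     11.94553    12.02086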


The results 11.75 and 11.98 equal the averages of the predicted conditional 
means obtained earlier. 


14.3.4 Random draws from each fitted component 


The individual observations on log-earnings are normally distributed in each 
component with means given by the preceding predict, mu command and 
variances from the fmm prefix output of, respectively, 0.4314 and 0.1555. 


To plot the predicted distributions for each component, we make a 
random draw from the normal distribution with the predicted mean 
(lyearnhat1 or lyearnhat2) and variance (0.431 or 0.156) that were 
estimated by the fmm prefix. We have 


. * Random draws from the two components
. generate randomly1 = lyearnhat1 + rnormal(0,sqrt(.4314492))

. generate randomly2 = lyearnhat2 + rnormal(0,sqrt(.1555543))

. summarize randomly1 randomly2

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
   randomly1 |      5,384    11.73362    .7318191   8.835665   14.51798
   randomly2 |      5,384     11.9876    .6800995   8.361057   13.86695


. twoway (kdensity randomly1, lwidth(medthick) clstyle(p1) note(" "))
>     (kdensity randomly2, lwidth(medthick) clstyle(p2) note(" "))
>     (kdensity logyearn, lwidth(medthick) clstyle(p3) note(" ")),
>     legend(pos(11) ring(0) col(1)) legend(size(small))
>     xtitle("Random draw of log earnings")
>     title("Random draws")

The second panel of figure 14.3 provides a plot of the densities of the
generated random draws. The actual data have a distribution that lies between
the two components. This example is a case of a mixture that is not well
separated and hence does not deviate from unimodality in distribution.


14.3.5 Marginal effects 


The marginal effects of changes in the regressors on log earnings can be
obtained using the margins command. The following gives the average
marginal effect.


. * Two-component mixture: Average marginal effects 
. margins, dydx(*) 


Average marginal effects                                Number of obs = 5,384
Model VCE: Robust

Expression: Predicted mean (log of annual earnings), using class probabilities,
            predict(mu outcome(logyearn))
dy/dx wrt:  logyhrs female childu5 expr exprsq pgradoth pracsize

             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     logyhrs |   .8608154   .0197326    43.62   0.000     .8221402    .8994905
      female |   -.244768   .0170957   -14.32   0.000     -.278275   -.2112609
     childu5 |   .1149182   .0199822     5.75   0.000     .0757539    .1540826
        expr |   .0137865   .0025413     5.42   0.000     .0088056    .0187674
      exprsq |  -.0002676   .0000525    -5.10   0.000    -.0003705   -.0001647
    pgradoth |   .0245659   .0084442     2.91   0.004     .0080155    .0411162
    pracsize |   .0211921   .0063143     3.36   0.001     .0088163    .0335679


The average of individual marginal effects for log earnings is similar to the
OLS coefficients of the log-linear model given earlier. The average marginal
effect of a change in logyhrs, the most statistically significant regressor, has
increased from 0.804 to 0.861.


Average marginal effects for the first component can be obtained using
the command margins, dydx(*) expression(predict(mu class(1))). For
the second component, replace class(1) with class(2). From output not
included, the average marginal effects vary considerably across the two
categories.


Marginal effects generated from a finite mixture of normals will be a 
weighted average of the component-specific marginal effects. In a linear 
regression model, this weighted average would be exactly the same as that 
generated by the standard one-component model. Hence, if the parameter of 
interest is the aggregate marginal effect, there is no advantage in the mixture 
model. But if the interest is in group-level heterogeneity of marginal effects, 
the fmm prefix provides a useful framework. This issue is covered in greater 
detail in sections 23.2 and 23.3. 
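
The weighted-average interpretation can be illustrated with the numbers reported in this section. The sketch below combines the class-specific logyhrs coefficients with the class probabilities implied by the logit intercept; the scalar names are illustrative, and the calculation assumes class probabilities that do not vary with regressors, as in the fitted fmm2 model.

```stata
* AME of logyhrs as a probability-weighted average of class-specific slopes
scalar w2  = invlogit(1.145835)   // Pr(class 2) from the logit intercept
scalar w1  = 1 - w2               // Pr(class 1)
scalar bh1 = .3654256             // logyhrs coefficient, class 1
scalar bh2 = 1.018329             // logyhrs coefficient, class 2
scalar ame = w1*bh1 + w2*bh2
display "Weighted average of slopes = " %9.7f ame
```

The result can be compared with the logyhrs entry in the margins, dydx(*) output.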


14.3.6 Predicted class probabilities 


The classpr option of the predict command gives the class probabilities
$\pi_{ji}$ for each individual. We obtain


. * Two-component mixture: Predicted class probabilities for each observation 
. predict classprob*, classpr 


. summarize classprob*

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
  classprob1 |      5,384    .2412507           0   .2412507   .2412507
  classprob2 |      5,384    .7587493           0   .7587493   .7587493


In this example, the class probabilities were specified to not vary with
regressors, so they are identical across individuals. They equal 0.241 ($\pi$) and
0.759 ($1 - \pi$), as was manually computed earlier from the fmm prefix output.
The class sizes are approximately 24% in class 1 and 76% in class 2.


The estat lcprob postestimation command reports the sample average 
predicted class probabilities, along with a standard error and associated 
95% confidence interval that controls for estimation error. We have 


. * Two-component mixture: estat lcprob gives average predicted class probs 
. estat lcprob 


Latent class marginal probabilities                     Number of obs = 5,384

             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
       Class |
          1  |   .2412507   .0783494      .1208137    .4238662
          2  |   .7587493   .0783494      .5761338    .8791863


The class probabilities are statistically different from 0 and 1 at the 0.05 
level, though they are not as precisely estimated as the regression 
coefficients in each component. 


14.3.7 Predicted class posterior probabilities 


Class contrasts can be carried out also using estimates of the conditional 
probability of assigning an observation to a particular class. This can be 
done after estimating for each observation the probability of belonging to 
each latent class and then finally assigning it to the highest probability class. 


Let $d_{ij}$ be an indicator variable that assumes value 1 if observation $i$
belongs to class $j$, $j = 1, \ldots, C$. The probability mass function $f(d_{ij}|y_i)$ of
$d_{ij}$ conditional on $y_i$ gives the so-called posterior probability that the $i$th
observation is in class $j$, denoted $\text{post}_{ij}$, with

$$\begin{aligned}
\text{post}_{ij} &= \Pr(\text{observation } i \text{ belongs to class } j \,|\, y_i, x_i) \\
&= \Pr(d_{ij} = 1 \,|\, y_i, x_i) \\
&= \frac{\pi_{ji} f_j(y_i | x_i, \theta_j)}{\sum_{k=1}^{C} \pi_{ki} f_k(y_i | x_i, \theta_k)}
\end{aligned}$$
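
To make the formula concrete, the sketch below evaluates the posterior probabilities by hand for a single hypothetical observation. The outcome value and the two conditional means are made-up illustrative numbers; the class probabilities and variances are those reported for the fitted two-component model, and the scalar names are not part of the original analysis.

```stata
* Posterior class probabilities for one hypothetical observation
scalar y0   = 12.5                  // hypothetical log earnings
scalar m1   = 11.75                 // hypothetical conditional mean, class 1
scalar m2   = 11.98                 // hypothetical conditional mean, class 2
scalar q1   = .2412507              // class probabilities (fitted model)
scalar q2   = .7587493
scalar den1 = normalden(y0, m1, sqrt(.4314492))   // class 1 normal density
scalar den2 = normalden(y0, m2, sqrt(.1555543))   // class 2 normal density
scalar post1 = q1*den1/(q1*den1 + q2*den2)
scalar post2 = q2*den2/(q1*den1 + q2*den2)
display "post1 = " %6.4f post1 "   post2 = " %6.4f post2
```

The two posterior probabilities sum to one by construction, and they move away from the class probabilities whenever the observed outcome is more likely under one component's density than the other.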


The posterior probabilities are computed using the predict command 
with option classposteriorpr, as follows: 


. * Two-component mixture: Predicted posterior class probabilities for each obs 
. predict postprob*, classposteriorpr 


. summarize postprob*

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
   postprob1 |      5,384    .2412507    .1979325   .0046337          1
   postprob2 |      5,384    .7587493    .1979325   3.80e-19   .9953663


Unlike the class probabilities, the posterior probabilities vary considerably 
across individuals. At the extremes, one individual is viewed as being in the 
first latent class with probability 1.000, while another is viewed as being in 
the second latent class component with probability 0.995. 


In this example, the posterior probabilities on average equal the class 
probabilities. This is because the class probabilities do not vary with 
regressors, so $\pi_{ji} = \pi_j$. Some algebra shows that then

$$\frac{1}{N}\sum_{i=1}^{N} \text{post}_{ij} = \frac{\pi_j}{\sum_{k=1}^{C} \pi_k} = \pi_j$$

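
One way to see why the averages coincide is sketched below. It assumes that the class probabilities are free parameters constrained only to sum to one and that the estimates satisfy the first-order conditions of the likelihood; the Lagrange multiplier $\lambda$ is notation introduced here for illustration, and this route may differ from the algebra the authors have in mind.

```latex
\begin{align*}
\mathcal{L} &= \sum_{i=1}^{N} \ln\Big\{ \sum_{k=1}^{C} \pi_k f_k(y_i|x_i,\theta_k) \Big\}
   - \lambda \Big( \sum_{k=1}^{C} \pi_k - 1 \Big) \\
\frac{\partial \mathcal{L}}{\partial \pi_j} = 0
  &\;\Rightarrow\; \sum_{i=1}^{N} \frac{f_j(y_i|x_i,\theta_j)}{\sum_{k=1}^{C} \pi_k f_k(y_i|x_i,\theta_k)} = \lambda \\
\text{multiplying by } \pi_j
  &\;\Rightarrow\; \sum_{i=1}^{N} \text{post}_{ij} = \lambda \pi_j \\
\text{summing over } j
  &\;\Rightarrow\; N = \lambda \sum_{j=1}^{C} \pi_j = \lambda \\
  &\;\Rightarrow\; \frac{1}{N} \sum_{i=1}^{N} \text{post}_{ij} = \pi_j
\end{align*}
```
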
14.3.8 Model comparison and selection 


To choose between alternative models, one can use the standard approach of 
choosing the model with the best fit to the data as measured by penalized log 
likelihood. Akaike’s information criterion (AIC) and the Bayesian 
information criterion (BIC) are commonly used information criteria; see 
section 13.8.2. In Stata, these criteria are defined as follows: 


$$\text{AIC} = -2\ln L + 2K$$
$$\text{BIC} = -2\ln L + K\ln N$$


Smaller values are preferred. The relevant statistics are obtained using the 
estimates stats postestimation command. While each penalizes a model 
with more parameters (K), the BIC applies a stricter penalty. 
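
As a quick illustration of these formulas, the sketch below computes both criteria for the two-component model from its reported log pseudolikelihood. The parameter count K = 19 is inferred here as two sets of eight coefficients plus a variance, together with one logit intercept; the scalar names are illustrative.

```stata
* AIC and BIC for the two-component model, computed from its log likelihood
scalar lnL = -3820.4547          // log pseudolikelihood of the fmm 2 fit
scalar K   = 19                  // 2 x (8 coefficients + 1 variance) + 1 logit intercept
scalar N   = 5384                // number of observations
scalar aic = -2*lnL + 2*K
scalar bic = -2*lnL + K*ln(N)
display "AIC = " %9.3f aic "   BIC = " %9.3f bic
```

The values can be compared with the fmm2 row of the estimates stats output. With N = 5,384, ln N is about 8.6, so each additional parameter costs far more under BIC than the 2 it costs under AIC.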


To illustrate model comparison and model selection, we fit models with 
one to four components. Adding an additional component, especially if it is 
small, may slow down convergence. A common occurrence in such a case is 
the message “not concave” in the iteration log; this indicates that the 
algorithm has encountered a segment of log likelihood where concavity fails. 
This can be a temporary feature in intermediate iterations that disappears; 
see section 16.3. However, in a highly parameterized model with too many 
components, it can be a persistent feature that indicates that the model is not 
identified; see section 16.3.5. 


. * Estimate one- two- three- and four-component mixture of normals
. qui fmm 1, vce(robust): regress logyearn $xlist

. estimates store fmm1

. qui fmm 2, vce(robust): regress logyearn $xlist

. estimates store fmm2

. qui fmm 3, vce(robust): regress logyearn $xlist

. estimates store fmm3

. estat lcprob

Latent class marginal probabilities                     Number of obs = 5,384

             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
       Class |
          1  |   .3338411   .0333557      .2719343    .4020586
          2  |   .0493479    .015153      .0268224    .0890591
          3  |   .6168109   .0321315      .5522054    .6775385


. qui fmm 4, vce(robust): regress logyearn $xlist 


. estimates store fmm4 


. estat lcprob 


Latent class marginal probabilities                     Number of obs = 5,384

             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
       Class |
          1  |   .0194533   .0236252      .0017479    .1835359
          2  |   .3534858    .047536      .2666995    .4511388
          3  |   .0668228   .0317563      .0257136    .1626805
          4  |   .5602382   .0402829      .4804263    .6370522


In the fmm3 case, observe that the second component is the smallest; 
furthermore, the output is no longer ordered in terms of the size ranking of 
the component. That is, the labeling of the components is not the same as in 
the case of the fmm2 model. The fmm4 model has a component with 
probability of only 0.019. One might be concerned about overfitting, given 
the standard error of 0.024. 


We next compare the models on the basis of information criteria. 


. * Model comparison: one- two- three- and four-component mixture of normals
. estimates stats fmm1 fmm2 fmm3 fmm4

Akaike's information criterion and Bayesian information criterion

       Model |          N   ll(model)      df        AIC        BIC
-------------+-----------------------------------------------------
        fmm1 |      5,384   -4077.703       9   8173.406   8232.727
        fmm2 |      5,384   -3820.455      19   7678.909   7804.142
        fmm3 |      5,384   -3708.813      29   7475.626    7666.77
        fmm4 |      5,384   -3684.204      39   7446.408   7703.464


Note: BIC uses N = number of observations. See [R] BIC note. 


According to the AIC and BIC criteria, fmm3 fits the data better than fmm2. fmm4
does even better on grounds of AIC though not BIC. It is common to
emphasize parsimony and use BIC, in which case the fmm3 model is preferred.


14.3.9 Tests of equal coefficients 


To test whether individual coefficients or subsets of coefficients in different 
components are significantly different, we can apply a formal test. We 
illustrate the method by using the fmm2 model to test the equality of a single 
coefficient and then the joint equality of two coefficients. The first step is to
run the model, and then the command fmm, coeflegend is used to collect
variable labels.


. * Coeflegend gives full names of regression coefficients
. qui fmm 2, vce(robust): regress logyearn logyhrs female childu5
>     expr exprsq pgradoth pracsize

. fmm, coeflegend

Finite mixture model                                    Number of obs = 5,384
Log pseudolikelihood = -3820.4547

                | Coefficient  Legend
----------------+----------------------------------------------------------------
1.Class         |  (base outcome)
----------------+----------------------------------------------------------------
2.Class         |
          _cons |   1.145835  _b[2.Class:_cons]

Class:    1
Response: logyearn
Model:    regress

                | Coefficient  Legend
----------------+----------------------------------------------------------------
logyearn        |
        logyhrs |   .3654256  _b[logyearn:1.Class#c.logyhrs]
         female |  -.3316877  _b[logyearn:1.Class#c.female]
        childu5 |   .0892002  _b[logyearn:1.Class#c.childu5]
           expr |   .0218933  _b[logyearn:1.Class#c.expr]
         exprsq |  -.0006673  _b[logyearn:1.Class#c.exprsq]
       pgradoth |    .104355  _b[logyearn:1.Class#c.pgradoth]
       pracsize |   .1144076  _b[logyearn:1.Class#c.pracsize]
          _cons |   8.630774  _b[logyearn:1.Class]
----------------+----------------------------------------------------------------
var(e.logyearn) |   .4314492  _b[/var(e.logyearn)#1.Class]

Class:    2
Response: logyearn
Model:    regress

                | Coefficient  Legend
----------------+----------------------------------------------------------------
logyearn        |
        logyhrs |   1.018329  _b[logyearn:2.Class#c.logyhrs]
         female |  -.2171311  _b[logyearn:2.Class#c.female]
        childu5 |   .1230955  _b[logyearn:2.Class#c.childu5]
           expr |   .0112089  _b[logyearn:2.Class#c.expr]
         exprsq |  -.0001405  _b[logyearn:2.Class#c.exprsq]
       pgradoth |  -.0008038  _b[logyearn:2.Class#c.pgradoth]
       pracsize |  -.0084466  _b[logyearn:2.Class#c.pracsize]
          _cons |   4.273103  _b[logyearn:2.Class]
----------------+----------------------------------------------------------------
var(e.logyearn) |   .1555543  _b[/var(e.logyearn)#2.Class]


The next step is to implement the test, here on logyhrs. 


. * FMM2: Test of a single restriction across latent classes
. test (_b[logyearn:1.Class#c.logyhrs] = _b[logyearn:2.Class#c.logyhrs])

 ( 1)  [logyearn]1bn.Class#c.logyhrs - [logyearn]2.Class#c.logyhrs = 0

           chi2(  1) =   36.32
         Prob > chi2 =    0.0000


The single restriction that logyhrs has equal coefficients across the two 
components is easily rejected. 


The identical result is obtained using the contrast command. 


. * FMM2: Same test using the contrast command 
. contrast c.logyhrs#a.Class, equation(logyearn) 


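Contrasts of marginal linear predictions

Margins: asbalanced

                 |         df        chi2     P>chi2
-----------------+----------------------------------
logyearn         |
 Class#c.logyhrs |          1       36.32     0.0000

                 |   Contrast   Std. err.     [95% conf. interval]
-----------------+------------------------------------------------
logyearn         |
 Class#c.logyhrs |
       (1 vs 2)  |   -.652903   .1083366     -.8652388   -.4405673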
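
The link between the two sets of output can be verified by hand. The sketch below rebuilds the Wald statistic from the reported class-specific coefficients and the standard error of the contrast; the scalar names are illustrative.

```stata
* Wald statistic for equal logyhrs coefficients, rebuilt from reported output
scalar dcoef = .3654256 - 1.018329    // class 1 minus class 2 coefficient
scalar dse   = .1083366               // standard error reported by contrast
scalar wald  = (dcoef/dse)^2
display "difference = " %9.6f dcoef
display "chi2(1)    = " %6.2f wald "   p = " %6.4f chi2tail(1, wald)
```

The squared ratio of the contrast to its standard error gives the chi-squared statistic with one degree of freedom that both test and contrast report.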


We next implement a joint test of whether the two variables female and 
childu5 have equal coefficients across the two classes. 


. * FMM2: Test of a joint restriction across latent classes 
. test (_b[logyearn:1.Class#c.female] = _b[logyearn:2.Class#c.female]) 
> (_b[logyearn:1.Class#c.childu5]= _b[logyearn:2.Class#c.childu5]) 


 ( 1)  [logyearn]1bn.Class#c.female - [logyearn]2.Class#c.female = 0
 ( 2)  [logyearn]1bn.Class#c.childu5 - [logyearn]2.Class#c.childu5 = 0

           chi2(  2) =    1.66
         Prob > chi2 =    0.4365


This joint restriction has a much higher p-value (0.437) and is not rejected at,
for example, the 0.05 level.


Overall significant differences across classes can arise from a small 
number of key differences. Section 23.3.2 shows how even greater flexibility 
in specifying mixture components can be attained by placing exclusion 
restrictions on them. 


14.3.10 More flexible FMM models 


In some cases, it may be desirable to relax the latent class restriction that the
component weights are constant. The additional flexibility comes at a cost.
The modeler will need to offer reasons for including a regressor in the
probability function rather than the mean function. Sometimes, the same
variable may appear in both functions. To illustrate the dilemma, we
estimate a simplified variant of the fmm2 model considered earlier.


. * FMM2 model with varying mixture probabilities
. fmm 2, lcprob(female) lcbase(2) nolog vce(robust):
>     regress logyearn logyhrs expr exprsq pgradoth pracsize

Finite mixture model                                    Number of obs = 5,384
Log pseudolikelihood = -4014.1217

                |               Robust
                | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
----------------+----------------------------------------------------------------
1.Class         |
         female |   3.229546   .3096566    10.43   0.000      2.62263    3.836461
          _cons |  -.1219929    .237477    -0.51   0.607    -.5874393    .3434535
----------------+----------------------------------------------------------------
2.Class         |  (base outcome)

Class:    1
Response: logyearn
Model:    regress

                |               Robust
                | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
----------------+----------------------------------------------------------------
logyearn        |
        logyhrs |   .8901694   .0277857    32.04   0.000     .8357105    .9446283
           expr |   .0001189   .0036869     0.03   0.974    -.0071074    .0073451
         exprsq |   -.000094   .0000867    -1.08   0.278    -.0002639    .0000759
       pgradoth |   .0346003   .0124546     2.78   0.005     .0101898    .0590108
       pracsize |   .0492521   .0089879     5.48   0.000     .0316361    .0668681
          _cons |   4.955431   .2040458    24.29   0.000     4.555509    5.355354
----------------+----------------------------------------------------------------
var(e.logyearn) |   .2370821   .0124561                      .2138834    .2627969

Class:    2
Response: logyearn
Model:    regress

                |               Robust
                | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
----------------+----------------------------------------------------------------
logyearn        |
        logyhrs |   .5326659   .0889074     5.99   0.000     .3584105    .7069212
           expr |     .03665   .0064563     5.68   0.000     .0239959    .0493041
         exprsq |  -.0006263   .0001183    -5.29   0.000    -.0008581   -.0003944
       pgradoth |   .0146541   .0183619     0.80   0.425    -.0213345    .0506427
       pracsize |   -.042831   .0143324    -2.99   0.003    -.0709221     -.01474
          _cons |   8.004132   .6879643    11.63   0.000     6.655747    9.352517
----------------+----------------------------------------------------------------
var(e.logyearn) |   .1710046   .0107656                      .1511543    .1934618


In this model specification, the mixing weight $\pi$ is a function of a binary
indicator of gender (female). The heuristic notion is that being female may
lower the probability of being in a higher average income class, perhaps
because of gender-related wage discrimination. In this example, we treat the
higher income group, class 2, as the base group.


The estimates show that being female raises the probability of being in
class 1. However, note that the interpretation of results will be more complex
if our specification also uses female as a regressor in the mean function and,
further, adds other variables such as childu5. Unfortunately, it may not be
easy to get clear guidance on how different factors affect the outcome,
whether through mixing proportions, or through the conditional mean, or
both. Hence, an improvement in the fit of the model resulting from
parameterizing $\pi$ can be interpreted as resulting from the use of a more
flexible functional form. For further discussion of related issues, see
section 23.3.
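
The size of the gender effect on class membership is easier to read on the probability scale. The following sketch converts the reported logit coefficients for class 1 into implied class probabilities for men and women; the scalar names are illustrative and not part of the original analysis.

```stata
* Implied probability of class 1 by gender, from the lcprob(female) logit
scalar g_female = 3.229546           // logit coefficient on female (class 1)
scalar g_cons   = -.1219929          // logit intercept (class 1)
scalar p1_male   = invlogit(g_cons)
scalar p1_female = invlogit(g_cons + g_female)
display "Pr(class 1 | male)   = " %5.3f p1_male
display "Pr(class 1 | female) = " %5.3f p1_female
```

Under these estimates, the implied probability of being in class 1 is roughly 0.47 for men and 0.96 for women.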


14.4 Global polynomials 


Most regression models, either models of the conditional mean or fully
parametric models, enter regressors as a linear combination of the form $x'\beta$.


A linear model may appear restrictive. It is linear in the parameters $\beta$.
Nonlinearity can be introduced, however, by setting the regressors x to be
nonlinear transformations of the underlying variables of interest.


In this section, we use global polynomials that fit a single polynomial 
model throughout the range of the regressors. In the subsequent section, we 
use regression splines that fit separate polynomials over subranges of x in 
such a way that these separate polynomials are connected (or splined) and 
lead to a continuous function for E(y|x). 


Polynomials and splines can be included as regressors in any type of 
regression, not just least-squares regression. In the special case of least- 
squares regression, the npregress series command can be used to fit 
polynomial and spline models. This command is deferred to section 14.6.6 
and presented in further detail in chapter 27. 


14.4.1 Global polynomials 


The following dataset generates dependent variable y as a function of the 
correlated variables x1, x2, z, and zsq. 


. * Generated data: y = 1 + 1*x1 + 1*x2 + f(z) + u where f(z) = z + z^2
. clear

. set obs 200
Number of observations (_N) was 0, now 200.

. set seed 10101

. generate x1 = rnormal()

. generate x2 = rnormal() + 0.5*x1

. generate z = rnormal() + 0.5*x1

. generate zsq = z^2

. generate y = 1 + x1 + x2 + z + zsq + 2*rnormal()

. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          x1 |        200    .0301211    1.014172  -3.170636   3.093716
          x2 |        200    .0226274    1.158216  -4.001105   3.049917
           z |        200    .0664539    1.146429  -3.386704    2.77135
         zsq |        200    1.312145    1.658477   .0000183   11.46977
           y |        200    2.164401    3.604061  -5.468721   14.83116

We consider regression of y on a polynomial function of z. Because the 
omitted variables x1 and x2 are correlated with z, the model with x1 and x2 
omitted is not necessarily quadratic in z. We fit a quartic model in z, using 
factor-variable notation. 


. * Quartic global polynomial model
. reg y c.z##c.z##c.z##c.z, vce(robust)

Linear regression                               Number of obs     =        200
                                                F(4, 195)         =      96.68
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4889
                                                Root MSE          =     2.6028

                 |               Robust
               y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-----------------+----------------------------------------------------------------
               z |   1.398768   .2765072     5.06   0.000     .8534389    1.944096
         c.z#c.z |   .8034603   .2273094     3.53   0.001     .3551597    1.251761
     c.z#c.z#c.z |   .0918065   .0554654     1.66   0.099    -.0175826    .2011957
 c.z#c.z#c.z#c.z |   .0265145   .0274171     0.97   0.335    -.0275577    .0805866
           _cons |   .8862917   .2658742     3.33   0.001     .3619334     1.41065


The cubic and quartic terms turn out to be statistically insignificant at level
0.05.


The following code generates a figure that plots predictions from this 
quartic model and compares these to a plot of predictions from a quadratic 
model obtained directly using the qfit graphics command. 


. * Graph comparing quartic model predictions to quadratic model predictions 
. predict yquartic, xb 


. sort z 


. twoway (scatter y z, msize(small)) (qfit y z, lwidth(medthick) clstyle(p2))
>     (line yquartic z, lwidth(medthick)), scale(1.2)
>     legend(pos(11) ring(0) col(1)) legend(size(small))
>     legend(label(1 "Actual data") label(2 "Quadratic") label(3 "Quartic"))


Figure 14.4 shows that the quartic overfits by seeking to predict well at 
the extreme values of z. 


Figure 14.4. Plot of fitted quadratic and quartic models

14.4.2 Fractional polynomials and orthogonal polynomials 


Fractional polynomials allow polynomials to be raised to noninteger powers.
A simple example is $y = \beta_1 + \beta_2 x^{1/2}$.


The fp prefix fits such models. As an example, we refit the quartic 
model, in which the powers are actually the integer values 1, 2, 3, and 4. The 
fp prefix requires that the regressor be positive, so we use the scale option 
that appropriately transforms the regressor. We have 


. * Quartic model estimated using fractional polynomial command
. fp <z>, fp(1 2 3 4) scale replace: regress y <z>

-> regress y z_1 z_2 z_3 z_4

      Source |       SS           df       MS      Number of obs   =       200
-------------+----------------------------------   F(4, 195)       =     46.64
       Model |  1263.81882         4  315.954706   Prob > F        =    0.0000
    Residual |  1321.04322       195  6.77458062   R-squared       =    0.4889
-------------+----------------------------------   Adj R-squared   =    0.4784
       Total |  2584.86204       199  12.9892565   Root MSE        =    2.6028

           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         z_1 |  -5.004738   3.621983    -1.38   0.169    -12.14803    2.138552
         z_2 |   1.695515   2.055445     0.82   0.410    -2.358242    5.749272
         z_3 |  -.2673974    .469694    -0.57   0.570     -1.19373     .658935
         z_4 |   .0265145   .0369361     0.72   0.474    -.0463311      .09936
       _cons |   5.287323   2.307045     2.29   0.023     .7373604    9.837286


. predict yfpquartic 
(option xb assumed; fitted values) 


. correlate yquartic yfpquartic
(obs=200)

             | yquartic yfpqua~c
-------------+------------------
    yquartic |   1.0000
  yfpquartic |   1.0000   1.0000


Because of the rescaling of z, the coefficient estimates differ from those
given earlier for the quartic model, aside from that for the fourth-order term.
But the model fit is the same and yields the same predicted values. The fp
prefix can also include $\ln x$ and interactions with $\ln x$ as regressors. And the
mfp prefix extends to polynomials in more than one variable.


The terms in a polynomial can be highly collinear, leading in some cases 
to numerical problems in computing an estimator. The orthpoly command 
creates, for integer-powered polynomials, a linear transformation that creates 
a set of polynomials that are uncorrelated with each other. 


For example, to transform the quartic polynomials for z to uncorrelated 
polynomials, we give the command 


. * Orthogonalize the quartic polynomials
. orthpoly z, generate(pz*) deg(4)

. correlate pz*
(obs=200)

             |      pz1      pz2      pz3      pz4
-------------+------------------------------------
         pz1 |   1.0000
         pz2 |  -0.0000   1.0000
         pz3 |   0.0000   0.0000   1.0000
         pz4 |   0.0000   0.0000  -0.0000   1.0000


The command regress y pz* will then lead to the same fitted values of y as 
the initial quartic regression of y on the first four powers of z. 
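
The equivalence of the two sets of fitted values can be checked numerically. The sketch below regenerates data with the same commands and seed used earlier in this section, fits both regressions, and asserts that the predictions agree; the variable names yraw and yorth and the tolerance are illustrative choices, not part of the original analysis.

```stata
* Check that raw and orthogonalized quartic regressions give the same fit
clear
set obs 200
set seed 10101
generate x1 = rnormal()
generate x2 = rnormal() + 0.5*x1
generate z  = rnormal() + 0.5*x1
generate y  = 1 + x1 + x2 + z + z^2 + 2*rnormal()

* Quartic in z using factor-variable notation
quietly regress y c.z##c.z##c.z##c.z
predict double yraw, xb

* Same model on orthogonalized polynomial terms
orthpoly z, generate(pz*) degree(4)
quietly regress y pz*
predict double yorth, xb

* Fitted values should agree up to numerical precision
assert reldif(yraw, yorth) < 1e-5
correlate yraw yorth
```

Only the parameterization changes: the orthogonalized regressors span the same space as the first four powers of z and a constant, so the fitted values are unchanged while the individual coefficients are not comparable.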


14.5 Regression splines 


Spline regression, or piecewise regression, breaks the range of x into a few 
segments, for example, five segments, and fits a separate polynomial 
regression within each segment. This is done in a way that ensures that the 
predictions coincide at the segment boundaries, so that the predictions are 
continuous in the regressor. Leading examples are piecewise linear regression 
and cubic splines. While the discussion here considers only OLS regression, 
the splines can be used in more general models such as in logistic regression. 


Nonparametric estimation of these models using the npregress series 
command is deferred to section 14.6.6. 


14.5.1 Piecewise linear regression 


We begin with piecewise linear regression with the range of x broken into
three segments: $(-\infty, c)$, $[c, d)$, and $[d, \infty)$. The segment boundaries c and d
are called knots and are specified by the researcher rather than being
parameters to be estimated.


Then $y = f(x) + u$, where

$$f(x) = 1(x < c)(\alpha_1 + \alpha_2 x) + 1(c \le x < d)(\alpha_3 + \alpha_4 x) + 1(x \ge d)(\alpha_5 + \alpha_6 x)$$

where $1(A) = 1$ if event $A$ occurs and $1(A) = 0$ otherwise.


If this regression is run, then the three separate lines will not connect at
the knots c and d. To ensure continuity at c, so $f(c-) = f(c)$, we need
$\alpha_1 + \alpha_2 c = \alpha_3 + \alpha_4 c$. And to ensure continuity at d, so $f(d-) = f(d)$, we
need $\alpha_3 + \alpha_4 d = \alpha_5 + \alpha_6 d$.


With some algebra, it can be shown that these two constraints are met if 
we define 


$$f(x) = \beta_0 + \beta_1 x + \beta_2 (x - c)_+ + \beta_3 (x - d)_+ \tag{14.1}$$


where $(\cdot)_+$ denotes the Heaviside function. For example,

$$(x - c)_+ = \begin{cases} x - c & \text{if } x > c \\ 0 & \text{otherwise} \end{cases}$$


Four variables (including the intercept) appear in (14.1) because two 
constraints were imposed on the original unconstrained model for f(x). An 
end-of-chapter exercise demonstrates that (14.1) yields the desired piecewise 
linear and continuous function in the simpler case of one knot. 
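
The link between (14.1) and the segment-specific lines can be sketched by reading off the intercept and slope of $f(x)$ in each segment. This is a worked restatement of the two continuity constraints, using $\alpha_1, \ldots, \alpha_6$ for the intercepts and slopes of the unconstrained model, rather than an additional result.

```latex
\begin{align*}
x < c:       &\quad f(x) = \beta_0 + \beta_1 x
             &&\Rightarrow\ \alpha_1 = \beta_0,\ \alpha_2 = \beta_1 \\
c \le x < d: &\quad f(x) = (\beta_0 - \beta_2 c) + (\beta_1 + \beta_2)\,x
             &&\Rightarrow\ \alpha_3 = \beta_0 - \beta_2 c,\ \alpha_4 = \beta_1 + \beta_2 \\
x \ge d:     &\quad f(x) = (\beta_0 - \beta_2 c - \beta_3 d) + (\beta_1 + \beta_2 + \beta_3)\,x
             &&\Rightarrow\ \alpha_5 = \beta_0 - \beta_2 c - \beta_3 d,\ \alpha_6 = \beta_1 + \beta_2 + \beta_3 \\[4pt]
\text{at } c: &\quad \alpha_3 + \alpha_4 c = \beta_0 - \beta_2 c + (\beta_1 + \beta_2) c = \beta_0 + \beta_1 c = \alpha_1 + \alpha_2 c \\
\text{at } d: &\quad \alpha_5 + \alpha_6 d = \beta_0 - \beta_2 c + (\beta_1 + \beta_2) d = \alpha_3 + \alpha_4 d
\end{align*}
```

In this parameterization, $\beta_2$ and $\beta_3$ are the changes in slope at the knots c and d, which is why six unconstrained parameters reduce to four.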


We apply this method for regression of y on z using the same data as 
those used in the global polynomials example. 


We specify three segments with knots at $-1$ and 1 and create three
regressors zseg1-zseg3 corresponding to $x$, $(x - c)_+$, and $(x - d)_+$ in
(14.1).


. * Create the basis function manually with three segments and knots at -1 and 1 
. generate zseg1 = z 


. generate zseg2 = 0 


. replace zseg2 = z - (-1) if z > -1 
(163 real changes made) 


. generate zseg3 = 0 


. replace zseg3 = z - 1 if z > 1 
(47 real changes made) 


Regression of y on an intercept and zseg1—zseg3, and subsequent 
prediction, yields 


. * Piecewise linear regression with three sections
. regress y zseg1 zseg2 zseg3, vce(robust)

Linear regression                               Number of obs     =        200
                                                F(3, 196)         =      97.11
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4849
                                                Root MSE          =     2.6064

             |               Robust
           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       zseg1 |  -1.629491   .5128884    -3.18   0.002     -2.64098    -.618003
       zseg2 |   2.977586   .7302596     4.08   0.000     1.537411    4.417761
       zseg3 |   4.594974    .855389     5.37   0.000     2.908026    6.281922
       _cons |  -1.850531   .7809994    -2.37   0.019    -3.390772   -.3102895

. predict yhat
(option xb assumed; fitted values)

. twoway (scatter y z) (line yhat z, sort lwidth(thick)),
>     title("Piecewise linear: y=a+f(z)+u") ytitle("y and f(z)") xtitle("z")
>     legend(off)


The first panel of figure 14.5 plots $\widehat{f}(z)$, the predicted value of y, against
z. As expected, there are separate lines in the three segments, and the lines
are connected.

Figure 14.5. Piecewise linear: Single regressor z without and with
additional regressors


The mkspline command, with the marginal option, creates the same 
linear spline variables. We have 


. * Repeat piecewise linear using command mkspline to create the basis functions
. mkspline zmk1 -1 zmk2 1 zmk3 = z, marginal

. summarize zseg1 zmk1 zseg2 zmk2 zseg3 zmk3, sep(8)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       zseg1 |        200    .0664539    1.146429  -3.386704    2.77135
        zmk1 |        200    .0664539    1.146429  -3.386704    2.77135
       zseg2 |        200    1.171111     .984493          0    3.77135
        zmk2 |        200    1.171111     .984493          0    3.77135
       zseg3 |        200     .138441    .3169973          0    1.77135
        zmk3 |        200     .138441    .3169973          0    1.77135

. regress y zmk1 zmk2 zmk3, vce(robust) noheader

             |               Robust
           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        zmk1 |  -1.629491   .5128884    -3.18   0.002     -2.64098    -.618003
        zmk2 |   2.977586   .7302596     4.08   0.000     1.537411    4.417761
        zmk3 |   4.594974    .855389     5.37   0.000     2.908026    6.281922
       _cons |  -1.850531   .7809994    -2.37   0.019    -3.390772   -.3102895


The output confirms that the mkspline command creates the same variables
and leads to the same regression results. 
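

With the marginal option, each coefficient after the first is the change in
slope at a knot, so the slope within a segment is a running sum of
coefficients. In the output above, for example, the implied slope on the
middle segment is -1.629 + 2.978 = 1.348. The following is a minimal sketch of
how such sums and their standard errors might be obtained with lincom. It uses
simulated data from an assumed DGP, not the chapter's dataset, and the
variable names zm1-zm3 are hypothetical.

```stata
* Sketch: segment slopes from a marginal linear spline
* (simulated data; the DGP and the names zm1-zm3 are assumptions)
clear
set obs 200
set seed 10101
generate z = rnormal()
generate y = 1 - 2*z + 3*max(z + 1, 0) + 4*max(z - 1, 0) + rnormal()
mkspline zm1 -1 zm2 1 zm3 = z, marginal
regress y zm1 zm2 zm3, vce(robust)
lincom zm1                 // slope for z < -1
lincom zm1 + zm2           // slope for -1 <= z < 1
lincom zm1 + zm2 + zm3     // slope for z >= 1
```

Omitting the marginal option from mkspline would instead make each coefficient
the slope of its own segment, at the cost of losing the direct test of whether
the slope changes at a knot.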


We now include additional regressors x1 and x2 and verify that the
resulting prediction $\hat{f}(z)$ remains a piecewise linear and continuous
function of z.


. * Piecewise regression with additional regressors x1 and x2 
. regress y x1 x2 zmk1 zmk2 zmk3, vce(robust)

Linear regression                               Number of obs     =        200
                                                F(5, 194)         =     110.06
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6830
                                                Root MSE          =     2.0551

------------------------------------------------------------------------------
             |               Robust
           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x1 |   .8776081   .1582844     5.54   0.000     .5654288    1.189787
          x2 |    .987047   .1327458     7.44   0.000     .7252369    1.248857
        zmk1 |  -2.993494   .4841391    -6.18   0.000    -3.948346   -2.038642
        zmk2 |   3.886034   .6262328     6.21   0.000     2.650936    5.121133
        zmk3 |   4.665117   .6898819     6.76   0.000     3.304485    6.025748
       _cons |  -2.882261   .6715345    -4.29   0.000    -4.206706   -1.557815
------------------------------------------------------------------------------

. generate partresid = y - _b[_cons] - _b[x1]*x1 - _b[x2]*x2

. generate fz = _b[zmk1]*zmk1 + _b[zmk2]*zmk2 + _b[zmk3]*zmk3

. twoway (scatter partresid z) (line fz z, sort lwidth(thick)), xtitle("z")
> title("Piecewise linear: y=a+b*x1+c*x2+f(z)+u") legend(off)
> ytitle("Partial residual and f(z)")


The second panel of figure 14.5 confirms that $\hat{f}(z)$ is a piecewise linear and
continuous function of z. 


In general, if a piecewise linear and continuous function $f(x)$ is desired
for a regressor $x$ and there are $K$ knots at $c_1,\ldots,c_K$, then


$$f(x) = \beta_0 + \beta_1 x + \beta_2 (x - c_1)_+ + \cdots + \beta_{K+1}(x - c_K)_+$$


A common procedure is to use equal numbers of observations in each 
segment. For example, with three knots, set the knots at the 25th, 50th, and 
75th percentiles of x. 
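

As a minimal sketch of this equal-frequency choice, the pctile option of
mkspline places the knots at percentiles of the regressor. The example below
uses simulated data from an assumed DGP and the hypothetical stub zq; asking
for four spline variables places three knots at the quartiles.

```stata
* Sketch: linear spline with knots at the 25th, 50th, and 75th percentiles
* (simulated data; the DGP and the stub zq are assumptions)
clear
set obs 200
set seed 10101
generate z = rnormal()
generate y = 1 + z^2 + rnormal()
mkspline zq 4 = z, marginal pctile displayknots
regress y zq1-zq4, vce(robust)
```

The displayknots option lists the knot values that were chosen, which makes it
easy to check that roughly a quarter of the sample falls in each segment.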


14.5.2 Natural or restricted cubic splines 


The piecewise linear example can be extended to higher-order polynomials.
A common choice is a cubic spline because it is the lowest-degree regression
spline where the graph of $f(x)$ on x appears to be smooth and continuous to
the naked eye.


For a cubic spline with K knots, we require $f(x)$, $f'(x)$, and $f''(x)$ to be
continuous at the K knots. This imposes additional constraints to those in the
linear case. It can be shown that we can do OLS with


$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - c_1)_+^3 + \cdots + \beta_{K+3}(x - c_K)_+^3$$


An end-of-chapter exercise verifies a similar result in the simpler case of a 
quadratic spline with one knot. The variables in the preceding expression are 
called basis functions. 


In the cubic case with K knots, there are (K + 4) basis functions,
including the intercept. A limitation, however, is that the cubic spline overfits 
at the boundaries of the data. 


A natural spline or restricted spline is an adaptation that restricts the 
relationship to be linear past the lower and upper knots. This reduction from 
cubic to linear imposes four restrictions—two at each endpoint—so in the 
cubic case with K knots, there are K basis functions, including the intercept.
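

The count can be written out as a short piece of arithmetic. The case K = 5 is
used here purely as an illustration.

```latex
\underbrace{K + 4}_{\text{cubic spline}}
\;-\; \underbrace{4}_{\text{linearity restrictions}} \;=\; K ,
\qquad \text{so } K = 5 \text{ gives } 9 - 4 = 5 \text{ basis functions:}
\text{ an intercept and four spline variables.}
```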


Natural or restricted cubic splines can be created using the cubic option
of the mkspline command. We consider an example with five knots. With the
nknots(5) option, the knots default to the 5, 27.5, 50, 72.5, and 95
percentiles of the regressor. The smallest and largest knots are
specified to be relatively close to the boundaries of the data, so that the linear
restriction applies to a relatively small amount of the data.


With five knots, we obtain 


. * Natural or restricted cubic spline regression of y on z 
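. mkspline zspline = z, cubic nknots(5) displayknots

             |     knot1      knot2      knot3      knot4      knot5
-------------+-------------------------------------------------------
           z | -1.707005  -.6282565   .0096633   .8139451   1.889486

. regress y zspline*, vce(robust)

Linear regression                               Number of obs     =        200
                                                F(4, 195)         =      72.39
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4846
                                                Root MSE          =     2.6138

------------------------------------------------------------------------------
             |               Robust
           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    zspline1 |  -1.577777   .5393653    -2.93   0.004    -2.641515   -.5140385
    zspline2 |   8.974681   3.677336     2.44   0.016     1.722224    16.22714
    zspline3 |   -32.6592   20.52346    -1.59   0.113    -73.13566    7.817256
    zspline4 |   46.24063   31.10605     1.49   0.139    -15.10685    107.5881
       _cons |  -1.745086   .8845348    -1.97   0.050    -3.489569   -.0006031
------------------------------------------------------------------------------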


The knots correspond to z values of -1.707, ..., 1.889.


We then create a plot of the fitted values against z. 


. * Plot the predicted values from natural cubic spline regression
. predict yhatnatural
(option xb assumed; fitted values)

. twoway (scatter y z) (line yhatnatural z, sort lwidth(thick)),
> title("Natural cubic spline: y=a+f(z)+u") xtitle("z")
> ytitle("f(z)") legend(off)


The first panel in figure 14.6 displays a linear relationship for values of z
less than the smallest knot (-1.71) and greater than the largest knot (1.89)
and a nonlinear relationship between these two values.


Figure 14.6. Natural cubic spline and smoothing spline


14.5.3 Smoothing splines


The preceding splines are special cases of linear basis expansions in the
regressor, with


$$f(x) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(x)$$


where $h_m(x)$, $m = 1,\ldots,M$, are M functions of x.


Regression splines and natural splines require specifying the knots.
Smoothing splines use all distinct values of x as knots but then add a
smoothness penalty that penalizes curvature. The function $f(\cdot)$ minimizes


$$\sum_{i=1}^{N} \{y_i - f(x_i)\}^2 + \lambda \int \{f''(t)\}^2\,dt$$


The parameter $\lambda$ determines the amount of smoothing. When $\lambda = 0$, the fitted
$f(x)$ connects the data points, while $\lambda = \infty$ yields OLS.


The gam command (Royston and Ambler 1998), presented in section 27.8,
uses smoothing splines. We implement this for regression of y on z. The
degree of smoothing is determined by the effective degrees of freedom,
which are inversely related to $\lambda$. We set the effective degrees of freedom to
4 to impose similar flexibility to the preceding natural cubic spline regression
that included 4 regressors aside from the intercept. We have


. * Smoothing spline regression of y on z - requires Windows version of Stata
. gam y z, df(4)

200 records merged.

Generalized Additive Model with family gauss, link ident.

Model df     =     5.006                       No. of obs =        200
Deviance     =   1329.55                       Dispersion =    6.81841

           y |     df   Lin. Coef.  Std. Err.        z       Gain    P>Gain
-------------+-------------------------------------------------------------
           z |  4.006    1.718039   .1614613    10.641     70.867    0.0000
       _cons |      1      2.1644     .18464    11.722
Total gain (nonlinearity chisquare) = 70.867 (3.006 df), P = 0.0000

. display "R2 = " 1 - e(disp)*e(tdf) / 2584.6
R2 = .48558808

. twoway (scatter y z) (line GAM_mu z, sort lwidth(thick)),
> title("Smoothing spline: y=a+f(z)+u") xtitle("z")
> ytitle("f(z)") legend(off)


The R2 is essentially the same as that for the natural cubic spline defined in 
section 14.5.2. 


The second panel of figure 14.6 plots the fitted values from the 
smoothing spline regression against z. The fitted curve is very close to that 
from natural cubic spline regression. 


Other basis functions have been proposed, including wavelets and B-splines.
The bspline(#) option of the npregress series command uses B-splines,
and the community-contributed bspline command (Newson 2012)
enables generation of a range of bases of splines.


14.6 Nonparametric regression 


Spline regression or piecewise regression breaks the range of x into a few 
segments, for example, five segments, and fits a separate polynomial 
regression within each segment. 


Nonparametric regression methods similarly perform a number of
separate regressions, but they use many more regressions. Specifically, y is
predicted at many values $x_0$ of x using local regression that places greatest
weight on data points in the neighborhood of $x_0$.


To enable decent prediction when there are few data points, one can base
the local regression on a large neighborhood of $x_0$, one much wider than the
distance between successive evaluation points $x_0$. So many overlapping
weighted regressions are used.


The methods are called nonparametric because under suitable
assumptions, they provide consistent estimates of the conditional mean
function $E(y|x)$ as $N \to \infty$, regardless of the true functional form for
$E(y|x)$.


Plots based on nonparametric regression were presented in section 2.6.6. 
We repeat some of that material here and present further detail. The focus is 
on kernel-weighted polynomial regression for a single regressor. Extension 
to multiple regressors is given in chapter 27. 


14.6.1 Local regression 


Consider the regression model $y = m(x) + u$, where $x$ is a scalar and the
conditional mean function $m(\cdot)$ is not specified. The goal is to estimate
$m(\cdot)$. A local regression estimate of $m(x)$ at $x = x_0$ is a local weighted
average of $y_i$, $i = 1,\ldots,N$, that places great weight on observations for
which $x_i$ is close to $x_0$ and little or no weight on observations for which
$x_i$ is far from $x_0$. Formally,


$$\hat{m}(x_0) = \sum_{i=1}^{N} w(x_i, x_0, h)\, y_i$$


where the weights $w(x_i, x_0, h)$ sum over $i$ to one and decrease as the
distance between $x_i$ and $x_0$ increases. More weight is placed on observations
for which $x_i$ is close to $x_0$ as the parameter $h$, called a bandwidth parameter,
decreases.


The prediction $\hat{m}(x_0)$ is evaluated at a range of values of $x_0$; an obvious
choice is to do so at each distinct value taken by x. It may seem that there
are then too few local data points available to obtain a decent fit. This is
avoided by using weights $w(x_i, x_0, h)$ that average over much more than a
small fraction of the data. Essentially, rolling windows are used to evaluate
$\hat{m}(x_0)$ at a range of values of $x_0$.


14.6.2 Nearest-neighbors regression 


The k-nearest-neighbors estimator uses just the k observations on $y_i$ for
which $x_i$ is closest to $x_0$ and equally weights these k values of $y_i$.
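

A minimal sketch of this calculation at a single evaluation point, using only
official Stata commands, is the following. The data are simulated from an
assumed DGP, and the evaluation point z0 = 0.5 and k = 20 are arbitrary
illustrative choices.

```stata
* Sketch: k-nearest-neighbors prediction at one point
* (simulated data; the DGP, z0, and k are assumptions)
clear
set obs 200
set seed 10101
generate z = rnormal()
generate y = 1 + z^2 + rnormal()
local z0 = 0.5
local k = 20
generate dist = abs(z - `z0')      // distance of each observation from z0
sort dist                          // nearest observations come first
summarize y in 1/`k', meanonly     // equally weighted mean of the k nearest y
display "kNN prediction at z0 = `z0': " r(mean)
```

Repeating the calculation over a grid of evaluation points traces out the
whole fitted curve.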


This estimator can be obtained using the community-contributed knnreg 
command (Salgado-Ugarte, Shimizu, and Taniuchi 1996). 


The k-nearest-neighbors estimator is most often used as a matching 
method for discriminant analysis and for treatment-effects estimation. 


14.6.3 Local polynomial regression 


Kernel regression at $x = x_0$ uses the weight


$$w(x_i, x_0, h) = \frac{K\{(x_i - x_0)/h\}}{\sum_{j=1}^{N} K\{(x_j - x_0)/h\}}$$


where $K(\cdot)$ is a kernel function defined in section 2.6.4 and detailed in
section 27.2.3. For example, the kernel(epan2) option sets
$K(z) = (3/4)(1 - z^2)$ if $|z| < 1$ and $K(z) = 0$ otherwise.


The kernel regression estimate at $x = x_0$ can equivalently be obtained by
minimizing


$$\sum_{i=1}^{N} w(x_i, x_0, h) \times (y_i - \alpha_0)^2$$


which is weighted regression on a constant where the kernel weights are
largest for observations with $x_i$ close to $x_0$. Then $\hat{m}(x_0) = \hat{\alpha}_0$. This
estimator is also called the (kernel-weighted) local constant estimator.


The (kernel-weighted) local linear estimator of $m(\cdot)$ additionally
includes a slope coefficient and at $x = x_0$ minimizes


$$\sum_{i=1}^{N} w(x_i, x_0, h) \times \{y_i - \alpha_0 - \beta_0 (x_i - x_0)\}^2 \tag{14.2}$$


Again, $\hat{m}(x_0) = \hat{\alpha}_0$. This estimator has the advantage of better estimation of
$m(x_0)$ at values of $x_0$ near the endpoints of the range of x because it allows
for any trends near the endpoints.


More generally, the local polynomial estimator of degree p uses a
polynomial of degree p in $(x_i - x_0)$ in (14.2).
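

Because both estimators are weighted least-squares problems, they can be
reproduced at a single point with regress and analytic weights. The sketch
below is illustrative only: the data are simulated from an assumed DGP, and
the evaluation point z0 = 0.5 and bandwidth h = 0.6 are arbitrary choices
rather than values selected by lpoly or npregress.

```stata
* Sketch: local constant and local linear estimates at z0 by weighted OLS
* (simulated data; the DGP, z0, and h are assumptions)
clear
set obs 200
set seed 10101
generate z = rnormal()
generate y = 1 + z^2 + rnormal()
scalar z0 = 0.5
scalar h  = 0.6
generate u  = (z - z0)/h
generate w  = cond(abs(u) < 1, 0.75*(1 - u^2), 0)   // epan2 kernel weights
generate zc = z - z0                                // regressor centered at z0
regress y [aweight = w]        // local constant: _cons estimates m(z0)
scalar m_lc = _b[_cons]
regress y zc [aweight = w]     // local linear: _cons estimates m(z0)
scalar m_ll = _b[_cons]
display "local constant = " m_lc "   local linear = " m_ll
```

The kernel weights need not be normalized to sum to one here, because
rescaling all weights by a common factor leaves the weighted least-squares
coefficients unchanged.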


14.6.4 Local linear regression using lpoly and npregress kernel


The lpoly command was introduced in section 2.6.6. We use it to perform
local linear regression of y on z. The at(z) option is used to obtain $\hat{m}(z)$ at
each sample value of z, and the generate() option is used to store the
predictions in variable yhatlpoly. We use the epan2 kernel with default
bandwidth determined by a plugin formula.


. * Local linear using lpoly
. lpoly y z, degree(1) at(z) generate(yhatlpoly) kernel(epan2)
> title("Local linear using lpoly")


The first panel in figure 14.7 presents a plot of the data and the
predictions $\hat{m}(z)$. The conditional mean function is similar to a quadratic
function.


[Figure panels: "Local linear using lpoly" (kernel = epan2, degree = 1,
bandwidth = .61) and "Local linear using npregress" (local-linear estimates,
kernel = epan2, bandwidth = .3973208)]


Figure 14.7. Local linear regression using lpoly and npregress kernel


An alternative command is the npregress kernel command, which is 
presented in detail in chapter 27. This command fits local constant and local 
linear models. It uses cross-validation (see section 27.2.4) to obtain the 
bandwidth parameter, rather than a plugin formula. 


The npregress kernel command for local linear regression of y on z 
yields the following output: 


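. * Local linear using npregress kernel
. npregress kernel y z, estimator(linear) kernel(epan2) nolog
> vce(bootstrap, seed(10101) reps(400) nodots)

Bandwidth
             |      Mean     Effect
-------------+----------------------
           z |  .3973208    .539756

Local-linear regression                    Number of obs      =        199
Kernel   : epan2                           E(Kernel obs)      =         79
Bandwidth: cross-validation                R-squared          =     0.5057

------------------------------------------------------------------------------
             |   Observed   Bootstrap                          Percentile
           y |   estimate   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Mean         |
           y |   2.165175   .2506887     8.64   0.000     1.671694    2.638386
-------------+----------------------------------------------------------------
Effect       |
           z |   2.009928   .3806551     5.28   0.000     1.286672    2.869827
------------------------------------------------------------------------------
Note: Effect estimates are averages of derivatives.

. npgraph, title("Local linear using npregress")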


The output reports the sample mean of the predictions $\hat{m}(z_i)$. As explained
in section 27.2, fitting a local linear model, rather than a local constant
model, enables one to also estimate the marginal effects $\hat{m}'(z_i)$. The output
also reports the sample mean of the marginal effects $\hat{m}'(z_i)$ and
presents the bandwidths used to estimate $\hat{m}(z_i)$ and $\hat{m}'(z_i)$.


The second panel in figure 14.7, obtained using the npgraph
postestimation command, presents a plot of the data and the predictions
$\hat{m}(z)$. The fitted curve is more variable because the bandwidth used by the
npregress kernel command, one determined by cross-validation, is smaller
than the bandwidth used by the lpoly command, one determined by a plugin
formula. The npregress kernel command used 199 observations rather than
the original 200 observations because local linear methods require at least
two observations within the interval defined by the bandwidth, and this was
not possible for the smallest value of z, as can be seen in the second panel of
the figure.


14.6.5 lowess 


The locally weighted scatterplot smoothing estimator (lowess), which we
introduced in section 2.6.6, is a variation of the local linear estimator.
Lowess uses a variable bandwidth, a tricubic kernel, and downweights
observations with large residuals (using a method that greatly increases the
computational burden).


We obtain predictions from the preceding npregress command and from
the lowess command and compare these with predictions from the lpoly
command and the actual data on y.


. * Compare predicted conditional means from lpoly, npregress, and lowess
. predict yhatnp, mean   // Prediction from npregress
(1 missing value generated)

. lowess y z, generate(yhatlowess)

. sum y yhatnp yhatlpoly yhatlowess

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           y |        200    2.164401    3.604061  -5.468721   14.83116
      yhatnp |        199    2.165175    2.566053   .0537626   14.90781
   yhatlpoly |        200    2.230832    2.538389   .2027314   14.27889
  yhatlowess |        200    2.470852    2.442373   .4760757   13.25927

. correlate y yhatnp yhatlpoly yhatlowess
(obs=199)

             |        y   yhatnp yhatlp~y yhatlo~s
-------------+------------------------------------
           y |   1.0000
      yhatnp |   0.7112   1.0000
   yhatlpoly |   0.7020   0.9962   1.0000
  yhatlowess |   0.6990   0.9920   0.9976   1.0000


The output shows the greater variability in the predictions from npregress
kernel compared with lpoly, due to the smaller bandwidth. All three
predictions are highly correlated with each other. The correlation of the
lpoly predictions with y corresponds to $R^2 = 0.7020^2 = 0.493$, compared
with 0.485 using a natural cubic spline with 5 knots and 0.489 using a
quartic global polynomial.


14.6.6 Series estimation using npregress series 


Models with global polynomials of specified degree were introduced in 
section 14.4, and regression splines with specified knots were introduced in 
section 14.5. 


These models can be fit using the npregress series command, which is 
presented in detail in section 27.3. The polynomial degree and the number of 
knots for splines can be specified, or they can be data determined using 
cross-validation (see section 27.2.4) or information criterion measures. 


The following code fits a fourth-degree polynomial model and a
third-order regression spline with two knots.


. * Examples of the npregress series command with specified degree and knots
. npregress series y z, polynomial(4)
  (output omitted)

. npregress series y z, spline knots(2)
  (output omitted)


More generally, the polynomial degree and the number of knots may be
data determined. The default is to determine these using cross-validation.


. * Examples of the npregress series command with data-determined degree and knots
. npregress series y z, polynomial
  (output omitted)

. npregress series y z, spline
  (output omitted)


14.7 Partially parametric regression 


Extending nonparametric regression to a general k-dimensional regression 
is difficult because of the “curse of dimensionality”, which essentially 
implies that in a high-dimensional regression, we can expect to encounter 
many empty subspaces, regions with no observations, unless the sample 
size is very large. Then we have an insufficient number of observations to
fit a nonparametric regression. The required sample size increases 
exponentially with the dimension, so nonparametric regression is generally 
not practicable for relationships with many covariates. 


The fully nonparametric approach can be used if one has just one or a 
few regressors, which is often the situation in which exploratory regressions 
are used. Chapter 27 presents the npregress command in considerable 
detail when there are multiple regressors. 


Alternatively, one might take a semiparametric approach that applies 
nonparametric methods to part of the model, while other parts of the model 
are parametric. 


Leading examples of semiparametric models are the following. A
partially linear model specifies the conditional mean to be the usual linear
regression function plus an unspecified nonlinear component, so
$E(y|\mathbf{x}, z) = \mathbf{x}'\beta + g(z)$. A single-index model specifies the conditional
mean to be an unspecified function of a linear combination of the
regressors, so $E(y|\mathbf{x}) = g(\mathbf{x}'\beta)$, where the function $g(\cdot)$ is unspecified.
And a generalized additive model specifies the conditional mean function
of a model with K regressors to be the sum of K separate unspecified
functions for each regressor.


Chapter 27 details these semiparametric models and presents 
community-contributed programs that fit these models. 


14.8 Additional resources 


The official Stata fmm prefix for fitting finite mixture regression models is a 
successor to the community-contributed fmm command by Deb (2007), 
which was used in the previous edition of this book. The official Stata 
version fits a more comprehensive suite of models. For example, it can also 
estimate mixtures of lognormal, Student's t, gamma, Poisson, and negative
binomial regressions. 


Regression splines are presented in James et al. (2021, chap. 7.4) and 
Hastie, Tibshirani, and Friedman (2009, chap. 5.2). Standard econometrics 
references for nonparametric and semiparametric regression are Pagan and 
Ullah (1999) and Li and Racine (2007). Chapter 27 provides much greater 
coverage of nonparametric and semiparametric regression. Quantile 
regression, also a semiparametric method, is covered in chapter 15. 


14.9 Exercises 




1. Using the code of section 14.2.3 as a template and setting sample size
   at 1,000, generate a well-separated mixture of t(3)-distributed
   univariate variables with mixing proportions 0.2, 0.3, and 0.5. Use the
   fmm prefix to estimate the mixture parameters.


2. Modify your code for exercise 1 to generate a two-component mixture
   of normals with regression components $(\beta_{11} + \beta_{12}x)$ and
   $(\beta_{21} + \beta_{22}x)$ with values of component-specific parameters, regressor
   x, and $\sigma_1^2$ and $\sigma_2^2$ set at values of your own choosing. Fit the mixture
   model using the fmm prefix. Using the postestimation command
   predict, generate posterior probabilities for each component. Use the
   summarize command to verify that the mean of posterior probabilities
   is close to the estimated mixture proportions.


3. Using the generated data in question 2, estimate a pooled OLS
   regression using the regress command. How are the estimated
   parameters related to those for question 2?


4. Continuing from the previous question, use the regress postestimation
   command predict with option residuals to generate the OLS
   residuals. Use the kdensity command to plot the kernel density of the
   residuals. Check whether the plot corresponds to the DGP of a
   well-separated mixture.


5. Consider piecewise linear regression with one knot at c, with
   $f(x) = \beta_0 + \beta_1 x + \beta_2 (x - c)_+$. Find functions $f_1(x) = a_1 + b_1 x$
   such that $f_1(x) = f(x)$ for $x < c$ and $f_2(x) = a_2 + b_2 x$ such that
   $f_2(x) = f(x)$ for $x > c$ (so express $a_1, b_1$ as functions of $\beta_0, \beta_1, \beta_2$
   and similarly for $a_2, b_2$). Then, show $f_1(c) = f_2(c)$ so the function is
   continuous at c.


6. For simplicity, consider a quadratic regression spline, rather than a
   cubic spline, with one knot. Then
   $f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 (x - c)_+^2$. Show that this formula yields
   distinct quadratic polynomials for $x < c$ and $x > c$ with $f(c^-) = f(c)$
   and $f'(c^-) = f'(c)$. First, find $f_1(x) = a_1 + b_1 x + c_1 x^2$ such that
   $f_1(x) = f(x)$ for $x < c$ and $f_2(x) = a_2 + b_2 x + c_2 x^2$ such that
   $f_2(x) = f(x)$ for $x > c$ (so express $a_1, b_1, c_1$ as functions of
   $\beta_0, \beta_1, \beta_2, \beta_3$ and similarly for $a_2, b_2, c_2$). Then, show $f_1(c) = f_2(c)$
   and $f_1'(c) = f_2'(c)$.


7. Use mus214mabelfmm.dta. Obtain a graph with a scatterplot of
   logyearn against logyhrs and a fitted linear spline where the spline
   knots are created using command mkspline with option pctile and
   the number of knots is set to 5. Does the relationship appear to be
   linear? Obtain a similar graph for a fitted cubic spline with knots
   created using command mkspline with options cubic and nknots(5).
   Comment on any difference in the fitted curves. Are the same knots
   used? Hint: Use option displayknots of command mkspline.


8. Use mus214mabelfmm.dta, and create a subsample using commands
   set seed 101 and keep if runiform() < 0.02. Provide local linear
   nonparametric graphs with the epan2 kernel of logyearn against
   logyhrs using the lpoly command and using the npregress kernel
   and npgraph commands. Does the relationship appear to be linear?


9. Continue with the previous question. Use the bwidth() option of the
   lpoly command to set the bandwidth to be equal to that used by the
   npregress kernel command. Do the plots appear to be the same? Are
   the predictions from lpoly and npregress kernel identical?


Chapter 15 
Quantile regression 


15.1 Introduction 


The standard linear regression is a useful tool for summarizing the average 
relationship between the outcome variable of interest and a set of 
regressors, based on the conditional mean regression function E(y|x). This 
provides only a partial view of the relationship. A more complete picture 
would provide information about the relationship between the outcome y 
and the regressors x at different points in the conditional distribution of y. 
Quantile regression (QR) is a statistical tool for building just such a picture. 


Quantiles and percentiles are synonymous: the 0.99 quantile is the 99th
percentile. The median, defined as the middle value of a set of ranked data,
is the best-known specific quantile. The sample median is an estimator of
the population median. If $F(y) = \Pr(Y \le y)$ defines the cumulative
distribution function (c.d.f.), then $F(y_{\rm med}) = 1/2$ is the equation whose
solution defines the median $y_{\rm med} = F^{-1}(1/2)$, or more precisely, the
infimum of $F^{-1}(1/2)$ to ensure uniqueness. The quantile $y_q$, $q \in (0, 1)$, is
defined as that value of y that splits the data into the proportions q below
and 1 - q above; that is, $F(y_q) = q$ and $y_q = F^{-1}(q)$. For example, if
$y_{0.99} = 200$, then $\Pr(Y \le 200) = 0.99$. In practice, $F(y)$ may have flat
sections and jumps, so a more precise statement is that
$y_q = \inf\{y \in \mathbb{R} : q \le F(y)\}$.


These concepts extend to the conditional quantile regression (CQR)
function, denoted as $Q_q(y|\mathbf{x})$, where the function $Q_q(y|\mathbf{x})$ is usually
specified to be linear in parameters.


CQRs have considerable appeal for several reasons. Median regression,
also called least-absolute-deviations regression, is a special case of CQR and
is more robust to outliers than is mean regression; see section 3.6.5. CQR, as 
we shall see, permits us to study the impact of regressors on both the 
location and scale parameters of the model, thereby allowing a richer 
understanding of the data. The approach is semiparametric in the sense that 
it avoids assumptions about the parametric distribution of regression errors. 
It is also robust with respect to departures from normality. 


QR is increasingly used to study distributional issues, such as the impact 
of a policy treatment on various quantiles of the distribution of the outcome 
variable. For example, one may be interested in the effect of training on 
earnings at various points of the earnings distribution while also controlling 
for individual characteristics, an example of heterogeneous treatment 
effects. Furthermore, interest may lie in the impact on unconditional 
quantiles, in addition to the impact on conditional quantiles, and alternative 
unconditional QR methods are used. And in many applications, the policy 
treatment may be endogenous. 


This chapter explores the application of CQR using several Stata 
commands. QR with an endogenous regressor and unconditional QR, more 
advanced topics that often use the treatment evaluation framework, are 
presented in chapter 25. 


15.2 Conditional quantile regression 


In this section, we review the theoretical background of CQR analysis. 


Let $e_i$ denote the model prediction error. Then ordinary least squares
(OLS) minimizes $\sum_i e_i^2$, and median regression minimizes $\sum_i |e_i|$. As
explained below, conditional QR at the qth quantile minimizes a sum that
gives the asymmetric penalties $(1 - q)|e_i|$ for overprediction and $q|e_i|$ for
underprediction. Linear programming methods need to be used to obtain the
conditional QR estimator, but it is still asymptotically normally distributed
and is easily obtained using Stata commands.


15.2.1 Conditional quantiles 


Many applied econometrics studies model conditional moments, especially
the conditional mean function. Suppose that the main objective of modeling
is the conditional prediction of y given $\mathbf{x}$. Let $\hat{y}(\mathbf{x})$ denote the predictor
function and $e(\mathbf{x}) = y - \hat{y}(\mathbf{x})$ denote the prediction error. Then,


$$L\{e(\mathbf{x})\} = L\{y - \hat{y}(\mathbf{x})\}$$


denotes the loss associated with the prediction error e. The optimal
loss-minimizing predictor depends upon the function $L(\cdot)$. If $L(e) = e^2$, then the
conditional mean function, $E(y|\mathbf{x}) = \mathbf{x}'\beta$ in the linear case, is the optimal
predictor. If the loss criterion is absolute error loss, so $L(e) = |e|$, then the
optimal predictor is the conditional median, denoted by med$(y|\mathbf{x})$. If the
conditional median function is linear, so that med$(y|\mathbf{x}) = \mathbf{x}'\beta$, then the
optimal predictor is $\hat{y} = \mathbf{x}'\hat{\beta}$, where $\hat{\beta}$ is the least-absolute-deviations
estimator that minimizes $\sum_i |y_i - \mathbf{x}_i'\beta|$.


Both the squared-error and absolute-error loss functions are symmetric,
which implies that the same penalty is imposed for prediction error of a
given magnitude regardless of the direction of the prediction error.
Asymmetric absolute-error loss instead penalizes overprediction and
underprediction at different rates, governed by an asymmetry parameter q
that is specified. It lies in the interval (0, 1) with
symmetry when q = 0.5 and increasing asymmetry as q approaches 0 or 1.
Then the optimal predictor is the qth conditional quantile, denoted by
$Q_q(y|\mathbf{x})$, and the conditional median is a special case when q = 0.5. CQR
involves inference regarding the conditional quantile function.


Standard CQR analysis assumes that the conditional QR $Q_q(y|\mathbf{x})$ is linear
in $\mathbf{x}$; for nonparametric QR, see Koenker (2005).


Quite apart from the considerations of loss function (on which
agreement may be difficult to obtain), there are several attractive features of
CQR. First, unlike the OLS regression, which is sensitive to the presence of
outliers and can be inefficient when the dependent variable has a highly 
nonnormal distribution, the conditional QR estimates are more robust. 
Second, QR also provides a potentially richer characterization of the data. 
For example, QR allows us to study the impact of a covariate on the full 
distribution or any particular percentile of the distribution, not just the 
conditional mean. Third, unlike OLS, QR estimators do not require existence 
of the conditional mean for consistency. Finally, conditional QR is 
equivariant to monotone transformations. This means that the quantiles of a
transformed variable y, denoted by $h(y)$, where $h(\cdot)$ is a monotonic
function, equal the transforms of the quantiles of y, so
$Q_q\{h(y)\} = h\{Q_q(y)\}$. Hence, if the quantile model is expressed as $h(y)$,
for example, $\ln y$, then one can use the inverse transformation to translate
the results back to y. This is not possible for the mean, because
$E\{h(y)\} \ne h\{E(y)\}$. The equivariance property for quantiles continues to
hold in the regression context, assuming that the conditional quantile model
is correctly specified; see section 15.3.4.
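

The equivariance property is easy to check numerically in the unconditional
case. The following is a minimal sketch using simulated data; the lognormal
DGP is an assumption, and the sample size of 199 is chosen only so that the
75th percentile is a single order statistic rather than an average of two
adjacent values.

```stata
* Sketch: quantiles are equivariant to monotone transformations; means are not
* (simulated data; the lognormal DGP and N = 199 are assumptions)
clear
set obs 199
set seed 10101
generate double y   = exp(rnormal())
generate double lny = ln(y)
_pctile y, percentiles(75)
scalar q_y = r(r1)
_pctile lny, percentiles(75)
scalar q_lny = r(r1)
display "ln(Q.75 of y) = " ln(q_y) "   Q.75 of ln(y) = " q_lny
summarize y, meanonly
scalar mean_y = r(mean)
summarize lny, meanonly
display "ln(mean of y) = " ln(mean_y) "   mean of ln(y) = " r(mean)
```

The two quantile-based numbers coincide, whereas the log of the mean exceeds
the mean of the logs, as Jensen's inequality implies for a concave
transformation.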


15.2.2 Computation of conditional QR estimates 


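In the nonregression case, the qth sample quantile $\hat{y}_q = \hat{F}^{-1}(q)$ can be
obtained as the solution to minimizing over $\beta$ the objective function


$$\sum_{i:\,y_i \ge \beta}^{N} q\,|y_i - \beta| + \sum_{i:\,y_i < \beta}^{N} (1 - q)\,|y_i - \beta|
  = \sum_{i=1}^{N} \{q - 1(y_i < \beta)\}(y_i - \beta)
  = \sum_{i=1}^{N} \rho_q(y_i - \beta) \tag{15.1}$$


where


$$\rho_q(u) = u\{q - 1(u < 0)\} \tag{15.2}$$


is called the check function. In practice, a range of values can solve this, in
which case $\hat{y}_q$ is the smallest value because a c.d.f. is defined to be right
continuous.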
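

As a small worked illustration of (15.2), take q = 0.9 (an arbitrary choice)
and errors of equal magnitude but opposite sign:

```latex
\rho_{0.9}(2)  = 2\,\{0.9 - 0\} = 1.8 , \qquad
\rho_{0.9}(-2) = -2\,\{0.9 - 1\} = 0.2 ,
\qquad\text{and in general}\qquad
\rho_q(u) =
\begin{cases}
  q\,|u|       & u \ge 0 \ \text{(underprediction)} \\
  (1 - q)\,|u| & u < 0 \ \text{(overprediction)}
\end{cases}
```

So at q = 0.9, underpredicting by two units costs nine times as much as
overpredicting by two units, which pushes the minimizer up toward the upper
tail of the distribution.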


For regression, by extension of the first equation in (15.1), the qth CQR
estimator $\hat{\beta}_q$ minimizes over $\beta_q$ the objective function


$$Q(\beta_q) = \sum_{i:\,y_i \ge \mathbf{x}_i'\beta_q}^{N} q\,|y_i - \mathbf{x}_i'\beta_q|
  + \sum_{i:\,y_i < \mathbf{x}_i'\beta_q}^{N} (1 - q)\,|y_i - \mathbf{x}_i'\beta_q| \tag{15.3}$$


where 0 < q < 1, and we use $\beta_q$ rather than $\beta$ to make clear that different
choices of q estimate different values of $\beta$. If q = 0.9, for example, then
much more weight is placed on prediction for observations with $y \ge \mathbf{x}'\beta_{0.9}$
than for observations with $y < \mathbf{x}'\beta_{0.9}$. Often, estimation sets q = 0.5,
giving the least-absolute-deviations estimator that minimizes


$$\sum_{i=1}^{N} |y_i - \mathbf{x}_i'\beta_{0.5}|$$


Using the third equation in (15.1), we more simply write the objective
function (15.3) as


$$Q(\beta_q) = \sum_{i=1}^{N} \rho_q(y_i - \mathbf{x}_i'\beta_q)$$


where $\rho_q(u)$ is the check function defined in (15.2). The objective function
is not differentiable, so the usual gradient optimization methods cannot be
applied. Instead, the minimization is a linear program. The classic solution
method is the simplex method, which is guaranteed to yield a solution in a
finite number of simplex iterations.


15.2.3 Computation of CQR standard errors 


The estimator that minimizes $Q(\beta_q)$ is an m estimator with
well-established asymptotic properties. The CQR estimator is asymptotically
normal under general conditions; see Cameron and Trivedi (2005, 88).


Under the assumption of independent heteroskedastic errors, it can be
shown that


$$\hat{\beta}_q \overset{a}{\sim} N(\beta_q, A^{-1}BA^{-1}) \tag{15.4}$$


where $A = E\{f_{u_q}(0|\mathbf{x}_i)\mathbf{x}_i\mathbf{x}_i'\}$, $B = E\{q(1 - q)\mathbf{x}_i\mathbf{x}_i'\}$, and $f_{u_q}(0|\mathbf{x}_i)$ is
the conditional density of the error term $u_{qi} = y_i - \mathbf{x}_i'\beta_q$ evaluated at
$u_{qi} = 0$. This analytical expression involves $f_{u_q}(0|\mathbf{x}_i)$, a quantity that can
be estimated in several ways. The Stata default uses a function of the fitted
values; an alternative method uses a kernel-density estimate. Other methods
for obtaining heteroskedastic-robust standard errors are to use a paired
bootstrap (see sections 3.4.7 and 12.3.1) or to use White
heteroskedastic-robust standard errors.


The result (15.4) has been extended to the clustered case by Parente and 
Santos Silva (2016). Alternatively, one can use a cluster pairs bootstrap (see 
section 12.3.5). With few clusters, a better method uses a wild gradient 
bootstrap proposed by Hagemann (2020). 


15.2.4 The qreg, bsqreg, sqreg, and qreg2 commands 


The Stata commands for CQR estimation are similar to those for ordinary 
regression. Three variants are commonly used: qreg, bsqreg, and sqreg. The 
first two are used for estimating a CQR for a specified value of q, without 
or with bootstrap standard errors, respectively. The sqreg command is used 
when several different values of q are specified simultaneously. A fourth 
command, used less frequently, is iqreg, for interquantile regression. 


The basic QR command is qreg, with the following syntax: 

qreg depvar [indepvars] [if] [in] [weight] [, options] 


A simple example with the qreg options set to default is qreg y x z. This 
will estimate the median regression, $y_{med} = \beta_1 + \beta_2 x + \beta_3 z$; 
that is, the default q is 0.5. The quantile() option allows one to choose q. 
For example, qreg y x z, quantile(.75) sets q = 0.75. Other options include 
level(#) to set the level for reported confidence intervals and an 
optimization-related option. 


The default standard errors assume independent and identically 
distributed (i.i.d.) errors, in which case $f_{u_q}(0|x_i) = f_{u_q}(0)$ and 
the variance matrix in (15.4) reduces to 
$\{q(1-q)/f_{u_q}^2(0)\}\{E(x_i x_i')\}^{-1}$. Various options can be used 
to change the method used to compute $f_{u_q}(0)$. The vce(robust) option for 
qreg obtains White heteroskedastic-robust standard errors. 


bsqreg obtains heteroskedastic-robust standard errors by instead using 
a pairs bootstrap. The standard errors from bsqreg are heteroskedastic 
robust in the same sense as those from vce(robust) for other commands. 
The command syntax is the same as for qreg. A key option is reps(#), 
which sets the number of bootstrap replications. This option should be used 
because the default is only 20. And for replicability of results, one should 
first issue the set seed command. For example, give the commands set 
seed 10101 and bsqreg y x z, reps(400) quantile(.75). 


The iqreg command for interquantile regression reports the difference 
in coefficient estimates from separate CQR estimation at two different values 
of q. 
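
A minimal sketch of iqreg follows, using the auto dataset shipped with Stata as a stand-in for real data; the choice of regressors and of 100 replications is purely illustrative. Because iqreg standard errors come from the bootstrap, the seed is set first.

```stata
* Sketch: interquantile-range regression, q = .75 versus q = .25
sysuse auto, clear               // example data shipped with Stata
set seed 10101                   // bootstrap standard errors need a seed
iqreg price weight length, quantiles(.25 .75) reps(100)
```

Each reported coefficient is the difference between the q = 0.75 and q = 0.25 coefficients of that regressor, so a coefficient far from zero suggests that the regressor's effect differs across the two quantiles.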


When QR estimates are obtained for several values of q, and we want to 
test whether regression coefficients for different values of q differ, the 
sqreg command is used. This provides coefficient estimates and an estimate 
of the simultaneous or joint variance-covariance matrix of the estimator of 
$\beta_q$ across different specified values of q, using the bootstrap. The 
command syntax is again the same as qreg and bsqreg, and several quantiles 
can now be specified in the quantile() option. For example, sqreg y x z, 
quantile(.2 .5 .8) reps(400) produces QR estimates for q = 0.2, q = 0.5, and 
q = 0.8, together with bootstrap standard errors based on 400 replications. 


The qreg2 command by Machado, Parente, and Santos Silva (2011) 
obtains heteroskedastic-robust standard errors using the analytical formula 
(15.4). 


The cluster() option of the qreg2 command computes the cluster 
extension of formula (15.4) due to Parente and Santos Silva (2016). 
Alternatively, cluster-robust standard errors can be obtained by manually 
implementing a bootstrap by applying the bootstrap prefix, with option 
cluster(), to the qreg estimation command. 


15.3 CQR for medical expenditures data 


We present the basic QR commands applied to the log of medical 
expenditures. 


15.3.1 Data summary 


The data used in this example come from the Medical Expenditure Panel 
Survey and are identical to those discussed in section 3.2. Again, we 
consider a regression model of total medical expenditure by the Medicare 
elderly. The dependent variable is ltotexp, so observations with zero 
expenditures are omitted. The explanatory variables are an indicator for 
supplementary private insurance (suppins), one health-status variable 
(totchr), and three sociodemographic variables (age, female, white). 


We first summarize the data: 


. * Read in log of medical expenditures data and summarize 
. qui use mus203mepsmedexp 
. drop if ltotexp ==. 
(109 observations deleted) 

. summarize ltotexp suppins totchr age female white, separator(0) 

    Variable |        Obs        Mean    Std. dev.        Min        Max 
-------------+--------------------------------------------------------- 
     ltotexp |      2,955    8.059866    1.367592   1.098612   11.74094 
     suppins |      2,955    .5915398    .4916322          0          1 
      totchr |      2,955    1.808799    1.294613          0          7 
         age |      2,955    74.24535    6.375975         65         90 
      female |      2,955    .5840948    .4929608          0          1 
       white |      2,955    .9736041    .1603368          0          1 


The major quantiles of ltotexp can be obtained by using summarize, 
detail, and specific quantiles can be obtained by using the centile 
command. The same estimates for a specific quantile can also be obtained by 
QR on an intercept only. We have 


. * Intercept-only quantile regression gives the raw quantile 
. centile ltotexp, centile(25) 

                                                        Binom. interp. 
    Variable |       Obs  Percentile    Centile     [95% conf. interval] 
-------------+---------------------------------------------------------- 
     ltotexp |     2,955         25    7.267525     7.200001    7.339215 

. qreg ltotexp, quantile(.25) nolog 

.25 Quantile regression                          Number of obs =     2,955 
  Raw sum of deviations 1293.577 (about 7.2675252) 
  Min sum of deviations 1293.577                 Pseudo R2     =    0.0000 

     ltotexp | Coefficient  Std. err.      t    P>|t|   [95% conf. interval] 
-------------+-------------------------------------------------------------- 
       _cons |   7.267525   .0341235   212.98   0.000   7.200617    7.334433 


The community-contributed qplot command due to Cox (2005) is a 
useful visual aid because it provides a plot of all quantiles. We have 


. * Quantile plot for ltotexp using the community-contributed command qplot 
. qplot ltotexp, recast(line) scale(1.5) ytitle("Quantiles of ln(totexp)") 
> xtitle("Fraction of the data") 


The plot, shown in figure 15.1, is the same as a plot of the empirical c.d.f. of 
ltotexp, except that the axes are reversed. We have, very approximately, 
$q_{0.1} = 6$, $q_{0.25} = 7$, $q_{0.5} = 8$, $q_{0.75} = 9$, and $q_{0.9} = 10$. 
The distribution appears to be reasonably symmetric, at least for 
0.05 < q < 0.95. 

Figure 15.1. Quantiles of the dependent variable 


15.3.2 Conditional QR estimates 


The basic CQR output for the median regression, with standard errors 
computed using the default option, is obtained using the qreg command with 
q = 0.5. 


. * Basic quantile regression for q = 0.5 
. qreg ltotexp suppins totchr age female white 
Iteration 1:  WLS sum of weighted deviations =  1400.8169 

Iteration 1:  sum of abs. weighted deviations =  1400.9985 
Iteration 2:  sum of abs. weighted deviations =   1399.797 
Iteration 3:  sum of abs. weighted deviations =  1399.7529 
Iteration 4:  sum of abs. weighted deviations =  1399.5861 
Iteration 5:  sum of abs. weighted deviations =  1398.9092 
Iteration 6:  sum of abs. weighted deviations =  1398.8274 
note: alternate solutions exist. 
Iteration 7:  sum of abs. weighted deviations =  1398.5229 
Iteration 8:  sum of abs. weighted deviations =   1398.522 
Iteration 9:  sum of abs. weighted deviations =  1398.5155 
Iteration 10: sum of abs. weighted deviations =  1398.5105 
Iteration 11: sum of abs. weighted deviations =  1398.5067 
Iteration 12: sum of abs. weighted deviations =  1398.4992 
Iteration 13: sum of abs. weighted deviations =   1398.498 
Iteration 14: sum of abs. weighted deviations =  1398.4951 
Iteration 15: sum of abs. weighted deviations =  1398.4945 
Iteration 16: sum of abs. weighted deviations =  1398.4916 

Median regression                                Number of obs =     2,955 
  Raw sum of deviations  1555.48 (about 8.111928) 
  Min sum of deviations 1398.492                 Pseudo R2     =    0.1009 

     ltotexp | Coefficient  Std. err.      t    P>|t|   [95% conf. interval] 
-------------+-------------------------------------------------------------- 
     suppins |   .2769771   .0535936     5.17   0.000   .1718924    .3820617 
      totchr |   .3942664   .0202472    19.47   0.000   .3545663    .4339664 
         age |   .0148666   .0041479     3.58   0.000   .0067335    .0229996 
      female |  -.0880967   .0532006    -1.66   0.098  -.1924109    .0162175 
       white |   .4987457   .1630984     3.06   0.002   .1789474     .818544 
       _cons |   5.648891    .341166    16.56   0.000   4.979943    6.317838 


The iterations here refer to simplex iterations rather than the usual 
Newton-Raphson (or related gradient-method) iterations. All regressors, aside 
from female, are highly statistically significant with expected signs. 


15.3.3 Interpretation of conditional quantile coefficients 


The key advantage of CQR is that it can accommodate heterogeneous 
treatment effects, whereby a change in a specific regressor can have different 
effects at different quantiles of the conditional distribution of the outcome of 
interest. 


For example, one more year of schooling may have a larger effect for 
someone at a low conditional quantile of earnings than for someone with 
high predicted earnings given his or her various measured characteristics. 


Such heterogeneous effects are likely even if the conditional quantile 
function is linear. To see this, consider a single regressor, for simplicity, with 
linear model 


$$y_i = \beta_1 + \beta_2 x_i + u_i \tag{15.5}$$


where the error $u_i$ satisfies $E(u_i|x_i) = 0$. We denote the qth conditional 
quantile function of y given x as $Q_q(y|x)$, analogous to using $E(y|x)$ to 
denote the conditional mean. Conditioning on $x_i$, (15.5) implies that 


$$Q_q(y_i|x_i) = \beta_1 + \beta_2 x_i + F^{-1}_{u_i|x_i}(q)$$


where $F_{u_i|x_i}$ is the conditional distribution function of $u_i|x_i$. 
Conditional on $x_i$, the quantile depends on the distribution of $u_i|x_i$ 
via the term $F^{-1}_{u_i|x_i}(q)$. This will depend on $x_i$ if, for example, 
errors are heteroskedastic. Then, in general, $Q_q(y|x)$ at different values 
of q will differ in more than just the intercept and may well even be 
nonlinear in x. 


When are the quantile effects homogeneous? In the special case that 
errors are i.i.d., considerable simplification occurs as 
$F^{-1}_{u_i|x_i}(q) = F^{-1}_u(q)$, which does not vary with $x_i$. Then the 
conditional quantile is 


$$Q_q(y_i|x_i) = \{\beta_1 + F^{-1}_u(q)\} + \beta_2 x_i$$


Here the conditional quantile functions at different quantiles have a common 
slope, differing only in the intercepts $\beta_1 + F^{-1}_u(q)$. In such a 
simple case, there is no need to use CQR to obtain marginal effects (MEs) at 
different quantiles because the quantile slope coefficient $\beta_2$ does not 
vary with the quantile. 
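
The homogeneous case can be illustrated with a small simulation. This is only a sketch under assumed values: the sample size, the error standard deviation of 2, and the variable names are all hypothetical choices, not taken from the text. With i.i.d. normal errors, the slope should be near 1 at every quantile, while the intercepts should be near 1 + 2 x invnormal(q).

```stata
* Sketch: i.i.d. errors imply a common slope across quantiles
clear
set seed 10101
set obs 2000
generate x = rchi2(1)
generate u = rnormal(0, 2)        // i.i.d. error, independent of x
generate y = 1 + 1*x + u
sqreg y x, quantiles(.25 .50 .75) reps(100) nodots
* Wald test that the slope on x is the same at the three quantiles
test [q25=q50=q75]: x
```

In a DGP of this kind, the test of slope equality would be expected to reject only at roughly its nominal rate across repeated samples.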


More generally, we condition on many individual characteristics. Thus, 
we may be interested in the heterogeneous effect of schooling at different 
conditional quantiles where conditioning is on many observed characteristics 
and not just on schooling. 


In the multiple regression model, the standard linear conditional quantile 
function is 


$$Q_q(y_i|x_i) = x_i'\beta_q$$


The MEs after CQR can be obtained in the usual way. For the jth (continuous) 
regressor, the ME is 


$$\frac{\partial Q_q(y|x)}{\partial x_j} = \beta_{qj}$$


and in the discrete case is 
$Q_q(y|x + \Delta x_j) - Q_q(y|x) = \beta_{qj} \times \Delta x_j$. As for 
linear least-squares regression, the ME at a given conditional quantile is given 
by the slope coefficient and is invariant across individuals, simplifying 
analysis. However, the interpretation is somewhat delicate for discrete 
changes that are more than infinitesimal because the partial derivative 
measures the impact of a change in $x_j$ under the assumption that the 
individual remains in the same conditional quantile of the distribution after 
the change. For larger changes in a regressor, the individual may shift into a 
different conditional quantile. 


The preceding estimates are that the conditional median of ltotexp, 
where conditioning is on suppins, totchr, age, female, and white, 
increases by 0.277 for an individual with supplementary insurance, increases 
by 0.394 with each additional chronic condition, increases by 0.015 with 
each year of aging, and so on. 


The Stata QR commands are for models that are linear in parameters but 
not necessarily linear in variables. For example, we could include as 
regressors a quadratic in age as c.age##c.age, using factor notation, and 
obtain the ME of age after CQR using the postestimation command margins, 
dydx(*). 
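
A minimal sketch of this approach follows. It uses the auto dataset shipped with Stata, with a quadratic in weight standing in for the quadratic in age; the evaluation points in at() are arbitrary illustrative values.

```stata
* Sketch: ME of a regressor entering as a quadratic, after median regression
sysuse auto, clear                 // example data shipped with Stata
quietly qreg price c.weight##c.weight length, quantile(.5)
* Average ME of weight on the conditional median of price
margins, dydx(weight)
* ME of weight at selected values, which differ because of the quadratic term
margins, dydx(weight) at(weight=(2000 3000 4000))
```

Because the model is specified with factor-variable notation, margins accounts for the squared term automatically when differentiating the linear prediction.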


15.3.4 Retransformation 


For our example, with the dependent variable ltotexp = ln(totexp), the 
results from qreg give MEs for ln(totexp). We may want instead to compute 
the ME on totexp, not ltotexp. 


The equivariance property of QR is relevant. Given 
$Q_q(\ln y|x) = x'\beta_q$, we have 
$Q_q(y|x) = \exp\{Q_q(\ln y|x)\} = \exp(x'\beta_q)$. The ME on y in levels, 
given QR model $x'\beta_q$ in logs, is then 


$$\frac{\partial Q_q(y|x)}{\partial x_j} = \exp(x'\beta_q)\beta_{qj}$$


This depends on x. The average marginal effect (AME) is 
$\{N^{-1}\sum_{i=1}^{N} \exp(x_i'\widehat{\beta}_q)\}\widehat{\beta}_{qj}$ and 
can be estimated if we use the predict postestimation command to obtain 
$\exp(x_i'\widehat{\beta}_q)$ and then average. We obtain 


. * Obtain multiplier to convert QR coeffs in logs to AME in levels. 
. qui predict xb 

. generate expxb = exp(xb) 

. qui summarize expxb 

. display "Multiplier of QR in logs coeffs to get AME in levels = " r(mean) 
Multiplier of QR in logs coeffs to get AME in levels = 3746.7178 


For example, the AME of totchr on ln(totexp) is 0.3943 from the earlier 
qreg output above. The implied AME of totchr on the levels variable is 
therefore 3746.7 × 0.3943 = 1477. One more chronic condition increases 
the conditional median of expenditures by $1,477. 


The equivariance property that $Q_q(y|x) = \exp\{Q_q(\ln y|x)\}$ is exact 
only if the conditional quantile function is correctly specified. This is 
unlikely to be the case because the linear model will inevitably be only an 
approximation. One case where the linear model is exact is where all 
regressors are discrete and we specify a fully saturated model with indicator 
variables as regressors that exhaust all possible interactions between discrete 
regressors. We pursue this in the second end-of-chapter exercise. 


15.3.5 Comparison of estimates at different quantiles 


CQRs are usually performed at different quantiles. Here we do so for the 
quartiles q = 0.25, 0.50, and 0.75 and compare the results with one another 
and with OLS estimates. White heteroskedastic-robust standard errors are 
reported. We obtain 


. * Compare (1) OLS; (2-4) coeffs across quantiles .25, .50, and .75 
. qui regress ltotexp suppins totchr age female white, vce(robust) 


. estimates store OLS 

. qui qreg ltotexp suppins totchr age female white, quantile(.25) vce(robust) 
. estimates store QR_25 

. qui qreg ltotexp suppins totchr age female white, quantile(.50) vce(robust) 
. estimates store QR_50 

. qui qreg ltotexp suppins totchr age female white, quantile(.75) vce(robust) 
. estimates store QR_75 

. estimates table OLS QR_25 QR_50 QR_75, b(%7.3f) se 


Variable OLS QR_25 QR_50 QR_75 
suppins 0.257 0.386 0.277 0.149 
0.047 0.060 0.053 0.062 
totchr 0.445 0.459 0.394 0.374 
0.017 0.018 0.018 0.023 
age 0.013 0.016 0.015 0.018 
0.004 0.004 0.004 0.005 

female -0.077 -0.016 -0.088 -0.122 
0.046 0.053 0.054 0.061 
white 0.318 0.338 0.499 0.193 
0.136 0.097 0.193 0.257 
_cons 5.898 4.748 5.649 6.600 
0.294 0.307 0.352 0.427 


Legend: b/se 


OLS coefficients differ considerably from the CQR coefficients, even those for 
median regression. 


The CQR coefficients vary across quantiles. Most noticeably, the highly 
statistically significant regressor suppins (supplementary insurance) has a 
much greater impact at the lower conditional quantile. Note that this implies 
that suppins has a larger effect for those with log expenditure considerably 
less than that predicted given their individual characteristics. It does not 
necessarily imply that suppins has a larger effect for those with lower log 
expenditures. For example, the lower conditional quartile might actually be 
composed of people with high expenditures. 


The standard errors are smaller for median regression (q = 0.50) than 
for the upper and lower quantiles (q = 0.25, 0.75), reflecting more precision 
at the center of the distribution. 


15.3.6 Robust variance estimation using the qreg2 command 


The community-contributed qreg2 command by Machado, Parente, and 
Santos Silva (2011) obtains heteroskedastic-robust standard errors using a 
consistent estimator of the analytical formula (15.4) and also obtains 
cluster-robust standard errors using the cluster extension of formula (15.4) 
due to Parente and Santos Silva (2016). It additionally provides a formal 
test of heteroskedasticity. 


The command has a format similar to the qreg command. The following 
provides heteroskedastic-robust standard errors following median 
regression. 


. * qreg2 provides heteroskedastic-robust standard errors and also tests for 
> heteroskedasticity 
. qreg2 ltotexp suppins totchr age female white, quantile(0.5) 

Median regression 
R-squared          = .1954566 
Number of obs      = 2955 
Objective function = .47326279 

Heteroskedasticity robust standard errors 

     ltotexp | Coefficient  Std. err.      t    P>|t|   [95% conf. interval] 
-------------+-------------------------------------------------------------- 
     suppins |   .2769771   .0541489     5.12   0.000   .1708037    .3831505 
      totchr |   .3942664   .0202004    19.52   0.000    .354658    .4338748 
         age |   .0148666   .0040878     3.64   0.000   .0068513    .0228819 
      female |  -.0880967   .0532428    -1.65   0.098  -.1924936    .0163002 
       white |   .4987457    .194073     2.57   0.010   .1182135    .8792779 
       _cons |   5.648891   .3637387    15.53   0.000   4.935683    6.362098 

Machado-Santos Silva test for heteroskedasticity 
  Ho: Constant variance 
  Variables: Fitted values of ltotexp and its squares 

       chi2(2)     =  56.274 
       Prob > chi2 =   0.000 


The point estimates are identical to those from qreg in section 15.3.2, and 
the robust standard errors are very similar. 


The output includes a test for heteroskedasticity that strongly rejects the 
null hypothesis of homoskedastic model errors, so that inference needs to be 
based on the computationally more intensive heteroskedastic-robust 
standard errors. The test is against heteroskedasticity of the form 
$\mathrm{Var}(u|x) = h(\alpha_1 + z'\alpha_2)$, where the command default is 
to set $z_i = (\widehat{y}_i, \widehat{y}_i^2)$, where 
$\widehat{y}_i = x_i'\widehat{\beta}_q$. 


The cluster() option of the qreg2 command provides cluster-robust 
standard errors and provides a formal test of the null hypothesis of 
no intracluster correlation. 


15.3.7 Comparison of different standard-error estimates 


There are several different methods for computing QR standard errors. In 
general, it is best to use heteroskedastic-robust standard errors with 
independent data and cluster-robust standard errors with clustered data (with 
many clusters). The analytical methods to obtain these robust standard errors 
involve complications such as kernel-density estimation at zero, while 
bootstrap methods are more computationally intensive. 


The following median regression example presents default qreg standard 
errors that assume i.i.d. errors, followed by three different methods for 
computing heteroskedastic-robust standard errors, followed by two methods 
for computing cluster-robust standard errors where the clustering is on 
educyr. 


. * Median regression: i.i.d., heteroskedastic-robust, and cluster--robust 
> standard errors 
. qui qreg ltotexp suppins totchr age female white, quantile(.50) 


. estimates store IID 

. qui qreg ltotexp suppins totchr age female white, quantile(.50) vce(robust) 

. estimates store HETWHITE 

. qui qreg2 ltotexp suppins totchr age female white, quantile(.50) 

. estimates store HETANAL 

. set seed 10101 

. qui bsqreg ltotexp suppins totchr age female white, quant(.50) reps(400) 

. estimates store HETBOOT 

. qui qreg2 ltotexp suppins totchr age female white, quant(.50) cluster(educyr) 
. estimates store CLUANAL 


. qui bootstrap, cluster(educyr) seed(10101) reps(400): 
> qreg ltotexp suppins totchr age female white, quant(.50) 


. estimates store CLUBOOT 
. estimates table IID HETWHITE HETANAL HETBOOT CLUANAL CLUBOOT, b(%7.3f) se 


Variable IID HETWH~E HETANAL  HETBOOT CLUANAL CLUBOOT 
suppins 0.277 0.277 0.277 0.277 0.277 0.277 
0.054 0.053 0.054 0.057 0.066 0.073 

totchr 0.394 0.394 0.394 0.394 0.394 0.394 
0.020 0.018 0.020 0.019 0.015 0.018 

age 0.015 0.015 0.015 0.015 0.015 0.015 
0.004 0.004 0.004 0.004 0.004 0.004 

female -0.088 -0.088 -0.088 -0.088 -0.088 -0.088 
0.053 0.054 0.053 0.052 0.051 0.063 

white 0.499 0.499 0.499 0.499 0.499 0.499 
0.163 0.193 0.194 0.199 0.181 0.174 

_cons 5.649 5.649 5.649 5.649 5.649 5.649 
0.341 0.352 0.364 0.391 0.306 0.369 


Legend: b/se 


In this example, there is at most 30% variation across the various 
standard errors. Because we use the log of medical expenditures, 
heteroskedasticity in a linear regression is expected to be greatly reduced, 
though this does not necessarily imply that heteroskedasticity in the errors 
from median regression is reduced. The cluster-robust standard errors were 
clustered on educyr for illustration and were not necessarily expected to 
make a big difference compared with heteroskedastic-robust standard errors. 


15.3.8 Heteroskedasticity test based on OLS regression 


One reason for coefficients differing across quantiles is the presence of 
heteroskedastic errors. From OLS output not shown here, the OLS standard 
errors are similar whether the default or robust estimates are obtained, 
suggesting little heteroskedasticity. And the logarithmic transformation of 
the dependent variable that has been used often reduces heteroskedasticity. 


We use estat hettest to test against heteroskedasticity that depends 
on the same variables as those in the OLS regression. We obtain 


. * Test for heteroskedasticity in linear model using estat hettest 
. qui regress ltotexp suppins totchr age female white 


. estat hettest suppins totchr age female white, iid 


Breusch-Pagan/Cook-Weisberg test for heteroskedasticity 
Assumption: i.i.d. error terms 
Variables: suppins totchr age female white 


H0: Constant variance 


chi2(5) = 71.38 
Prob > chi2 = 0.0000 


The null hypothesis of homoskedasticity of errors in OLS regression with 
dependent variable ltotexp is soundly rejected. This is consistent with the 
earlier finding using the qreg2 command that homoskedasticity in median 
regression was rejected. 


15.3.9 Test of coefficient equality across quantiles 


One can conduct hypothesis tests of equality of the regression coefficients at 
different conditional quantiles. 


Consider a test of the equality of the coefficient of suppins from QR with 
q = 0.25, q = 0.50, and q = 0.75. We first estimate using sqreg, rather than 
qreg or bsqreg, to obtain the full covariance matrix of coefficients, and then 
we test. Because this uses the bootstrap, we need to set the seed and number 
of bootstrap replications. 


. * Simultaneous QR regression with several values of q 
. set seed 10101 


. sqreg ltotexp suppins totchr age female white, q(.25 .50 .75) reps(400) nodots 


Simultaneous quantile regression                 Number of obs =     2,955 
  bootstrap(400) SEs                             .25 Pseudo R2 =    0.1292 
                                                 .50 Pseudo R2 =    0.1009 
                                                 .75 Pseudo R2 =    0.0873 

             |              Bootstrap 
     ltotexp | Coefficient  std. err.      t    P>|t|   [95% conf. interval] 
-------------+-------------------------------------------------------------- 
q25          | 
     suppins |   .3856797   .0642742     6.00   0.000   .2596529    .5117065 
      totchr |    .459022   .0234579    19.57   0.000   .4130265    .5050175 
         age |   .0155106   .0043944     3.53   0.000   .0068941    .0241271 
      female |  -.0160694   .0581328    -0.28   0.782  -.1300543    .0979155 
       white |   .3375936   .1110348     3.04   0.002     .11988    .5553072 
       _cons |   4.747962   .3485751    13.62   0.000   4.064487    5.431438 
-------------+-------------------------------------------------------------- 
q50          | 
     suppins |   .2769771   .0579685     4.78   0.000   .1633144    .3906398 
      totchr |   .3942664   .0195859    20.13   0.000    .355863    .4326698 
         age |   .0148666   .0044102     3.37   0.001   .0062192    .0235139 
      female |  -.0880967   .0554863    -1.59   0.112  -.1968926    .0206992 
       white |   .4987457   .2199888     2.27   0.023   .0673984    .9300929 
       _cons |   5.648891   .3966791    14.24   0.000   4.871095    6.426687 
-------------+-------------------------------------------------------------- 
q75          | 
     suppins |   .1488548   .0649951     2.29   0.022   .0214143    .2762952 
      totchr |   .3735364   .0228424    16.35   0.000   .3287478     .418325 
         age |   .0182506   .0049533     3.68   0.000   .0085383     .027963 
      female |  -.1219365   .0562735    -2.17   0.030  -.2322759   -.0115971 
       white |   .1931923   .2045296     0.94   0.345  -.2078428    .5942275 
       _cons |   6.599972   .4247018    15.54   0.000    5.76723    7.432714 


sqreg estimates a QR function for each specified quantile. The coefficients of 
most of the regressors differ substantially across the quantiles. 


We use the test command to perform a Wald test of the hypothesis that 
the coefficients on suppins are the same for the three quantiles. Because we 
are comparing estimates from different equations, we need a prefix to 
indicate each equation. Here the prefix for the model with q = 0.25, for 
example, is [q25]. To test that coefficients on the same variable have the 
same value in different equations, we use the syntax 


test [eqname = eqname [= ...]]: coeflist 


We obtain 


. * Test of coefficient equality across QR with different q 
. test [q25=q50=q75]: suppins 


( 1) [q25]suppins - [q50]suppins = 0 
( 2) [q25]suppins - [q75]suppins = 0 
F( 2, 2949) = 5.28 
Prob > F = 0.0051 


The null hypothesis of coefficient equality is rejected at significance level 
0.05. 
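
A pairwise version of such a test, together with an estimate of the size of the difference, can be sketched as follows. The example uses the auto dataset shipped with Stata rather than the chapter's data, and the regressors and 100 replications are illustrative assumptions.

```stata
* Sketch: pairwise test and estimated difference of a coefficient across quantiles
sysuse auto, clear                 // example data shipped with Stata
set seed 10101
quietly sqreg price weight length, quantiles(.25 .50 .75) reps(100)
* Wald test that the weight coefficient is equal at q = .25 and q = .75
test [q25]weight = [q75]weight
* Point estimate, standard error, and confidence interval for the difference
lincom [q75]weight - [q25]weight
```

The lincom output complements the Wald test by reporting the magnitude of the interquantile difference, using the joint bootstrap covariance matrix from sqreg.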


15.3.10 Graphical display of coefficients over quantiles 


An attractive way to present QR results is via a graphical display of 
coefficients of interest and their respective confidence intervals. This can be 
done manually by estimating the parameters of the QR model for a range of 
values of q, saving the results to file, and producing separate graphs for each 
regressor of the estimated coefficient plotted against the quantile q. 


This is done by the community-contributed grqreg command (Azevedo 
2004), which provides 95% confidence intervals in addition to estimated 
coefficients. One of the qreg, bsqreg, or sqreg commands must first be 
executed, and the confidence intervals use the standard errors from 
whichever command is used. The grqreg command does not have enormous 
flexibility. In particular, it plots coefficients for all regressors, not just 
selected regressors. 


We use grqreg with the options cons to include the intercept in the 
graph, ci to include a 95% confidence interval, and ols and olsci to include 
the OLS coefficient and its 95% confidence interval. The graph option 
scale(1.1) is added to increase the size of the axis titles. The command 
uses variable labels on the y axis of each plot, so we provide better variable 
labels for two of the regressors. We have 
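
When a plot for a single selected regressor is wanted, the manual approach can be sketched as below. This is an illustrative sketch, not the method used by grqreg: it uses the auto dataset shipped with Stata, default qreg standard errors, and hypothetical names (q, b, lb, ub) for the posted results.

```stata
* Sketch: plot one regressor's QR coefficient and 95% CI against q
sysuse auto, clear                     // example data shipped with Stata
tempname memhold
tempfile results
postfile `memhold' q b lb ub using `results'
forvalues i = 1/9 {
    local q = `i'/10
    quietly qreg price weight length, quantile(`q')
    * Save the coefficient of weight and its 95% confidence limits
    post `memhold' (`q') (_b[weight])                          ///
        (_b[weight] - invttail(e(df_r), .025)*_se[weight])     ///
        (_b[weight] + invttail(e(df_r), .025)*_se[weight])
}
postclose `memhold'
use `results', clear
twoway (rarea lb ub q, color(gs13)) (line b q),                ///
    xtitle("Quantile q") ytitle("Coefficient of weight") legend(off)
```

The commands use temporary names and files, so they are intended to be run together from a do-file. Replacing qreg with bsqreg (after set seed) would give bootstrap-based intervals instead.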


. * Plots of each regressor’s coefficients as quantile q varies 
. set seed 10101 

. qui bsqreg ltotexp suppins totchr age female white, reps(400) 
. grqreg, cons ci ols olsci scale(1.1) seed(10101) 


In figure 15.2, the horizontal lines are the OLS point estimates and 95% 
confidence intervals that do not vary with the quantile. The top middle plot 
shows that the coefficient on suppins is positive over most of the range of q, 
with a much larger effect at lower conditional quantiles. In the lower 
conditional quantiles, the point estimates suggest that supplementary 
insurance is associated with 40% higher medical expenditures (recall that 
because the dependent variable is in logs, coefficients can be interpreted as 
semielasticities). Notice that confidence intervals widen at both the extreme 
upper and lower quantiles. 

Figure 15.2. QR and OLS coefficients and confidence intervals for 
each regressor as q varies from 0 to 1 


15.3.11 Censored conditional QR 


CQR methods are motivated by distributional analysis, but the underlying 
data are often censored, often from below at zero, as is the case with 
individual expenditure data. Then we observe $y_i = \max(y_i^*, c)$, where c 
is the censoring point, and interest lies in the quantiles 
$Q_q(y^*|x) = x'\beta_q$. 


The community-contributed cqiv command implements CQR methods for 
censored data proposed by Chernozhukov et al. (2019). It additionally allows 
for one of the continuous regressors to be endogenous. 


15.4 CQR for generated heteroskedastic data 


To gain more insight on CQR, we consider a simulation example where the 
quantiles are known to be linear, and we specify a particular form of 
multiplicative heteroskedasticity using generated data. 


15.4.1 Simulated dataset 


We use a simulated dataset, one where the conditional mean of y depends on 
the regressors x2 and x3, while the conditional variance depends on only x2. 


If $y = x'\beta + u$ and $u = x'\alpha \times \varepsilon$, and it is assumed 
that $x'\alpha > 0$ and that $\varepsilon$ is i.i.d., then the quantiles are 
linear in x with the qth conditional quantile 
$Q_q(y|x) = x'\{\beta + \alpha \times F_\varepsilon^{-1}(q)\}$; see section 
15.3.3. So for regressors that appear in the conditional mean but not in the 
heteroskedasticity function (that is, $\alpha_j = 0$), the CQR coefficients do 
not change with q, while for other regressors, the coefficients change even 
though the conditional quantile function is linear in x. 
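
The linear-quantile result can be derived in two short steps. The derivation below is a sketch that assumes, in addition to the conditions stated above, that $\varepsilon$ is independent of x with distribution function $F_\varepsilon$.

```latex
% Step 1: x'\alpha > 0, so scaling \varepsilon by x'\alpha preserves ordering
% and the conditional quantile of u scales accordingly.
Q_q(u \mid x) = (x'\alpha)\, F_\varepsilon^{-1}(q)

% Step 2: x'\beta is fixed given x, so it shifts the quantile one for one.
Q_q(y \mid x) = x'\beta + (x'\alpha)\, F_\varepsilon^{-1}(q)
             = x'\{\beta + \alpha \times F_\varepsilon^{-1}(q)\}

% Hence the jth CQR coefficient is
\beta_{qj} = \beta_j + \alpha_j F_\varepsilon^{-1}(q),
% which equals \beta_j for every q whenever \alpha_j = 0.
```

If $\varepsilon$ is symmetric about zero, then $F_\varepsilon^{-1}(0.5) = 0$, so the median regression coefficients equal $\beta$ for all regressors.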


If we let $y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + u$, where 
$u = (\alpha_1 + \alpha_2 x_2) \times \varepsilon$, then the CQR coefficients 
for x2 will change with q, while those for x3 will not. This result requires 
that $\alpha_1 + \alpha_2 x_2 > 0$, so we set $\alpha_1 > 0$ and 
$\alpha_2 > 0$ and generate x2 from a $\chi^2(1)$ distribution. 


The specific data-generating process (DGP) is 


$$y = 1 + 1 \times x_2 + 1 \times x_3 + u; \quad x_2 \sim \chi^2(1), \quad x_3 \sim N(0, 25)$$ 
$$u = (0.1 + 0.5 \times x_2) \times \varepsilon; \quad \varepsilon \sim N(0, 25)$$ 

We expect that the CQR estimates of the coefficient of x3 will take the 
relatively unchanged value of 1 as the quantiles vary, while the CQR estimates 
of the coefficient of x2 will increase as q increases (because the 
heteroskedasticity is increasing in x2). 


We first generate the data as follows: 


. * Generated dataset with heteroskedastic errors 
. clear all 

. set seed 10101 

. qui set obs 10000 

. generate x2 = rchi2(1) 

. generate x3 = 5*rnormal(0) 

. generate e = 5*rnormal(0) 

. generate u = (.1+0.5*x2)*e 

. generate y = 1 + 1*x2 + 1*x3 + u 

. summarize e x2 x3 u y 

    Variable |        Obs        Mean    Std. dev.        Min        Max 
-------------+--------------------------------------------------------- 
           e |     10,000   -.0704935    4.978031  -18.59243   19.77932 
          x2 |     10,000    .9866486    1.384568   1.95e-09   14.02999 
          x3 |     10,000    -.020573    5.018672  -18.65835   16.77711 
           u |     10,000   -.0389596    4.348951  -52.94281   53.22532 
           y |     10,000    1.927116    6.778566  -42.21168   60.20479 


The summary statistics confirm that x3 and e have approximately a mean of 
0 and a variance of 25 and that x2 has a mean of 1 and a variance of 2, as 
desired. The output also shows that the heteroskedasticity has induced 
unusually extreme values of u and y that are more than 10 standard 
deviations from the mean. 


We study the distribution of y further by using several plots. We have 


. * Generate scatterplots and qplot 


. qui kdensity u, scale(1.25) lwidth(medthick) saving(graph1.gph, replace) 


. qui qplot y, recast(line) scale(1.4) lwidth(medthick) 
> saving(graph2.gph, replace) xtitle("Fraction of the data") 


. qui scatter y x2, msize(tiny) scale(1.25) saving(graph3.gph, replace) 


. qui scatter y x3, msize(tiny) scale(1.25) saving(graph4.gph, replace) 


. graph combine graph1.gph graph2.gph graph3.gph graph4.gph 


This leads to figure 15.3. The first panel, with the kernel density estimate 
of u, shows that the distribution of the error u is essentially symmetric but 
has very long tails. The second panel shows the quantiles of y and indicates 
symmetry. The third panel plots y against x2 and indicates heteroskedasticity 
and the strongly nonlinear way in which x2 enters the conditional variance 
function of y. The fourth panel shows no such relationship between y and x3. 

Figure 15.3. Density of u, quantiles of y, and scatterplots of 
(y, x2) and (y, x3) 


Here x2 affects both the conditional mean and variance of y, whereas x3 
enters only the conditional mean function. The effect of the regressor x2 
will differ across the conditional quantiles, whereas the effect of x3 will 
be the same at every quantile. The OLS regression can display the 
relationship only between average y and (x2, x3). CQR, however, can show the 
relationship between the regressors and the distribution of y. 


15.4.2 CQR estimates 


We next estimate the regression using OLS (with default and 
heteroskedasticity-robust standard errors) and QR at q = 0.25, 0.50, and 
0.75, with bootstrap standard errors. The saved results are displayed in a 
table. The relevant commands and the resulting output are as follows: 


. * OLS and quantile regression for q = .25, .5, .75 
. qui regress y x2 x3 


. estimates store OLS 

. qui regress y x2 x3, vce(robust) 

. estimates store OLS_Rob 

. set seed 10101 

. qui bsqreg y x2 x3, quantile(.25) reps(400) 
. estimates store QR_25 

. qui bsqreg y x2 x3, quantile(.50) reps(400) 
. estimates store QR_50 

. qui bsqreg y x2 x3, quantile(.75) reps(400) 
. estimates store QR_75 

. estimates table OLS OLS_Rob QR_25 QR_50 QR_75, b(%7.3f) se 


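    Variable |    OLS     OLS_Rob     QR_25     QR_50     QR_75
-------------+---------------------------------------------------
          x2 |   0.978     0.978     -0.666     0.986     2.573
             |   0.031     0.097      0.066     0.068     0.077
          x3 |   1.000     1.000      1.000     1.001     1.003
             |   0.009     0.008      0.003     0.003     0.003
       _cons |   0.983     0.983      0.638     0.977     1.345
             |   0.053     0.068      0.018     0.019     0.019
-----------------------------------------------------------------
                                                    Legend: b/se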


The median regression parameter point estimates of $\widehat{\beta}_{0.5,2}$ and $\widehat{\beta}_{0.5,3}$ in 
this example with symmetrically distributed errors are close to the true 
values of 1. Interestingly, the median regression parameter estimates are 
much more precise than the OLS parameter estimates. This improvement is 
possible because OLS is no longer fully efficient when there is 
heteroskedasticity. 


Because the heteroskedasticity depends on x2 and not on x3, the 
estimates of $\beta_{q2}$ vary over the quantiles q, while $\beta_{q3}$ is invariant with respect 
to q. The DGP implies that the coefficient of x2 at quantile q equals 
$1 + 0.5 \times F_e^{-1}(q)$, for example, $1 + 0.5 \times F_e^{-1}(0.75)$ at q = 0.75, 
where $e \sim N(0, 5^2)$. 


. * Predicted coefficients of x2 from quantile regression 
. qui summarize e, detail 

. display "Predicted coefficient of x2 for q = .25, .50, and .75" _newline 
>     "are " 1+.5*5*invnormal(0.25) ", " 1+.5*5*invnormal(0.5) 
>     ", and " 1+.5*5*invnormal(0.75) 
Predicted coefficient of x2 for q = .25, .50, and .75 
are -.68622438, 1, and 2.6862244 


For q = 0.75, for example, the estimated coefficient of x2 is 2.573, close to 
the theoretical value of 2.686. 
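

The predicted coefficients follow from a short derivation, sketched here 
under assumptions that are not restated in this section: an intercept of 1, 
and a multiplicative error u = (\gamma_0 + 0.5 x2) e with a positive scale 
factor, where \gamma_0 is a placeholder for whatever constant the DGP of 
section 15.4.1 uses. Only the slope 0.5 and the distribution of e are taken 
from the text above. 

```latex
\begin{aligned}
y &= 1 + x_2 + x_3 + u, \qquad u = (\gamma_0 + 0.5\,x_2)\,e, \qquad e \sim N(0, 5^2) \\
Q_q(y \mid x_2, x_3) &= 1 + x_2 + x_3 + (\gamma_0 + 0.5\,x_2)\,F_e^{-1}(q) \\
 &= \{1 + \gamma_0 F_e^{-1}(q)\} + \{1 + 0.5\,F_e^{-1}(q)\}\,x_2 + x_3 \\
F_e^{-1}(q) &= 5\,\Phi^{-1}(q)
\end{aligned}
```

The coefficient of x2 is then 1 + 0.5 x 5 x invnormal(q), which is the 
expression evaluated by the display command above, while the coefficient of 
x3 equals 1 at every quantile. 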


We can test the difference in CQR estimates by using the sqreg 
command. A test of $\beta_{0.25,2} = \beta_{0.75,2}$ can be interpreted as a robust test of 
heteroskedasticity independent of the functional form of the 
heteroskedasticity. The test is implemented as follows: 


. * Test equality of coeff of x2 and equality of coeff of x3 for q=.25 and q=.75 
. set seed 10101 

. qui sqreg y x2 x3, q(.25 .75) reps(400) 

. test [q25]x2 = [q75]x2 
 ( 1)  [q25]x2 - [q75]x2 = 0 

       F(  1,  9997) = 1450.56 
            Prob > F =    0.0000 

. test [q25]x3 = [q75]x3 
 ( 1)  [q25]x3 - [q75]x3 = 0 

       F(  1,  9997) =    0.67 
            Prob > F =    0.4128 


The test strongly rejects the null hypothesis that the coefficient of x2 is 
the same at q = 0.25 and q = 0.75, which is consistent with x2 affecting 
both the location and scale of y. As expected, the test for x3 yields 
a p-value of 0.41, which does not lead to a rejection of the null hypothesis. 


15.5 Quantile treatment effects for a binary treatment 


The treatment evaluation literature considers in detail the effect on an 
outcome of interest of a binary treatment. The simplest and cleanest example 
is an experiment with random assignment of individuals to either a treated 
group or a control group. Basic difference-in-means analysis calculates the 
average treatment effect $\bar{y}_{\text{treat}} - \bar{y}_{\text{control}}$ and tests whether this is 
statistically different from zero. A richer analysis estimates the difference at 
various quantiles, rather than the difference in means. 


Consider a binary treatment D that takes values 1 if treated and 0 if not 
treated. The quantile treatment effect (QTE) of D at a selected quantile q is a 
measure of the shift in the distribution of y resulting from a unit shift in the 
treatment variable. That is, 


$$\Delta_q = Q_q(Y \mid D = 1) - Q_q(Y \mid D = 0)$$ 


As an example, consider the effect on the quantiles of ltotexp of whether 
the person has supplementary health insurance. Quantiles for those with or 
without supplementary insurance can be obtained using the centile 
command or, equivalently, using CQR on a constant. 


For example, separate estimation of the 25th quantile by value of 
suppins yields 


. * QTE at q=0.25 by separate estimation for suppins==0 and suppins==1 
. qui use mus203mepsmedexp, clear 


. qui drop if ltotexp ==. 
. qreg ltotexp if suppins==0, quantile(0.25) vce(robust) nolog 


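.25 Quantile regression                             Number of obs =      1,207
  Raw sum of deviations  559.861 (about 7.0282016)
  Min sum of deviations  559.861                    Pseudo R2     =     0.0000

------------------------------------------------------------------------------
             |               Robust
     ltotexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |   7.028202    .055348   126.98   0.000     6.919613    7.136791
------------------------------------------------------------------------------

. scalar q25_0 = _b[_cons] 

. qreg ltotexp if suppins==1, quantile(0.25) vce(robust) nolog 

.25 Quantile regression                             Number of obs =      1,748
  Raw sum of deviations  720.714 (about 7.4235682)
  Min sum of deviations  720.714                    Pseudo R2     =     0.0000

------------------------------------------------------------------------------
             |               Robust
     ltotexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |   7.426549   .0403812   183.91   0.000     7.347348    7.505749
------------------------------------------------------------------------------

. scalar q25_1 = _b[_cons] 

. display "QTE of suppins at q=25 = " q25_1 - q25_0 
QTE of suppins at q=25 = .39834738 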


The QTE at q = 0.25 is the difference 7.426549 - 7.028202 = 0.398347, 
which is quite substantial. 
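

The same comparison can be sketched with the centile command mentioned 
above. The sketch assumes that the book dataset mus203mepsmedexp is in the 
working directory; the scalar names c25_0 and c25_1 are ours. Because 
centile by default interpolates between adjacent order statistics, the 
result may differ slightly from the qreg-based value. 

```stata
* Sketch: QTE at q = 0.25 from sample quantiles computed by centile
* Assumes mus203mepsmedexp.dta is available in the working directory
use mus203mepsmedexp, clear
drop if ltotexp == .

* 25th centile of ltotexp for those without supplementary insurance
quietly centile ltotexp if suppins == 0, centile(25)
scalar c25_0 = r(c_1)

* 25th centile of ltotexp for those with supplementary insurance
quietly centile ltotexp if suppins == 1, centile(25)
scalar c25_1 = r(c_1)

display "QTE of suppins at q = 0.25 (centile) = " c25_1 - c25_0
```
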


The same QTE estimate can be obtained more simply by CQR of ltotexp 
on an intercept and suppins. 


. * QTE at q=0.25 by direct CQR on intercept and suppins 
. qreg ltotexp suppins, quantile(0.25) vce(robust) nolog 


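.25 Quantile regression                             Number of obs =      2,955
  Raw sum of deviations 1293.577 (about 7.2675252)
  Min sum of deviations 1280.575                    Pseudo R2     =     0.0101

------------------------------------------------------------------------------
             |               Robust
     ltotexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     suppins |   .3983474   .0700433     5.69   0.000     .2610088     .535686
       _cons |   7.028202   .0569596   123.39   0.000     6.916517    7.139886
------------------------------------------------------------------------------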


Note that the QTE is not measuring the impact of supplementary 
insurance for a given person. That is, we do not assume that a person who is 
in the 25th percentile without supplementary insurance would also be in the 
25th percentile if he or she instead had supplementary insurance, a strong 
restriction called rank invariance or rank preservation, which is discussed 
further in section 25.8.4. 


A useful way to present results for a range of quantiles is to use graphical 
displays. 


We first obtain separate quantile plots by the two values of suppins 
using the community-contributed qplot command and overlay these on the 
same graph. 


. * Quantile plot of ltotexp for those with and without supplementary insurance 
. qplot ltotexp, over(suppins) clp(l _) recast(line) scale(1.1) 
>     ytitle("Quantiles of lntotexp") xtitle("Fraction of the data") 
>     legend(pos(11) ring(0) col(1)) 
>     legend(label(1 "suppins=0") label(2 "suppins==1")) 


The first panel of figure 15.4 gives the plots. The vertical distance 
between the two curves at a given value of q measures the unconditional ME 
of insurance. For example, at q = 0.25, we have the values 7.43 and 7.03 
that were obtained earlier. Over most of the range of log(expenditure), 
having supplementary insurance is associated with higher expenditure; an 
exception occurs around the 90th percentile. The difference is greatest at 
lower quantiles. 

Figure 15.4. Quantile and density plots of log expenditure for 
those with and without supplementary insurance 


The QTE at various quantiles can be plotted using grqreg following CQR 
of ltotexp on an intercept and suppins. 


. * Plot of the QTE of supplementary insurance at different quantiles 
. set seed 10101 


. qui bsqreg ltotexp suppins, reps(400) 
. grqreg, ci ols olsci scale(1.1) seed(10101) 


The second panel of figure 15.4 gives the plot. The solid line gives the 
QTEs, and the shaded area gives 95% confidence intervals. For example, at 
q = 0.25 the solid line has value 0.40, as obtained earlier. It is clear that the 
QTE is much larger at the lowest quantiles and is actually slightly negative at 
the highest quantile. The dashed lines give the OLS estimate, which is the 
average treatment effect, along with 95% confidence intervals. 
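

The numbers behind such a plot can also be listed directly. The following 
is a minimal sketch, again assuming the book dataset mus203mepsmedexp is 
available; the chosen quantiles are illustrative, and robust rather than 
bootstrap standard errors are used to keep the loop fast. 

```stata
* Sketch: list the QTE of suppins at several quantiles
* Assumes mus203mepsmedexp.dta is available in the working directory
use mus203mepsmedexp, clear
drop if ltotexp == .

foreach q of numlist 0.10 0.25 0.50 0.75 0.90 {
    * CQR of ltotexp on an intercept and suppins at quantile q
    quietly qreg ltotexp suppins, quantile(`q') vce(robust)
    display "q = " %4.2f `q' "   QTE = " %8.4f _b[suppins] ///
        "   se = " %8.4f _se[suppins]
}
```
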


this approach applied to an experiment with random assignment. Extension 
to nonexperimental settings such as the example of this chapter, however, is 
not straightforward. 


First, the QTEs in our example can be given a causal interpretation only if 
supplementary insurance was randomly assigned. Random assignment 
allows the distribution of ltotexp for those without supplementary 
insurance to also be used as the distribution of ltotexp for those with 
supplementary insurance in the counterfactual situation of their not having 
supplementary insurance. 


Second, to obtain QTEs with a causal interpretation, we need to control 
for the endogeneity of suppins. A simple regression approach assumes it is 
sufficient to add additional control variables and estimate by CQR methods 
already presented. An instrumental variables approach uses instruments for 
suppins. Several methods for quantile instrumental variables have been 
proposed, based on different assumptions. 


Finally, we may ultimately be interested in the impact of suppins on the 
marginal distribution or unconditional distribution of ltotexp, rather than on 
the distribution of ltotexp after conditioning on variables such as totchr, 
age, .... 


These complications are handled by quantile instrumental variables and 
by unconditional QR methods presented in chapter 25. 


15.6 Additional resources 


The Stata commands related to qreg are bsqreg, iqreg, sqreg, and the 
community-contributed command qreg2. See [R] qreg and [R] qreg 
postestimation for details of Stata's official CQR commands. The 
community-contributed qplot command is illustrated by its author in some 
detail in Cox (2005). The community-contributed grqreg command was 
created by Azevedo (2004). 


15.7 Exercises 


1. Consider the medical expenditures data example of section 15.3, except 
use totexp rather than ltotexp as the dependent variable. Use the same 
sample, so still drop if ltotexp==.. Estimate the parameters of the 
model with q = 0.5 using qreg, and comment on the parameter 
estimates. Reestimate using bsqreg and compare results. Use sqreg to 
estimate at the quantiles 0.25, 0.50, and 0.75. Compare these estimates 
with each other (and their precision) and also with OLS (with robust 
standard errors). Use the community-contributed grqreg command after 
bsqreg to further compare estimates as q varies. 

2. Use the medical expenditures data of section 15.3. Show that the median 
of totexp equals the exponential of the median of ltotexp. Now, add a 
single regressor, the indicator female. Then any conditional quantile 
function must be linear in the regressor, with 
$Q_q(\ln y \mid x) = \alpha_{q1} + \alpha_{q2}\,\text{female}$ and 
$Q_q(y \mid x) = \beta_{q1} + \beta_{q2}\,\text{female}$. Show that if we estimate qreg 
ltotexp female, then predict, and finally exponentiate the prediction, 
we get the same prediction as that directly from qreg totexp female. 
Now, add another regressor, say, totchr. Then the conditional quantile 
may no longer be linear in female and totchr. Repeat the prediction 
exercise, and show that the invariance under the transformation property 
no longer holds. 

3. Use the medical expenditures data example of section 15.3 with the 
dependent variable ltotexp. Test the hypothesis that heteroskedasticity 
is a function of the single variable totchr, which measures the number 
of chronic conditions. Record the test outcome. Next, test the hypothesis 
that the location and scale of the dependent variable expenditures vary 
with totchr. What is the connection between the two parts of this 
question? 

4. Use the medical expenditures data of section 15.3, and estimate the 
parameters of the model for ltotexp using qreg with q = 0.5. Then, 
estimate the same parameters using bsqreg with the number of bootstrap 
replications set at 10, 50, 100, and 400. In each case, use the same seed 
of 10101. Would you say that a high number of replications produces 
substantially different standard errors? 

5. Consider the heteroskedastic regression example of section 15.4. Change 
the specification of the variance function so that the variance function is 
a function of x3 and not x2; that is, reverse the roles of x2 and x3. 
Estimate QRs for the generated data for q = 0.25, 0.50, and 0.75. 
Compare the results you obtain with those given in section 15.4.2. Next, 
vary the coefficient of x3 in the variance function, and study its impact 
on the QR estimates. 

6. Consider the heteroskedasticity example of section 15.4. There the 
regression error is symmetrically distributed. Suppose we want to study 
whether the differences between OLS and QR results are sensitive to the 
shape of the error distribution. Make suitable changes to the simulation 
data, and implement an analysis similar to that in section 15.4 with 
asymmetric errors. For example, first draw u from the uniform 
distribution, and then apply the transformation $-\lambda \log(u)$, where $\lambda > 0$. 
(This generates draws from the exponential distribution with a mean of $\lambda$ 
and a variance of $\lambda^2$.) 

7. When the number of regressors in the QR is very large and one wants to 
generate graphs only for selected coefficients, it may be necessary to 
write one's own code to estimate and save the coefficients. This would 
be followed by a suitable twoway plot. The following program uses 
postfile to save the output in bsqrcoef1.dta, forvalues to loop 
around values of q = 0.1(0.1)0.9, and bsqreg to estimate bootstrap 
standard errors. Run the following program, and use bsqrcoef1.dta and 
the twoway command to generate a plot for the coefficient of suppins as 
q varies. 


* Save coefficients, and generate graph for a range of quantiles 
use mus03data.dta, clear 
drop if ltotexp == . 
capture program drop mus07plot 
program mus07plot 
  postfile myfile percentile b1 upper lower using bsqrcoef1.dta, replace 
  forvalues tau1=0.10(0.1)0.9 { 
    set seed 10101 
    quietly bsqreg ltotexp suppins age female white totchr, /// 
      quantile(`tau1') reps(400) 
    matrix b = e(b) 
    scalar b1=b[1,1] 
    matrix V = e(V) 
    scalar v1=V[1,1] 
    scalar df=e(df_r) 
    scalar upper = b1 + invttail(df,.025)*sqrt(v1) 
    scalar lower = b1 - invttail(df,.025)*sqrt(v1) 
    post myfile (`tau1') (b1) (upper) (lower) 
    matrix drop V b 
    scalar drop b1 v1 upper lower df 
  } 
  postclose myfile 
end 
mus07plot 
program drop mus07plot 
use bsqrcoef1.dta, clear 
twoway connected b1 percentile || line upper percentile || /// 
  line lower percentile, /// 
  title("Slope estimates") subtitle("Coefficient of suppins") /// 
  xtitle("Quantile", size(medlarge) ) /// 
  ytitle("Slope and confidence bands", size(medlarge) ) /// 
  legend( label(1 "Quantile slope coefficient") /// 
  label(2 "Upper 95% bs confidence band") /// 
  label(3 "Lower 95% bs confidence band") ) 
graph save bsqrcoef1.gph, replace 


8. Reconsider the analysis in section 15.3. Omit the variable female from 
the QR. Instead, run the regression separately for males and females for 
q = 0.25 and q = 0.75. Compare the results and comment on 
differences, if any. 


Appendix A 
Programming in Stata 


In this appendix, we build on the introduction to Stata programming given 
in chapter 1. We first present Stata matrix commands, introduced in 
section 1.5. The rest of the appendix focuses on aspects of writing Stata 
programs, using the program command introduced in section 5.3.1. We 
discuss programs to be included within a Stata do-file, ado-files, which are 
programs intended to be used by other Stata users, and some tips for 
program debugging that are relevant for even the simplest Stata coding. 


A.1 Stata matrix commands 


Here we consider Stata matrix commands, initiated with the matrix prefix. 
These provide a limited set of matrix commands sufficient for many uses, 
especially postestimation manipulation of results, as introduced in 
section 1.6, and are comparable with matrix commands provided in other 
econometrics packages. 


The separate appendix B presents Mata matrix commands, introduced in 
Stata 9. Mata is a full-blown matrix programming language, comparable 
with R, MATLAB, or Gauss. 


A.1.1 Stata matrix overview 


Key considerations are inputting matrices, either directly or by converting 
data variables into matrices, and performing operations on matrices or on 
subcomponents of the matrix such as individual elements. 


The basics are given in [U] 14 Matrix expressions and in [P] matrix. 
Useful online help commands include help matrix, help matrix operators, 
and help matrix functions. 


A.1.2 Stata matrix input and output 

There are several ways to input matrices in Stata. 

Matrix input by hand 

Matrix entries can be entered by using the matrix define command. For 
example, consider a 2 x 3 matrix with the first row entries 1, 2, and 3 and 
the second row entries 4, 5, and 6. Column entries are separated by commas, 
and rows are separated by a backslash. We have 


. * Define a matrix explicitly and list the matrix 
. matrix define A = (1,2,3 \ 4,5,6) 

. matrix list A 

A[2,3] 
    c1  c2  c3 
r1   1   2   3 
r2   4   5   6 


The word define can be omitted from the above command. 


The default names for the matrix rows are r1, r2, ..., and the column 
defaults are c1, c2, .... These names can be changed by using the matrix 
rownames and matrix colnames commands. For example, to give the names 
one and two to the two rows of matrix A, type the command 


. * Matrix row and column names 
. matrix rownames A = one two 


. matrix list A 


A[2,3] 

c1 c2 c3 
one 1 2 3 
two 4 5 6 


An alternative matrix naming command is matname. 


Matrix input from Stata estimation results 


Matrices can be constructed from matrices created by the Stata estimation 
command results stored in e() or r(). For example, after ordinary least- 
squares (OLS) regression, the variance-covariance matrix is stored in e(V). 
To give it a more obvious name or to save it for later analysis, we define a 
matrix equal to e(V). 


As a data example, we use the same dataset as in chapter 3. We use the 
first 100 observations and regress medical expenditures (ltotexp) on an 
intercept and chronic problems (totchr). We have 


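. * Read in data, summarize, and run regression 
. use mus203mepsmedexp 
(A.C.Cameron & P.K.Trivedi (2022): Microeconometrics Using Stata, 2e) 

. keep if _n <= 100 
(2,964 observations deleted) 

. drop if ltotexp == . | totchr == . 
(0 observations deleted) 

. summarize ltotexp totchr 

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     ltotexp |        100    4.533688    .8226942   1.098612   5.332719
      totchr |        100         .48     .717459          0          3

. regress ltotexp totchr, noheader 
------------------------------------------------------------------------------
     ltotexp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      totchr |   .1353098   .1150227     1.18   0.242    -.0929489    .3635685
       _cons |   4.468739   .0989462    45.16   0.000     4.272384    4.665095
------------------------------------------------------------------------------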


A command to drop observations with missing values from the dataset in 
memory is included because not all matrix operators considered below 
handle missing values. 


We then obtain the variance matrix stored in e(V) and list its contents. 


. * Create a matrix from estimation results 
. matrix vbols = e(V) 

. matrix list vbols 


symmetric vbols[2,2] 
totchr _cons 
totchr .01323021 
_cons -.0063505 .00979036 


Stata has incorporated the regressor names into the estimate of the 
variance-covariance matrix of the estimator so that vbols has rows and 
columns named totchr and _cons. 
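

As a small illustration of why saving e(V) is useful, the following sketch 
recovers the standard error of the coefficient of totchr from the stored 
matrix and compares it with the value Stata reports in _se[totchr]. It 
assumes the book dataset mus203mepsmedexp is available; the scalar name 
se_totchr is ours. 

```stata
* Sketch: recover a standard error from the stored variance matrix
* Assumes mus203mepsmedexp.dta is available in the working directory
use mus203mepsmedexp, clear
keep if _n <= 100
quietly regress ltotexp totchr

matrix vbols = e(V)
* The (1,1) entry is the estimated variance of the coefficient of totchr
scalar se_totchr = sqrt(vbols[1,1])
display "se from vbols = " se_totchr "   _se[totchr] = " _se[totchr]
```
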


A.1.3 Stata matrix subscripts and combining matrices 


Matrix subscripts are represented in square brackets. The entry (i, j) in a 
matrix is denoted [i,j]. For example, to set the (1, 1) entry in matrix A to 
equal the (1, 2) entry, type the command 


. * Change value of an entry in matrix 
. matrix A[1,1] = A[1,2] 

. matrix list A 

A[2,3] 
     c1  c2  c3 
one   2   2   3 
two   4   5   6 


If the row or column has a name, one can alternatively use this name. For 
example, because row 1 of A is named one, we could have typed matrix 
A[1,1] = A["one",2]. 


For a column vector, the ith entry is denoted by [i,1] rather than simply 
[i]. Similarly, for a row vector, the jth entry is denoted by [1,j] rather than 
simply [j]. 


Matrix subscripts can be given as a range, permitting a submatrix to be 
extracted from a matrix. For example, to extract all the rows and columns 
2-3 from matrix A, type 


. * Select part of matrix 
. matrix B = A[1...,2..3] 

. matrix list B 


B[2,2] 

c2 c3 
one 2 3 
two 5 6 


Here k... selects the kth entry onward, and k..l selects the kth through 
lth entries. 


To add or append rows to a matrix, use the vertical concatenation 
operator \. For example, A \ B adds rows of B after the rows of A. Similarly, 
to add columns to a matrix, use the horizontal concatenation operator ,. For 
example, 

. * Add columns to an existing matrix 
. matrix C = B, B 

. matrix list C 

C[2,4] 
     c2  c3  c2  c3 
one   2   3   2   3 
two   5   6   5   6 
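

Vertical concatenation works the same way. The following is a minimal 
sketch that rebuilds A and B so that it can be run on its own; the matrix 
names E and F are ours. 

```stata
* Sketch: append rows with the vertical concatenation operator \
matrix A = (1,2,3 \ 4,5,6)
matrix B = A[1...,2..3]

matrix E = B \ B          // 4 x 2: rows of B stacked below rows of B
matrix F = A \ (7,8,9)    // 3 x 3: one new row appended to A

matrix list E
matrix list F
```

The matrices being stacked must have the same number of columns, just as 
horizontal concatenation requires the same number of rows. 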


A.1.4 Matrix operators 


All the standard matrix operators can be applied, provided that the matrices 
are conformable. The operators are + to add, - to subtract, * to multiply, and 
# for the Kronecker product. In addition, the multiplication operator can be 
used for multiplication by a scalar, for example, 2*A or A*2, and scalar 
division is possible, for example, A/2. A single apostrophe, ', gives the 
matrix transpose. To compute A'A, we use A'*A. For example, 


. * Matrix operators 
. matrix D = C + 3*C 


. matrix list D 


D[2,4] 

c2 c3 c2 c3 
one 8 12 8 12 
two 20 24 20 24 


A.1.5 Matrix functions 


Standard matrix functions are defined by using parentheses, (). Some 
functions lead to a scalar result, for example, 


. * Matrix functions 
. matrix r = rowsof(D) 

. matrix list r 

symmetric r[1,1] 
    c1 
r1   2 


In this example, it is more convenient to store the result as a scalar, 
rather than in a 1 x 1 matrix. For example, 


. * Can use scalar if 1 x 1 matrix 
. scalar ralt = rowsof(D) 

. display ralt 
2 


Functions that produce scalars include colsof(A), det(A), rowsof(A), and 
trace(A). 


Other commands produce matrices. For example, matrix B, created 
earlier, is a nonsymmetric square matrix, with the inverse 


. * Inverse of nonsymmetric square matrix 
. matrix Binv = inv(B) 


. matrix list Binv 

Binv[2,2] 

one two 
c2 -2 1 
c3 1.6666667 -.66666667 


Here are some functions that produce matrices: cholesky(A), corr(A), 
diag(A), hadamard(A,B), I(n), inv(A), invsym(A), vec(A), and 
vecdiag(A). 
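

A few of these functions can be checked on a small example. The sketch 
below uses the same entries as matrix B above and verifies that B times its 
inverse is, up to rounding, the identity matrix; the names detB, trB, and 
check are ours. 

```stata
* Sketch: scalar- and matrix-valued functions of a 2 x 2 matrix
matrix B = (2,3 \ 5,6)

scalar detB = det(B)        // determinant
scalar trB  = trace(B)      // sum of the diagonal entries
matrix Binv = inv(B)        // inv() handles nonsymmetric square matrices
matrix check = B*Binv       // should be close to I(2)

display "det(B) = " detB ", trace(B) = " trB
matrix list check
```
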


A.1.6 Matrix accumulation commands 


Most estimators, such as the OLS estimator $(X'X)^{-1}X'y$, require the 
computation of matrix cross products. We strongly recommend that you do 
not put your data into Stata matrices. Stata has accumulation commands that 
compute cross products from variables and return the results in Stata 
matrices. If you really want to put your data into a matrix, refer to 
appendix B on Mata. 


Stata's matrix accumulation commands compute the matrix cross 
products X'X and X'y without requiring the intermediate step of forming 
the much larger matrices X and y. 


As an example, the matrix accum A = v1 v2 command creates a 3 x 3 
matrix A = Z'Z, where Z is an N x 3 matrix with columns of the variables 
v1 and v2 and a column of 1s that accum automatically appends unless the 
noconstant option is used. The companion matrix vecaccum A = w v1 v2 
command creates a 1 x 3 row vector A = w'Z, where w is an N x 1 
column vector holding the variable w and Z is an N x 3 matrix 
with columns of the variables v1 and v2 and a column of 1s that vecaccum 
again automatically adds at the end unless the noconstant option is used. 
Related commands are matrix glsaccum, which forms weighted cross products 
of the form X'BX, and matrix opaccum. 


The following code produces the same point estimates as regress 
ltotexp totchr: 


. * OLS estimator using matrix accumulation operators 
. matrix accum XTX = totchr             // Form X'X including constant 
(obs=100) 

. matrix vecaccum yTX = ltotexp totchr  // Form y'X including constant 

. matrix cols = invsym(XTX)*(yTX)' 

. matrix list cols 


cols[2,1] 
ltotexp 
totchr .13530976 
_cons 4.4687394 


A.1.7 OLS using Stata matrix commands 


The following example runs an OLS regression of ltotexp on an intercept 
and totchr, and it also reports the default OLS standard errors and associated 
t statistics. We use matrix accumulation commands so that large problems 
can be handled. The challenge in using these commands is to obtain 
$s^2 = \{1/(N-K)\}\sum_{i=1}^N (y_i - \widehat{y}_i)^2$ without having to form an N x 1 vector of predicted 
values. One way to do this is to use the result that for OLS 


$$\sum_{i=1}^N (y_i - \widehat{y}_i)^2 = y'y - \widehat{\beta}'X'X\widehat{\beta}.$$ 
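

This result follows from expanding the residual sum of squares and using 
the OLS normal equations, as the following short derivation shows. 

```latex
\begin{aligned}
\sum_{i=1}^N (y_i - \widehat{y}_i)^2
 &= (y - X\widehat{\beta})'(y - X\widehat{\beta}) \\
 &= y'y - 2\,\widehat{\beta}'X'y + \widehat{\beta}'X'X\widehat{\beta} \\
 &= y'y - \widehat{\beta}'X'X\widehat{\beta},
 \qquad \text{because } X'y = X'X\widehat{\beta}.
\end{aligned}
```

Only the cross products y'y and X'X and the coefficient vector are needed, 
all of which are available from the accumulation commands. 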


We have 


. * Illustrate Stata matrix commands: OLS with output 
. matrix accum XTX = totchr             // Form X'X including constant 
(obs=100) 

. matrix vecaccum yTX = ltotexp totchr  // Form y'X including constant 

. matrix b = invsym(XTX)*(yTX)' 

. matrix accum yTy = ltotexp, noconstant 
(obs=100) 

. scalar k = rowsof(XTX) 

. scalar n = _N 

. matrix s2 = (yTy - b'*XTX'*b)/(n-k) 

. matrix V = s2*invsym(XTX) 

. matrix list b 

b[2,1] 
          ltotexp 
totchr  .13530976 
 _cons  4.4687394 

. matrix list V 


symmetric V[2,2] 
totchr _cons 
totchr .01323021 
_cons -.0063505 .00979036 


This yields the same estimates of the coefficients and variance-covariance 
matrix of the estimator as listed in appendix A.1.2. 


We now want to obtain output formatted in the usual way with columns 
of regressor names, coefficient estimates, standard errors, and t statistics. 
This is not straightforward using Stata matrix commands. We wish to form 
the column vector t with the jth entry $t_j = b_j/s_j = b_j/\sqrt{V_{jj}}$. But Stata 
provides no facility for element-by-element division and also no easy way to 
take the element-by-element square root of a matrix. One fix is to first form 
a column vector, seinv, with the jth entry $1/s_j$ by creating a diagonal 
matrix with the entries $s_j^2$, taking the inverse of this matrix, taking the square 
root of this matrix (here with cholesky(), which for a diagonal matrix is the 
element-by-element square root), and forming a column vector with the 
resulting diagonal entries. Then, form the column vector t by using the 
Hadamard product of b and seinv, where for matrices A and B of the same 
dimension, C=hadamard(A,B) gives the matrix C with the ijth entry 
$C_{ij} = A_{ij} \times B_{ij}$. We obtain 


. * Stata matrix commands to compute standard errors and t statistics 
> given b and V 
. matrix se = (vecdiag(cholesky(diag(vecdiag(V)))))' 

. matrix seinv = (vecdiag(cholesky(invsym(diag(vecdiag(V))))))' 

. matrix t = hadamard(b,seinv) 

. matrix results = b, se, t 

. matrix colnames results = coeff sterror tratio 

. matrix list results, format(%7.0g) 

results[2,3] 
          coeff  sterror   tratio 
totchr   .13531   .11502   1.1764 
 _cons   4.4687   .09895   45.163 
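

An alternative to the cholesky() and hadamard() device is to loop over the 
elements explicitly. The following sketch is not the approach used above. 
It obtains b and V from regress rather than from the accumulation commands 
so that it can be run on its own, and it assumes the book dataset 
mus203mepsmedexp is available. 

```stata
* Sketch: standard errors and t statistics by looping over matrix elements
* Assumes mus203mepsmedexp.dta is available in the working directory
use mus203mepsmedexp, clear
keep if _n <= 100
quietly regress ltotexp totchr
matrix b = e(b)'              // column vector of coefficients
matrix V = e(V)

local k = rowsof(V)
matrix se = J(`k', 1, .)      // placeholders filled in by the loop
matrix t  = J(`k', 1, .)
forvalues j = 1/`k' {
    matrix se[`j',1] = sqrt(V[`j',`j'])
    matrix t[`j',1]  = b[`j',1]/se[`j',1]
}

matrix results = b, se, t
matrix colnames results = coeff sterror tratio
matrix list results, format(%7.0g)
```

The loop is longer but makes the element-by-element calculation explicit, 
which can be easier to adapt when other transformations are needed. 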


It is much easier to instead use Stata’s ereturn commands to produce 
this output, based on a row vector of coefficient estimates and an estimated 
variance matrix. The preceding code led to a column vector of coefficients, 
so we first need to transpose. We obtain 


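. * Easier is to use ereturn post and display given b and V 
. matrix brow = b' 

. ereturn post brow V 

. ereturn display 
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      totchr |   .1353098   .1150227     1.18   0.239    -.0901305      .36075
       _cons |   4.468739   .0989462    45.16   0.000     4.274808     4.66267
------------------------------------------------------------------------------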


Similar code for OLS that instead uses Mata functions is provided in 
section 3.9. 


A.2 Programs 


Do-files, ado-files, and program files are collections of Stata commands that 
are useful whenever the same analysis is to be repeated exactly or with 
relatively minor variation. For many analyses, a do-file that enacts Stata 
commands (which are themselves often ado-files written in Stata or Mata) is 
sufficient. 


More advanced analysis, however, may require actual programming in 
Stata. These programs can be defined and executed as a component of a do- 
file, or they can be converted to an ado-file to enable their being called by 
other programs. Useful references are [U] 18 Programming Stata and 
[P] program. 


A.2.1 Simple programs (no arguments or access to results) 


A program is defined by using the program define command followed by 
the program name. Subsequent lines give the program, which concludes with 
the line end. 


The simplest programs do not have any inputs, and the program output is 
simply displayed. The following program displays the current time and date. 


. * Program with no arguments 

. program define time 
1. display c(current_time) c(current_date) 
2. end 


The word define is optional in the above input; it is sufficient to simply 
type program time. 


The program is executed by typing the name of the program. We have 


. * Run the program 
. time 
12:29:29 1 Nov 2021 


Unlike the execution of a do-file, only the program results are listed, here the 
current time and date; the program commands that were executed are not 
listed. 


A.2.2 Modifying a program 


Stata does not allow one to redefine an existing program. So it is necessary 
to first remove any previous program with the same name, should such a 
program already exist. 


The program drop time command will drop the time program. If this 
program does not already exist, however, Stata will stop executing and 
generate an error message. The capture prefix ensures that Stata will 
continue to run, even if the time program does not already exist. 


Thus, the preferred way to define and then execute the time program is 


. * Drop program if it already exists, write program, and run 
. capture program drop time 


. program time 
1. display c(current_time) c(current_date) 
2. end 


. time 
12:29:29 1 Nov 2020 


The clear command does not drop programs, though clear all will. To 
specifically drop all programs, use the clear programs command or the 
program drop _all command. 


A.2.3 Programs with positional arguments 


More complicated programs have inputs called arguments. For example, the 
Stata regress command has as arguments the dependent variable and any 
regressor variables. Then execution of the command requires that one give 
both the command name and the command arguments, for example, regress 
y x1 x2. These arguments need to be passed into the program and then 
referred to appropriately within the program. 


Program arguments can be passed as positional arguments. The first 
argument is referred to by the local macro `1' within the program, the 
second by local macro `2', and so on. For example, for regress, the 
dependent variable, say, y, may be referred to internally as `1'. The 
quotation marks may differ from how they appear on this printed page. On most 
keyboards, the left quote is located in the upper left, and the right quote is 
located in the middle right. When viewed using a text editor, the single quotes 
appear correctly. But often when viewed in LaTeX documents, they 
misleadingly appear as ‘1’ rather than the correct `1'. 


We present a program to report the median of the difference between two 
variables, where the two variables need to be passed to the program. Using 
positional arguments, we have 


. * Program with two positional arguments 
. program meddiff 
  1. tempvar diff 
  2. generate `diff' = `1' - `2' 
  3. _pctile `diff', p(50) 
  4. display "Median difference = " r(r1) 
  5. end 


The program uses a temporary variable, diff, explained in the next section, 
to store the difference between the two variables. Several commands 
calculate the median. Here we use the _pctile command with the p(50) 
option. This command stores the resulting median in r(r1), which we then 
output by using the display command. 


We now run the meddiff program, using the same dataset and variables, 
ltotexp and totchr, as used in appendix A.1. We have 


. * Run the program with two arguments 
. meddiff ltotexp totchr 
Median difference = 4.2230513 


A.2.4 Temporary variables 


The meddiff program requires the computation of the intermediate variable
we have named diff. To ensure that this name does not conflict with the
names of variables elsewhere and that the variable is dropped as soon as the
program ends, we declare it with the tempvar command. The resulting
temporary variable is local to the program and is referred to in the same
left and right quotation marks as are used for local macros. Similarly, the
tempname command can be used to declare temporary scalars and matrices, and
the tempfile command can be used to declare temporary files.
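

As a rough illustration of the three declarations just described, the following do-file sketch uses tempvar, tempname, and tempfile inside one program. The program name tmpdemo, the toy dataset, and all macro names are hypothetical and chosen only for this example.

```stata
* Sketch: temporary variable, scalar, and file in one program
* (tmpdemo and the toy data are hypothetical)
clear
set obs 5
generate x = _n

capture program drop tmpdemo
program tmpdemo
    args v
    tempvar sq                 // temporary variable
    tempname total             // temporary scalar name
    tempfile copy              // temporary filename
    generate `sq' = `v'^2
    quietly summarize `sq'
    scalar `total' = r(sum)    // sum of squares held in a temporary scalar
    quietly save "`copy'"      // erased automatically when the program ends
    display "Sum of squares = " `total'
end

tmpdemo x
describe, short                // only x remains in the dataset
```

The temporary variable, scalar, and file all disappear once tmpdemo finishes, so nothing needs to be dropped by hand.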


A.2.5 Programs with named positional arguments 


It is much easier to read the program if it gives names to the positional
arguments `1', `2', .... To use named positional arguments, we first define
the arguments within the program in the order that they will appear in the
command. For example,


. * Program with two named positional arguments
. capture program drop meddiff
. program meddiff
  1.     args y x
  2.     tempvar diff
  3.     generate `diff' = `y' - `x'
  4.     _pctile `diff', p(50)
  5.     display "Median difference = " r(r1)
  6. end


. meddiff ltotexp totchr 
Median difference = 4.2230513 


As for temporary variables, the arguments are declared without quotes, but 
Stata stores arguments as local macros, so we need to use quotes to refer to 
the arguments. 


A.2.6 Storing and retrieving program results 


The preceding examples simply displayed results. Often, we want to store
program results for further data analysis. This can be done by storing the
results in r() and e(), introduced in section 1.6, or in s(). To do this, we
need to define the program to be of the relevant class and to return the
results to the named entries in r(), e(), or s().


For our example, we declare the program to be rclass, with just one
result that will be stored in r(medylx). We have


. * Program with results stored in r()
. capture program drop meddiff
. program meddiff, rclass
  1.     args y x
  2.     tempvar diff
  3.     generate `diff' = `y' - `x'
  4.     _pctile `diff', p(50)
  5.     return scalar medylx = r(r1)
  6. end


Executing the program produces no output; the results of executing the
program are instead stored in r(). To list the program results in r(), we
use the return list command, and to display the scalars in r(), we use the
display command.


. * Running the program does not immediately display the result 
. meddiff ltotexp totchr 


. return list 


scalars: 
r(medylx) = 4.223051309585571 


. display r(medylx) 
4.2230513 


An example of an eclass program, returning b and V for subsequent
analysis, is given in section 12.4.4.
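

Results in r() are overwritten by the next r-class command, so a common pattern is to copy them into a local macro or scalar straight away. The do-file sketch below illustrates this with the rclass version of meddiff. The toy data are hypothetical and replace the book's dataset only so that the example is self-contained.

```stata
* Sketch: reuse a stored result in later calculations (toy data)
clear
set obs 99
generate x = _n
generate y = 3*_n

capture program drop meddiff
program meddiff, rclass
    args y x
    tempvar diff
    generate `diff' = `y' - `x'
    _pctile `diff', p(50)
    return scalar medylx = r(r1)
end

meddiff y x
local med = r(medylx)       // copy before another command overwrites r()
quietly summarize x         // r() now holds the summarize results instead
display "Median difference = " `med'
display "Median difference relative to mean of x = " `med'/r(mean)
```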


A.2.7 Programs with arguments using standard Stata syntax 


Program arguments can be quite lengthy and can include optional arguments, 
but many commands use arguments in a standard format. In particular, if 
commands use the standard Stata syntax, then tools exist to parse the 
command, breaking down the long command into its various arguments. 


The full standard Stata syntax is 


command [varlist | namelist | anything] [if] [in] [using filename] [= exp]
    [weight] [, options]


Brackets denote optional items. For some commands, some of the items in 
brackets will be required to run that specific command. In the syntax for that 
specific command, these required items will not be surrounded by brackets. 


As an example, consider the command 


. regress ltotexp totchr if !missing(ltotexp) in 1/100, vce(robust) 


To enact this command, Stata interprets regress as command, ltotexp
totchr as varlist, if !missing(ltotexp) as if, in 1/100 as in, and
vce(robust) as an option. For the regress command, Stata needs to further
break down varlist, with the first variable being the dependent variable and
any remaining variables being regressors.


We now show how this is done. We write a program, myols, that
duplicates regress. Specifically, we want to be able to break the command


. myols ltotexp totchr if !missing(ltotexp) in 1/100, vce(robust) 


into its arguments and then execute regress with these arguments. 


To do so, we use the syntax and gettoken commands as illustrated in 
the following program. 


. * Program that uses Stata commands syntax and gettoken to parse arguments
. program myols
  1.     syntax varlist [if] [in] [,vce(string)]
  2.     gettoken y xvars : varlist
  3.     display "varlist contains: " "`varlist'"
  4.     display "and if contains: " "`if'"
  5.     display "and in contains: " "`in'"
  6.     display "and vce contains: " "`vce'"
  7.     display "and y contains: " "`y'"
  8.     display "& xvars contains: " "`xvars'"
  9.     regress `y' `xvars' `if' `in', `vce' noheader
 10. end


The syntax command lists required arguments, which here are a list of
variable names (varlist), and optional arguments, which here are an if
qualifier, an in range qualifier, and the vce() option with a string argument
for the specific option to be used ([,vce(string)]). The syntax command
will put in the local `varlist' macro any list of variable names that appears
after myols and before the if or in qualifiers; in the local `if' macro any if
qualifier; in the local `in' macro any in range qualifier; and in the local
`vce' macro the argument of any vce() option. The names in `varlist' are
space-separated tokens. The specific form of the gettoken command used here
puts the first token in `varlist' into the local `y' macro and the remaining
tokens into the local `xvars' macro.


The unnecessary display commands are included to demonstrate that
the parsing occurs as desired. Note that each macro reference is enclosed in
double quotes. For example, to display the name in the local `y' macro, we
use the display "`y'" command. If instead we use display `y', then we would
see the value of the variable named in the local `y' macro.


We then execute the myols program for an example. 


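. * Execute program myols for an example
. myols ltotexp totchr if !missing(ltotexp) in 1/100, vce(robust)
varlist contains: ltotexp totchr
and if contains: if !missing(ltotexp)
and in contains: in 1/100
and vce contains: robust
and y contains: ltotexp
& xvars contains: totchr
------------------------------------------------------------------------------
             |               Robust
     ltotexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      totchr |   .1353098   .1089083     1.24   0.217    -.0808151    .3514347
       _cons |   4.468739   .1089425    41.02   0.000     4.252547    4.684932
------------------------------------------------------------------------------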


The arguments of the myols command have been parsed successfully, 
leading to the expected output from regress. 
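

Programs that accept if and in qualifiers often convert them into a single indicator variable with the marksample command, rather than passing the `if' and `in' macros along as myols does. The following do-file sketch shows that alternative. The program name mymean, its detail option, and the toy data are hypothetical.

```stata
* Sketch: syntax combined with marksample (hypothetical program mymean)
clear
set obs 10
generate x = _n
generate y = 2*_n

capture program drop mymean
program mymean, rclass
    syntax varlist(min=1 max=1 numeric) [if] [in] [, Detail]
    marksample touse          // 1 if the observation satisfies if and in
    summarize `varlist' if `touse', `detail'
    return scalar mean = r(mean)
end

mymean x if y > 10 in 1/8, detail
display "Mean over selected observations = " r(mean)
```

Here only observations 6, 7, and 8 satisfy both qualifiers, so the reported mean is computed over those three values of x.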


A.2.8 Ado-files 


Some Stata commands, such as summarize, are built-in commands. But
many Stata commands are defined by an ado-file, which is a collection of
Stata commands. For example, the file logit.ado defines the logit
command for logit regression. Furthermore, Stata users can also define their
own Stata commands by using ado-files. We use many such
community-contributed commands throughout this book.


An ado-file is a program file similar to those already presented. But
because ado-files are intended for wider use, they are generally more
tightly written. Temporary variables, scalars, and matrices are used to
avoid potential name conflicts with the program calling the ado-file.
Variables may be generated in double precision. Care is given to the output
from the program, such as by using the quietly prefix to suppress the
unnecessary printing of intermediate results. Comments are provided, such as
the current version number and date. And there should be various checks to
ensure that the command is being correctly used (for example, if an input to
the program should be positive, then an error message is sent if this is
not the case).
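

An input check of the kind just mentioned might look like the following do-file sketch. The program name posroot is hypothetical, and return code 198 is used here on the assumption that a generic invalid-syntax code is acceptable for the example.

```stata
* Sketch: check an input before using it (hypothetical program posroot)
capture program drop posroot
program posroot, rclass
    version 17
    args a
    capture confirm number `a'
    if _rc {
        display as error "argument must be a number"
        exit 198
    }
    if `a' <= 0 {
        display as error "argument must be positive"
        exit 198
    }
    return scalar root = sqrt(`a')
    display "Square root = " sqrt(`a')
end

posroot 16
capture noisily posroot -1      // prints the error message but continues
display "return code = " _rc
```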


A good example of the development of an ado-file is given in
[U] 18.11 Ado-files. For an estimation command, see Gould, Pitblado, and
Poi (2010, chap. 10).


Here we provide a brief example, converting the meddiff program from 
earlier into an ado-file. Specifically, the meddiff.ado file comprises 


. capture program drop meddiff
. *! version 2.1.0 15aug2021
. program meddiff, rclass
  1.     version 17
  2.     args y x
  3.     tempvar diff
  4.     qui {
  5.         generate double `diff' = `y' - `x'
  6.         _pctile `diff', p(50)
  7.         return scalar medylx = r(r1)
  8.     }
  9.     display "Median of first variable - second variable = " r(r1)
 10. end


The program begins with the version and date. The program is written for
Stata 17. The quietly prefix suppresses output. For example, if `y' or `x'
has any missing values, then the generate statement will lead to a statement
that missing values were generated. This statement will be suppressed here.
The `diff' variable is in double precision for increased accuracy.


To execute the commands in meddiff.ado, we simply type meddiff with 
the appropriate arguments. For example, 


. * Execute program meddiff for an example 
. meddiff ltotexp totchr 
Median of first variable - second variable = 4.2230513 


The meddiff.ado file needs to be in a directory that Stata automatically
accesses. For a Microsoft Windows computer, these directories include
C:\ado, C:\Program Files\Stata17, and the current directory. See
[U] 17 Ado-files for further details.


A.3 Program debugging 


This section provides advice relevant to even the most basic uses of Stata. 


There are two challenges: to get the program to execute without 
stopping because of an error and to ensure that the program is doing what is 
intended once it is executing. 


We focus here on the first challenge. The simplest way to debug a 
program is to work with a simplified example and print out intermediate 
results. Stata also provides error messages and a trace facility to track every 
step of the execution of a program. 


The second challenge is easily ignored, but it should not be skipped. 
Come up with an example where there is a known result or a way to verify 
the result. For example, to test an estimation procedure, generate many 
observations from a known data-generating process, and see whether the 
estimation procedure yields the known data-generating process parameters; 
see section 5.6. Printing intermediate results is again very helpful. In 
particular, always use the summarize command to verify that you are 
working with the intended dataset. 
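

A check against a known data-generating process can be as short as the following do-file sketch. The sample size, seed, and true parameter values (intercept 1 and slope 2) are arbitrary choices made for illustration.

```stata
* Sketch: verify an estimator on data from a known DGP
clear
set seed 12345
set obs 10000
generate x = rnormal()
generate y = 1 + 2*x + rnormal()     // true intercept 1, true slope 2
summarize y x                        // confirm the intended dataset
regress y x, noheader
display "Slope estimate minus true value     = " _b[x] - 2
display "Intercept estimate minus true value = " _b[_cons] - 1
```

With a sample this large, both differences should be small relative to the reported standard errors.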


A.3.1 Some simple tips 


A simple way to debug Stata code is to display the intermediate output. For 
example, in the following listing, we can see whether the correct dimension 
matrices are obtained. If the program failed, we could look at the 
intermediate results before the failure to see where the failure occurs. 


. * Display intermediate output to aid debugging
. matrix accum XTX = totchr     // Recall constant is added
(obs=100)

. matrix list XTX               // Should be 2 x 2

symmetric XTX[2,2]
        totchr   _cons
totchr      74
 _cons      48     100

. matrix vecaccum yTX = ltotexp totchr
. matrix list yTX               // Should be 1 x 2

yTX[1,2]
            totchr      _cons
ltotexp  224.51242  453.36881

. matrix bOLS = invsym(XTX)*(yTX)'
. matrix list bOLS              // Should be 2 x 1

bOLS[2,1]
          ltotexp
totchr  .13530976
 _cons  4.4687394


Even when there seems to be no problem, if the program is still being 
debugged, it can be useful to comment out an extraneous command, such as 
matrix list, rather than to delete the command, in case there is reason to 
use it again later. 


Debugging can be quicker and simpler if one works with a simplified 
program. For example, rather than work with the full dataset and many 
regressors, one might initially work with a small subset of the data and a 
single regressor. This may also reduce the chance that problems are arising 
merely because of data problems, such as multicollinearity. 


To further save time, one may find it worthwhile to use /* and */ to 
comment out those parts of the program that are not needed during the 
debugging exercise. This is especially the case for computationally 
intensive tasks that are not necessary, such as graphs to be used in the final 
analysis but not needed during the program development stage. 


A.3.2 Error messages and return code 


Stata produces error messages. The message given can be brief, but a fuller 
explanation can be obtained from the manual or directly from Stata. 


For example, if we regress y on x but one or both of these variables do 
not exist, we get 


. regress y x


variable y not found 
r(111); 


For a more detailed explanation of the return code 111, type the command 


. search rc 111 


(output omitted ) 
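

Within a do-file or program, the return code can also be trapped and inspected rather than allowed to stop execution. The following do-file sketch uses the capture prefix and the _rc value for this purpose. The toy dataset is hypothetical and deliberately omits the variable y.

```stata
* Sketch: trap an error and inspect its return code
clear
set obs 5
generate x = _n
capture regress y x            // y does not exist; the error is trapped
display "return code = " _rc
if _rc == 111 {
    display "a variable was not found, so the regression was skipped"
}
```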


If a Stata program is being debugged, then program failure can lead to 
an error message that is not at all helpful. More useful error messages can 
be given if the code is not embedded in a program. Thus, rather than work 
with a program in the program environment, it can be helpful at first to 
work with the commands in a Stata do-file but not within a program. For 
example, a nonprogram version of the meddiff program is 


. * Debug an initial nonprogram version of a program
. tempvar y x diff
. generate `y' = ltotexp
. generate `x' = totchr
. generate double `diff' = `y' - `x'
. _pctile `diff', p(50)
. scalar medylx = r(r1)
. display "Median of first variable - second variable = " medylx
Median of first variable - second variable = 4.2230513


A.3.3 Trace 


The trace command traces the execution of a program. To initiate a trace, 
type the command 


. set trace on 


(output omitted ) 


To stop the trace, type the command 


. set trace off 


The trace facility can generate a large amount of output. Thus, it can be 
more useful to manually insert commands that give intermediate results. 
The default is set trace off. 
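

One way to keep the volume of trace output manageable is to limit how deeply nested programs are traced by using set tracedepth. The do-file sketch below traces only the top level of a small program. The program name hello is hypothetical.

```stata
* Sketch: trace a program while limiting the nesting depth
capture program drop hello
program hello
    display "hello"
    quietly summarize
end

set tracedepth 1       // do not trace inside commands that hello calls
set trace on
hello
set trace off
```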


A.4 Additional resources 


The [P] Stata Programming Reference Manual provides details. See also 
Baum (2016). 


Appendix B 
Mata 


Mata is a powerful matrix programming language comparable with Gauss, 
MATLAB, or R. Compared with the Stata matrix commands, it is 
computationally faster; supports larger matrices (Mata has no restriction on 
matrix size, so the only restriction is computer specific); has a wider range 
of matrix commands; and has commands that are closer in syntax to the 
matrix notation used in mathematics. 


Mata is a component of Stata that can be used on its own. Additionally, 
one can blend Stata and Mata functions. 


B.1 How to run Mata 


Mata commands are usually run in Mata, which is initiated by first giving 
the mata command in Stata. Single Mata commands can be given in Stata, 
and single Stata commands can be given in Mata. 


B.1.1 Mata commands in Mata 


Mata can be initiated by the Stata mata command. In Mata, the command
prompt is a colon (:) rather than a period. Mata commands are
separated by line breaks or by semicolons. To exit Mata and return to Stata,
use the Mata end command.


The following sample Mata session creates a 2 x 2 identity matrix, I,
and then displays the elements of matrix I.


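. * Sample Mata session
. mata
------------------------------------------------- mata (type end to exit) -----
: I = I(2)
: I
[symmetric]
       1   2
    +---------+
  1 |  1      |
  2 |  0   1  |
    +---------+
: end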

For symmetric matrices, such as the identity matrix, only the lower triangle 
is listed. Here the unlisted (1, 2) element equals the listed (2, 1) element, 
which is 0. 


B.1.2 Mata commands in Stata 


A single Mata command can be issued in Stata by adding the mata prefix 
before the Mata command. 


For example, to create a 2 x 2 identity matrix, I, and to display the
elements of I, type the commands


. * Mata commands issued from Stata
. mata: I = I(2)

. mata: I
[symmetric]
       1   2
    +---------+
  1 |  1      |
  2 |  0   1  |
    +---------+


B.1.3 Stata commands in Mata 


Mata commands are distinct from Stata commands. One can enact a Stata
command within a Mata program, however, by using the stata() function
within Mata.


For example, suppose we are in Mata and want to find the mean of the
ltotexp variable, which is in the Stata dataset currently in memory. In
Stata, we would type the summarize ltotexp command. In Mata, we use the
stata() function with the desired Stata command in double quotes as the
argument.


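. // Stata commands issued from Mata
. mata
------------------------------------------------- mata (type end to exit) -----
: stata("summarize ltotexp")

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     ltotexp |        100    4.533688    .8226942   1.098612   5.332719

: end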


B.1.4 Interactive versus batch use 


There are differences between what is possible in Mata interactive use and 
what is possible in a Mata program. For example, comments cannot be 
included in Mata in interactive use. 


B.1.5 Mata help 


We provide some basic Mata code in this appendix. The two-volume set of 
Mata manuals is very complete but does not provide as many data-oriented 
examples as appear in the other Stata manuals. 


The help command for Mata works at either Stata’s dot prompt or 
Mata’s colon prompt. 


If you know the name of the matrix command, operator, or function, then
type the help mata name command. For example, if you know that the det()
function takes the determinant of a matrix, then type the command


: help mata det 
(output omitted ) 


In this example, the command was typed in Mata, but exactly the same help 
command can be typed in Stata. 


If you do not know the specific name, then it is harder. For example, 
suppose we want to find help on the category matrix. Then no specific help 
entry is obtained after help mata matrix. However, 


: help m4 matrix 
(output omitted ) 


does work because [M-4] is the relevant section of the manuals for Mata.
More generally, the command is help m# name, but this requires knowing the
relevant section of the manuals.


Often, one must start with the help mata command and then selectively 
choose from the subsequent entries. 


B.2 Mata matrix commands 


We present the basics of creating matrices and of matrix operators and
functions. Explanatory comments begin with // because Mata does not
recognize comments beginning with *.


B.2.1 Mata matrix input 


Matrix input by hand 


Matrices can be input by hand. For example, consider a 2 x 3 matrix A with
the first row entries 1, 2, and 3 and the second row entries 4, 5, and 6. This
can be defined as follows:


: // Create a matrix
: A = (1,2,3 \ 4,5,6)


As with the matrix define command in Stata, a comma is used to separate
column entries, and a backslash is used to separate rows.


To see the matrix, simply type the matrix name:


: // List a matrix
: A
       1   2   3
    +-------------+
  1 |  1   2   3  |
  2 |  4   5   6  |
    +-------------+

Identity matrices, unit vectors, and matrices of constants 


An n x n identity matrix is created with I(n). For example,


: // Create a 2x2 identity matrix
: I = I(2)


A 1 x n row vector with 0s in all entries aside from the ith, which
equals 1, is created with e(i,n). For example,


: // Create a 1x5 unit row vector with 1 in second entry and 0s elsewhere
: e = e(2,5)


An r x c matrix of constants equal to the value v is created with
J(r,c,v). For example,


: // Create a 2x5 matrix with entry 3
: J = J(2,5,3)


Range operators create vectors with entries that increment by one for
each entry by using a..b for a row vector and a::b for a column vector. For
example,


: // Create a row vector with entries 8 to 15
: a = 8..15


creates a row vector with the entries 8, 9, ..., 15. 
For creation of other standard matrices, type help m4 standard. 


Matrix input from Stata data 


Matrices can be associated with variables in the current Stata dataset in 
memory by using the Mata st_view() function. 


For example, suppose the current Stata dataset includes the variables 
ltotexp, totchr, and cons. Then, 


: // Create Mata matrices from variables stored in Stata
: st_view(y=., ., "ltotexp")
: st_view(X=., ., ("totchr", "cons"))


associates the column vector, y, with the observations on the variable
ltotexp and a matrix, X, with the observations on the variables totchr and
cons.


A brief summary of the syntax follows for the second st_view()
function above. The first entry is X=. because this eliminates the need to
previously define the matrix X. If instead we had first entered simply X, we
would have received the error message <istmt>: 3499 X not found. The
second entry is a period, meaning that all the observations will be selected.
The argument could instead be a list of observations. The third entry is a row
vector selecting the particular variables, with variable names given in quotes
and commas separating the column entries in the row vector. If totchr and
cons were the 31st and 45th variables in the dataset, we could equally well
type st_view(X=., ., (31,45)).


The st_view() function creates a view of the Stata dataset that does not
require that the actual data be physically loaded into Mata, saving time and
memory. For example, to subsequently form the OLS estimator
$(X'X)^{-1}X'y$ in Mata, one need load only the $K \times K$ matrix
$(X'X)^{-1}$ and the $K \times 1$ matrix $X'y$, not the much larger
$N \times K$ matrix $X$.


The related st_data() function does actually load matrices, but this is
usually not necessary. As an example,


: // Create a Mata matrix from variables stored in Stata
: Xloaded = st_data(., ("totchr", "cons"))


creates a matrix, Xloaded, with the ith row the ith observation on the totchr
and cons variables.
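

The practical difference between a view and a copy can be seen by modifying each one. In the do-file sketch below, a change made through the view alters the Stata dataset, whereas a change to the matrix created by st_data() does not. The toy dataset and the Mata names V and D are hypothetical.

```stata
* Sketch: a view shares memory with the dataset; st_data() makes a copy
clear
set obs 3
generate x = _n

mata:
st_view(V=., ., "x")      // view onto the Stata variable x
D = st_data(., "x")       // separate copy of x held in Mata
V[1,1] = 100              // changes x in the Stata dataset
D[2,1] = 200              // changes only the Mata copy
end

list x                    // first value is now 100; second is still 2
```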


Matrix input from Stata matrix 


Mata matrices can be created from matrices created by Stata commands, 
using the Mata st_matrix() function. For example, 


: // Read Stata matrix (created in first line below) into Mata
: stata("matrix define B = I(2)")
: C = st_matrix("B")
: C
[symmetric]
       1   2
    +---------+
  1 |  1      |
  2 |  0   1  |
    +---------+


The st_matrix() function can also be used to transfer a Mata matrix to 
Stata; see appendix B.2.6. 


Stata interface functions 


Stata interface functions begin with st_ and link matrices and data in Mata
with those in Stata. Examples already given are st_view(), st_data(), and
st_matrix(). The st_addvar() and st_store() functions are presented in
appendix B.2.6. A summary is given in [M-4] Stata, and individual st_
functions are given in [M-5] Intro.


B.2.2 Mata matrix operators 


The arithmetic operators for conformable matrices are + to add, - to subtract,
* to multiply, and # for the Kronecker product. The multiplication operator
can also be used for multiplication by a scalar, for example, 2*A or A*2, and
scalar division is possible, for example, A/2. A scalar can be raised to a
scalar power, for example, a^b. The matrix -A is the negative of A.


A single apostrophe, ', gives the matrix transpose (or conjugate
transpose if the matrix is complex). To compute A'A, we can use A'A or
A'*A.


The Kronecker product of two matrices is given by A#B. If A is m x n
and B is r x s, then A#B is mr x ns.


Element-by-element operators 


Key arithmetic operators are the colon operators for element-by-element
operations. A leading example is element-by-element multiplication of two
matrices of the same dimension (the Hadamard product). Then C=A:*B has
an ijth element equal to the ijth element of A times the ijth element of B.


Element-by-element multiplication of a column vector and a matrix is
possible if they have the same number of rows. Similarly, element-by-element
multiplication of a row vector and a matrix is possible if they have
the same number of columns. For the column vector case, type


: // Element-by-element multiplication of matrix by column vector
: b = 2::3
: J = J(2,5,3)
: b:*J
       1   2   3   4   5
    +---------------------+
  1 |  6   6   6   6   6  |
  2 |  9   9   9   9   9  |
    +---------------------+


The column vector b has the entries 2 and 3, and the 2 x 5 matrix J has all 
entries equal to 3. The first row of matrix J is multiplied by 2 (the first entry 
in column vector b), and the second row of J is multiplied by 3 (the second 
entry in b). 


Let w be an $N \times 1$ column vector and X be an $N \times K$ matrix with
$i$th row $x_i'$. Then w:*X is the $N \times K$ matrix with the $i$th row
$w_i x_i'$, and (w:*X)'X is the $K \times K$ matrix equal to
$\sum_{i=1}^{N} w_i x_i x_i'$.

Other colon operators are available for division (:/), subtraction (:-),
power (:^), equality (:==), inequality (:!=), specific inequalities (such as
:>=), "and" (:&), and "or" (:|). These operators are a particular advantage
of a matrix programming language.
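

As a small numerical check of the weighted cross-product formula, the do-file sketch below computes the sum of w_i x_i x_i' in two ways, once with the colon operator and once with the three-argument form of cross(). The values of w and X are arbitrary and chosen only for illustration.

```stata
* Sketch: weighted cross product by colon operator and by cross()
mata:
w = (1 \ 2 \ 3)
X = (1, 1 \ 2, 1 \ 3, 1)
S1 = (w:*X)'X             // scale each row of X by w, then premultiply
S2 = cross(X, w, X)       // same quantity, X'diag(w)X
S1
mreldif(S1, S2)           // relative difference; 0 if the two agree
end
```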


Additional classes of operators are detailed in [M-2] Intro.


B.2.3 Mata functions


Standard matrix functions have arguments provided in parentheses, (). 


Scalar and matrix functions 


Some matrix commands produce scalars, for example,


: // Scalar functions of a matrix
: r = rows(A)
: r
  2


Commonly used examples include those for matrix determinant (det()),
rank (rank()), and trace (trace()). Statistical functions include mean().


Some matrix commands produce matrices by element-by-element 
transformation, for example, 


: // Matrix function that returns matrix by element-by-element transformation
: D = sqrt(A)
: D
                 1             2             3
    +-------------------------------------------+
  1 |            1   1.414213562   1.732050808  |
  2 |            2   2.236067977   2.449489743  |
    +-------------------------------------------+


Mathematical functions include absolute value (abs()), sign (sign()),
natural logarithm (ln()), exponentiation (exp()), log factorial
(lnfactorial()), modulus (mod()), and truncation to integer (trunc()).
Among the statistical functions are uniform draws (runiform()), standard
normal density (normalden()), and many other densities and cumulative
distribution functions.


Some matrix commands produce vectors and matrices by acting on the
whole matrix. A leading example is matrix inversion, discussed below. The
mean() function finds the mean of columns of a matrix, and corr() forms a
correlation matrix from a variance matrix.


Eigenvalues and eigenvectors of a square matrix can be obtained by 
using the Mata eigensystem() function, for example, 


: // Calculate eigenvalues and eigenvectors
: E = (1, 2 \ 4, 3)
: lamda = .
: eigvecs = .
: eigensystem(E, eigvecs, lamda)
: lamda
        1    2
    +-----------+
  1 |   5   -1  |
    +-----------+
: eigvecs
                 1              2
    +-----------------------------+
  1 |  -.447213595   -.707106781  |
  2 |  -.894427191    .707106781  |
    +-----------------------------+


The eigenvalues are in the row vector lamda, and the eigenvectors are the
corresponding columns of the square matrix eigvecs. The command
requires that lamda and eigvecs already exist, so we initialized them as
missing values.


Mata has many functions; see [M-4] Intro for an index and guide to 
functions. 


Matrix inversion 


There are several different matrix inversion functions. The cholinv()
function, the fastest, computes the inverse of a positive-definite symmetric
matrix. The invsym() function computes the inverse of a real symmetric
matrix; luinv() computes the inverse of a square matrix; qrinv() computes
the generalized inverse of a matrix; and pinv() computes the Moore–Penrose
pseudoinverse.


For the full-column rank matrix X, the matrix X'X is positive-definite
symmetric, so cholinv(X'X) is best. But this function will fail if X'X is not
precisely symmetric because of a rounding error in calculations. The
makesymmetric() function forms a symmetric matrix by copying elements
below the diagonal into the corresponding position above the diagonal, for
example,


: // Use of makesymmetric() before cholinv()
: F = 0.5*I(2)
: G = cholinv(makesymmetric(F'F))
: G
[symmetric]
       1   2
    +---------+
  1 |  4      |
  2 |  0   4  |
    +---------+

B.2.4 Mata cross products 


The matrix cross() function creates matrix cross products. For example,
cross(X,X) forms X'X, cross(X,Z) forms X'Z, and cross(X,w,Z) forms
X'diag(w)Z. For the data loaded earlier into X and y, the OLS estimator can
be computed as


: // Matrix cross product
: beta = (cholinv(cross(X,X)))*(cross(X,y))
: beta
                 1
    +---------------+
  1 |  .1353097647  |
  2 |  4.468739434  |
    +---------------+


These estimates equal those given in appendix A.1.2. 


The advantages of using cross(), rather than the arithmetic
multiplication operator, are faster computation and less memory use. Rows
with missing observations are dropped, whereas X'Z will produce missing
values in every entry whose computation involves a missing observation. And
cross(X,X) produces a symmetric result so that there is no longer a need to
use the makesymmetric() function before cholinv() or invsym().
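

The treatment of missing values can be illustrated with a small example. In the do-file sketch below, y has a missing value in its third row, so every entry of X'y is missing, whereas cross(X,y) omits that row and uses only the first two observations. The values of X and y are arbitrary.

```stata
* Sketch: cross() omits rows with missing values; X'y does not
mata:
X = (1, 1 \ 2, 1 \ 3, 1)
y = (1 \ 2 \ .)
X'y                       // both entries are missing
cross(X, y)               // computed from rows 1 and 2 only
end
```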


B.2.5 Mata matrix subscripts and combining matrices 


The (i,j)th entry in a matrix is denoted by [i,j]. For example, to set the
(1,2) entry in matrix A to equal the (1,1) entry, type the command


: // Matrix subscripts
: A[1,2] = A[1,1]


For a column vector, the ith entry can be denoted by [i,1] or, more simply,
by [i]. Similarly, for a row vector, the jth entry can be denoted by [1,j] or
simply by [j].


To add columns to a matrix, use the horizontal concatenation operator, a
comma. Thus, A,B adds the columns of B after the columns of A, assuming
the two matrices have the same number of rows. For example, type


: // Combining matrices: add columns
: M = A,A


To add or append rows to a matrix, use the vertical concatenation
operator, a backslash. Thus, A \ B adds the rows of B after the rows of A,
assuming the two matrices have the same number of columns. For example,
type


: // Combining matrices: add rows
: N = A\A
: N
       1   2   3
    +-------------+
  1 |  1   1   3  |
  2 |  4   5   6  |
  3 |  1   1   3  |
  4 |  4   5   6  |
    +-------------+


A submatrix can be extracted from a matrix by using list subscripts that
give as a first argument the rows being extracted and as a second argument
the columns being extracted. For example, to extract the submatrix formed
by rows 1-2 and columns 5-6 of the matrix M, we type


: // Form submatrix using list subscripts
: O = M[(1\2), (5::6)]
: O
       1   2
    +---------+
  1 |  1   3  |
  2 |  5   6  |
    +---------+

An alternative is to use range subscripts that give the subscripts for the
upper-left entry and the lower-right entry of the portion to be extracted.
Thus, type


: // Form submatrix using range subscripts
: P = M[|1,5 \ 2,6|]
: P
       1   2
    +---------+
  1 |  1   3  |
  2 |  5   6  |
    +---------+


Where both list and range subscripts can be used, range subscripts are
preferred because they execute more quickly. For more details, see
[M-2] Subscripts.


B.2.6 Transferring Mata data and matrices to Stata 


Mata functions beginning with st_ provide an interface with Stata. 


Creating Stata matrices from Mata matrices 


A Stata matrix can be created from a Mata matrix by using the Mata 
st_matrix() function. 


For example, to create a Stata matrix, Q, from the Mata matrix P and then 
list the Stata matrix, type 


: // Output Mata matrix to Stata
: st_matrix("Q", P)
: stata("matrix list Q")

Q[2,2]
    c1  c2
r1   1   3
r2   5   6


Section 3.9 provides an example where the parameter vector b and the
estimate of the variance–covariance matrix of the estimator are computed in
Mata and passed from Mata to Stata with st_matrix(), and then results are
posted and nicely displayed by using the Stata ereturn command.


Creating Stata data from a Mata vector 


The st_addvar() function adds a new variable to a Stata dataset, though it
creates only the name of the variable and not its values. The st_store()
function modifies the values of a variable currently in a Stata dataset. Thus,
to create a new variable in Stata and give that new variable values, we type
st_addvar() followed by st_store().


Recall that X is a matrix with the variables totchr and cons and that
beta is a column vector of OLS coefficients from the regression of ltotexp
on totchr and cons. We create the vector of fitted values, yhat, in Mata,
pass these to Stata as the ltotexphat variable, and use the summarize
command to check the results. We have


: // Output Mata matrix to Stata
: yhat = X*beta
: st_addvar("float", "ltotexphat")
  46
: st_store(., "ltotexphat", yhat)
: stata("summarize ltotexp ltotexphat")

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     ltotexp |        100    4.533688    .8226942   1.098612   5.332719
  ltotexphat |        100    4.533688    .0970792    4.46874   4.874669


As expected after OLS regression, the average of the fitted values equals 
the average of the dependent variable. 
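

The same two-step pattern can be tried on a toy dataset. In the do-file sketch below, st_addvar() returns the index of the new variable, and that index is passed directly to st_store(). The variable names x and xsq and the Mata names xcol and idx are hypothetical.

```stata
* Sketch: create a Stata variable from a Mata vector
clear
set obs 4
generate x = _n

mata:
xcol = st_data(., "x")                // copy x into Mata
idx = st_addvar("double", "xsq")      // add empty variable; keep its index
st_store(., idx, xcol:^2)             // fill the new variable with x squared
end

list x xsq
```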


B.3 Programming in Mata 


Detailed examples using Mata code are presented in sections 3.9, 16.2,
16.8, 30.3, and 30.4, and appendix C.2. These examples pass data from Stata
to Mata, calculate parameter estimates and estimates of the
variance–covariance matrix of the estimator in Mata, and pass these back
to Stata.


Here we present a very introductory treatment of programming in Mata.


B.3.1 Mata program


As an example, we create a Mata program, calcsum, that calculates the
column sum of a column vector. This example, based on the example in
[M-1] Ado, is purely illustrative because the Mata colsum() function does
this anyway.


The column vector x is obtained from a variable named varname in the 
Stata dataset currently in memory by using the st_view() function. The 
varname string is a program argument supplied when the program is called. 
The actual calculation of the column sum is done with the Mata colsum() 
function. The result is put in the real scalar resultissum, a second program 
argument. To apply the program to the ltotexp variable, we call the 
calcsum program with the "ltotexp" and sum arguments. The result is in 
sum, and to see the result, we simply type sum. We have 


. mata: 
------------------------- mata (type end to exit) ------------------------- 
: void calcsum(varname, resultissum) 
> { 
>     st_view(x=., ., varname) 
>     resultissum = colsum(x) 
> } 

: sum = . 

: calcsum("ltotexp", sum) 

: sum 
  453.3688121 

: end 


The result, 453.3688, is as expected because the sample mean of the 
100 observations on ltotexp was 4.533688 from output given in 
appendix B.2.6, and 100 x 4.533688 = 453.3688. 


B.3.2 Mata program with results output to Stata 


The preceding Mata program passes the result, resultissum, back to Mata. 
We next consider a variation that passes the result, renamed sum, to Stata. 


To transfer the result to Stata, we use the Mata st_numscalar() function 
and drop the second argument in the calcsum program because the result is 
no longer passed to Mata. Because the result is now in Stata, we need to use 
the Stata display command to display the result. We have 


. mata: 
------------------------- mata (type end to exit) ------------------------- 
: void function calcsum2(varname) 
> { 
>     st_view(x=., ., varname) 
>     st_numscalar("r(sum)", colsum(x)) 
> } 

: calcsum2("ltotexp") 

: stata("display r(sum)") 
453.36881 

: end 


B.3.3 Stata program that calls a Mata program 


The preceding two programs call the Mata program from within Mata. We 
now create a Stata program, varsum, that calls the Mata program calcsum2 
from Stata. 


The varsum program uses standard Stata syntax (see appendix A.2.7) 
rather than positional arguments. This syntax recognizes the argument in the 
call varsum ltotexp as a variable name that is placed in varlist. The Mata 
program calcsum2, already defined in the preceding section, is called with 
varname being the variable name in varlist. We have 


. program varsum 
  1.     version 17 
  2.     syntax varname 
  3.     mata: calcsum2("`varlist'") 
  4.     display r(sum) 
  5. end 

. varsum ltotexp 
453.36881 


B.3.4 Using Mata in ado-files 


The main construct for writing new commands in Stata is a Stata ado-file. 
When computation in Mata is convenient, the ado-file can include Mata code 
or call a Mata function. 


A Mata function defined in an ado-file requires compilation every time it 
is called. To save computer time, one can reuse compiled functions without 
the need for recompilation by using the mata mosave and mata mlib 
commands. For details, see [M-1] Ado, which presents the preceding column 
sum example in much more generality. 
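
As a rough illustration of the first approach, an ado-file can hold both the Stata program and the Mata function it calls. The following is a minimal sketch only: the file name varsum2.ado and the function name calcsum3() are hypothetical, and the Mata function simply repeats the column-sum logic of calcsum2 with declared types.

```stata
*! varsum2.ado: hypothetical ado-file with an embedded Mata function
program varsum2
    version 17
    syntax varname
    mata: calcsum3("`varlist'")
    display r(sum)
end

version 17
mata:
void calcsum3(string scalar varname)
{
    real colvector x

    st_view(x=., ., varname)            // view onto the Stata variable
    st_numscalar("r(sum)", colsum(x))   // pass the column sum back to Stata
}
end
```

With this layout, the Mata function is compiled when the ado-file is loaded. To store a compiled copy instead, one could define calcsum3() once in Mata and then type mata: mata mosave calcsum3(), dir(PERSONAL) replace, after which the Mata block can be removed from the ado-file.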


B.3.5 Declarations 


The examples in section 16.8 and appendix C include a Mata program with 
arguments to be passed to and from the Mata optimize() function. 


The code in these examples does not declare matrices and scalars ahead 
of their use. This makes coding easier but makes it more likely that errors 
may go undetected. For example, if an operation is expected to create a 
scalar but a matrix is the result, there may be no message to this effect. If 
instead we had previously declared the expected result to be a scalar, then an 
error would occur if a matrix was erroneously created. 


The following Mata code rewrites the optimize() function evaluator 
pmlegf2 program in appendix C.2.3 to declare the types of all program 
arguments and all other variables used in the program. 


void pmlegf2(real scalar todo, 
             real rowvector b, 
             real colvector y, 
             real matrix X, 
             real colvector lndensity, 
             real matrix g, 
             real matrix H) 
{ 
    real colvector Xb 
    real colvector mu 

    Xb = X*b' 
    mu = exp(Xb) 
    lndensity = -mu + y:*Xb - lnfactorial(y) 
    if (todo == 0) return 
    g = (y-mu):*X 
    if (todo == 1) return 
    H = - cross(X, mu, X) 
} 


The Mata command mata set matastrict on requires that declarations be 
provided. 
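
The protection that declarations offer can be seen in a small self-contained sketch. The function name sumsq() is hypothetical and is not used elsewhere in this book; the example only illustrates that a variable declared as real scalar is expected to hold a 1 x 1 result.

```stata
mata:
mata set matastrict on
real scalar sumsq(real colvector x)
{
    real scalar s       // declared: s must hold a 1 x 1 real

    s = x'*x            // inner product is 1 x 1, so the assignment is valid
    return(s)
}
sumsq((1 \ 2 \ 3))      // 1 + 4 + 9 = 14
mata set matastrict off
end
```

If the assignment were mistakenly written as s = x*x', the result would be a 3 x 3 matrix; because s is declared to be a real scalar, we would expect Mata to stop with an error rather than silently store the matrix.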


B.4 Additional resources 


The [M] Stata Mata Reference Manual provides all the Mata commands. 
Gould (2018) provides an exposition of Mata, and Baum (2016) includes 
coverage of Mata. 


Appendix C 
Optimization in Mata 


The Stata ml command is useful for objective functions of form 
$Q(\theta) = \sum_{i=1}^{N} q(y_i, x_i, \theta)$. Not all optimization 
problems fall into this class, however, in which case one needs to use the 
Mata moptimize() function or the more flexible Mata optimize() function. 


This appendix presents these Mata optimization functions, building on 
the optimization methods given in chapter 16 and Mata commands 
summarized in appendix B. 


Note that if the simpler Stata ml command can be used, then there is no 
benefit in using the moptimize() function because the ml command is just a 
front end to the moptimize() function. The moptimize() function in turn 
uses some features of the optimize() function. 


Applications in this appendix use the same data as in chapter 16. 


. * Read in data 
. use mus210mepsdocvisyoung, clear 
(A.C.Cameron & P.K.Trivedi (2022): Microeconometrics Using Stata, 2e) 


. qui keep if year02 == 1 


C.1 Mata moptimize() function 


The Mata moptimize() function is especially useful for models with 
$q(y_i, x_i, \theta)$ of single-index form $q(y_i, x_i'\theta)$ or of 
multi-index form $q(y_i, x_{1i}'\theta_1, \ldots, x_{Ji}'\theta_J)$. 


C.1.1 moptimize() evaluators lf, d, gf, and q 


An evaluator function is one that provides the formula to calculate the 
function being maximized and defines the vector that is the maximand. It 
can additionally provide the formula for first and second derivatives and 
define any data on a dependent variable and regressors that determine the 
value of the objective function $Q(\theta)$. 


There are several distinct types of evaluator functions used by the 
moptimize() function. For all evaluators, the first argument is M, which is a 
handle, initially obtained from moptimize_init(), that can then be passed 
as an argument to other moptimize() functions. 


Detailed examples of evaluators are given below. Most evaluators have 
syntax 


evaluator(M, todo, b, fv, S, H) 


where M is a handle, todo is a real scalar equal to 0, 1, or 2 depending on 
whether a gradient and Hessian are provided, b is the coefficient vector, fv 
defines the objective function, S is a gradient vector or matrix, and H is a 
Hessian matrix. 


C.1.2 moptimize() functions 


The moptimize() functions fall into four broad categories. Functions that 
define the optimization problem, such as the name of the evaluator and the 
iterative technique to be used, begin with moptimize_init. Functions that 
lead to optimization are moptimize() or moptimize_evaluate(). Functions 
that return results begin with moptimize_result. Finally, the 
moptimize_query() function lists optimization settings and results. 


A complete listing of these functions is given in [M-5] moptimize(). The 
following examples illustrate various moptimize() functions to perform 
Poisson regression of y on x and obtain coefficient estimates and an estimate 
of the associated variance—covariance matrix of the estimator. This simple 
example has only one dependent variable and only one index. 


C.1.3 moptimize() methods lf, lf0, lf1, and lf2 


The type lf evaluator defines the individual contributions $q_i(\theta)$ to 
the objective function and has syntax 


evaluator(M, b, fv) 


where b is the $1 \times K$ coefficient vector and fv is an $N \times 1$ 
column vector with each observation's contribution to the objective function 
$Q(b)$; $Q(b)$ is then the column sum of fv. 


The following program defines an lf evaluator for moptimize(). The 
first line, y1 = moptimize_util_depvar(M, 1), returns a column vector y1 
containing the values of the dependent variable, which is specified later by 
moptimize_init_depvar(M, 1, " "). The second line, Xb = 
moptimize_util_xb(M, b, 1), returns a column vector containing the 
values of the index $x_i'b$, where the regressors are defined by the 
moptimize_init_eq_indepvars(M, 1, " ") function below. The third line 
provides a column vector with the log density for each observation, that is, 
each observation's contribution to the objective function. 


. * moptimize() method lf: Program poissonlf gives lnf(y_i) 
. mata: 
------------------------- mata (type end to exit) ------------------------- 
: mata clear 

: function poissonlf(transmorphic M, real rowvector b, fv) 
> { 
>     y1 = moptimize_util_depvar(M, 1)     // N x 1 vector y 
>     Xb = moptimize_util_xb(M, b, 1)      // N x 1 vector Xb 
>     fv = -exp(Xb) :+ (y1:*Xb) :- lnfactorial(y1) 
> } 

: end 


Given that definition of the evaluator, we move to optimization. The first 
line is always needed and defines M as the handle (we could use some other 
name here). The second line specifies program poissonlf as the 
evaluator. The third line says that this is an lf evaluator. The fourth line 
defines the single dependent variable, and the fifth line defines the variables 
used to form the single index for this example. The moptimize(M) function 
initiates the optimization, and the moptimize_result_display(M) function 
reports the results with default standard errors. 


. * moptimize() method lf: Implement with default standard errors 
. mata: 
------------------------- mata (type end to exit) ------------------------- 
: M = moptimize_init() 

: moptimize_init_evaluator(M, &poissonlf()) 

: moptimize_init_evaluatortype(M, "lf") 

: moptimize_init_depvar(M, 1, "docvis") 

: moptimize_init_eq_indepvars(M, 1, "private chronic female income") 

: moptimize(M) 
initial:       f(p) = -33899.609 
alternative:   f(p) = -28031.767 
rescale:       f(p) = -24020.669 
Iteration 0:   f(p) = -24020.669 
Iteration 1:   f(p) = -24010.663 
Iteration 2:   f(p) = -18541.285 
Iteration 3:   f(p) = -18503.604 
Iteration 4:   f(p) = -18503.549 
Iteration 5:   f(p) = -18503.549 

: moptimize_result_display(M)     // Default standard errors 

                                                 Number of obs = 4,412 
------------------------------------------------------------------------------ 
      docvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986654    .027719    28.81   0.000     .7443372    .8529936 
     chronic |   1.091865   .0157985    69.11   0.000     1.060901     1.12283 
      female |   .4925481   .0160073    30.77   0.000     .4611744    .5239218 
      income |    .003557   .0002412    14.75   0.000     .0030844    .0040297 
       _cons |  -.2297263   .0287022    -8.00   0.000    -.2859816   -.1734711 
------------------------------------------------------------------------------ 

: end 


The results are identical to those obtained using the poisson command with 
default standard errors. 
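
Beyond the displayed table, the stored results can be retrieved for further calculation. The following is a short sketch that assumes the handle M from the preceding moptimize(M) call is still in Mata's memory; the variable names b, V, and se are our own choices for illustration.

```stata
mata:
// Assumes the handle M from the preceding moptimize(M) call is still in memory
b = moptimize_result_coefs(M)      // 1 x K row vector of estimates
V = moptimize_result_V(M)          // K x K variance-covariance matrix
se = sqrt(diagonal(V))'            // 1 x K row vector of standard errors
b \ se                             // estimates stacked above standard errors
moptimize_result_value(M)          // value of the objective function at the optimum
moptimize_result_converged(M)      // 1 if convergence was achieved
end
```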


The function moptimize_result_display(M, "robust") reports the 
results with heteroskedastic-robust standard errors. 


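. * moptimize() method lf: Implement with heteroskedastic-robust standard errors 
. mata: 
------------------------- mata (type end to exit) ------------------------- 
: moptimize_result_display(M, "robust")   // Heteroskedastic-robust 
>                                            standard errors 

                                                 Number of obs = 4,412 
------------------------------------------------------------------------------ 
             |               Robust 
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986654   .1090015     7.33   0.000     .5850265    1.012304 
     chronic |   1.091865   .0559951    19.50   0.000     .9821167    1.201614 
      female |   .4925481   .0585365     8.41   0.000     .3778187    .6072775 
      income |    .003557   .0010825     3.29   0.001     .0014354    .0056787 
       _cons |  -.2297263   .1108733    -2.07   0.038    -.4470339   -.0124188 
------------------------------------------------------------------------------ 

: end 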


To obtain cluster-robust standard errors, we first need to define the 
cluster identification variable, here age, using the 
moptimize_init_cluster(M, " ") function. The 
moptimize_result_display(M) function then reports cluster-robust 
standard errors. 


. * moptimize() method lf: Implement with cluster--robust standard errors 
. mata: 
------------------------- mata (type end to exit) ------------------------- 
: M = moptimize_init() 

: moptimize_init_evaluator(M, &poissonlf()) 

: moptimize_init_evaluatortype(M, "lf") 

: moptimize_init_depvar(M, 1, "docvis") 

: moptimize_init_eq_indepvars(M, 1, "private chronic female income") 

: moptimize_init_cluster(M, "age")     // Define the cluster variable 

: moptimize(M) 
initial:       f(p) = -33899.609 
alternative:   f(p) = -28031.767 
rescale:       f(p) = -24020.669 
Iteration 0:   f(p) = -24020.669 
Iteration 1:   f(p) = -24010.663 
Iteration 2:   f(p) = -18541.285 
Iteration 3:   f(p) = -18503.604 
Iteration 4:   f(p) = -18503.549 
Iteration 5:   f(p) = -18503.549 

: moptimize_result_display(M)     // Now gives cluster--robust standard errors 

                                                 Number of obs = 4,412 
                                   (Std. err. adjusted for 40 clusters in age) 
------------------------------------------------------------------------------ 
             |               Robust 
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986654   .1496492     5.34   0.000     .5053583    1.091972 
     chronic |   1.091865   .0603102    18.10   0.000     .9736593    1.210071 
      female |   .4925481   .0686028     7.18   0.000     .3580891     .627007 
      income |    .003557   .0011792     3.02   0.003     .0012458    .0058683 
       _cons |  -.2297263   .1453959    -1.58   0.114     -.514697    .0552443 
------------------------------------------------------------------------------ 

: end 
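
The three sets of moptimize() results can be cross-checked against the official poisson command. The following commands are a sketch of that check, assuming the same dataset and sample restriction are in memory.

```stata
* Cross-check of the moptimize() results against the poisson command,
* assuming the chapter 16 dataset is in memory with the sample already restricted
poisson docvis private chronic female income                     // default standard errors
poisson docvis private chronic female income, vce(robust)        // heteroskedastic-robust
poisson docvis private chronic female income, vce(cluster age)   // cluster-robust on age
```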


The type lf* evaluators additionally allow for specification of first and 
second derivatives at the individual observation level. The syntax is 


evaluator(M, todo, b, fv, S, H) 


where todo is a real scalar equal to 0, 1, or 2 (for, respectively, lf0, lf1, 
and lf2); b is the $1 \times K$ coefficient vector; fv is again an 
$N \times 1$ column vector with each observation's contribution to the 
objective function; S is an $N \times K$ matrix of scores, whose column sum 
gives the gradient vector; and H is a $K \times K$ Hessian matrix. 


We next consider an lf2 evaluator for moptimize() that provides an 
$N \times K$ matrix of first derivatives, where each row is for a single 
observation, and a $K \times K$ matrix of second derivatives of the objective 
function. This latter matrix requires summing over observations, which is 
done using the moptimize_util_matsum() function. 


. * moptimize() method lf2: Add first and second derivatives to poissonlf 
. mata: 
------------------------- mata (type end to exit) ------------------------- 
: mata clear 

: function poissonlf2(transmorphic M, real scalar todo, 
>                     real rowvector b, fv, S, H) 
> { 
>     y1 = moptimize_util_depvar(M, 1)     // N x 1 vector y 
>     Xb = moptimize_util_xb(M, b, 1)      // N x 1 vector Xb 
>     fv = -exp(Xb) :+ (y1:*Xb) :- lnfactorial(y1) 
>     if (todo>=1) { 
>         s1 = (y1:-exp(Xb)) 
>         S = s1 
>         if (todo==2) { 
>             h11 = -exp(Xb) 
>             H11 = moptimize_util_matsum(M, 1, 1, h11, 0) 
>             H = H11 
>         } 
>     } 
> } 

: end 
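
The derivatives coded in this evaluator are taken with respect to the index $x_i'b$ rather than with respect to b itself. The following sketch of the algebra rests on our reading that moptimize() applies the chain rule through the regressors of the single equation; the symbols $s_{1i}$ and $h_{11,i}$ simply mirror the program variables s1 and h11.

```latex
\begin{align*}
\ln f(y_i \mid x_i) &= -\exp(x_i'b) + y_i\, x_i'b - \ln y_i! \\
s_{1i} &= \frac{\partial \ln f(y_i \mid x_i)}{\partial (x_i'b)} = y_i - \exp(x_i'b) \\
h_{11,i} &= \frac{\partial^2 \ln f(y_i \mid x_i)}{\partial (x_i'b)^2} = -\exp(x_i'b) \\
\frac{\partial \ln f(y_i \mid x_i)}{\partial b'} &= s_{1i}\, x_i' , \qquad
H = \sum_{i=1}^{N} h_{11,i}\, x_i x_i'
\end{align*}
```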


Optimization then yields 


. * moptimize() method lf2: Implement with heteroskedastic-robust standard errors 
. mata: 
------------------------- mata (type end to exit) ------------------------- 
: M = moptimize_init() 

: moptimize_init_evaluator(M, &poissonlf2()) 

: moptimize_init_evaluatortype(M, "lf2") 

: moptimize_init_depvar(M, 1, "docvis") 

: moptimize_init_eq_indepvars(M, 1, "private chronic female income") 

: moptimize(M) 
initial:       f(p) = -33899.609 
alternative:   f(p) = -28031.767 
rescale:       f(p) = -24020.669 
Iteration 0:   f(p) = -24020.669 
Iteration 1:   f(p) = -18845.463 
Iteration 2:   f(p) = -18510.192 
Iteration 3:   f(p) = -18503.551 
Iteration 4:   f(p) = -18503.549 
Iteration 5:   f(p) = -18503.549 

: moptimize_result_display(M, "robust")   // Robust standard errors 

                                                 Number of obs = 4,412 
------------------------------------------------------------------------------ 
             |               Robust 
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986654   .1090015     7.33   0.000     .5850265    1.012304 
     chronic |   1.091865   .0559951    19.50   0.000     .9821167    1.201614 
      female |   .4925481   .0585365     8.41   0.000     .3778187    .6072775 
      income |    .003557   .0010825     3.29   0.001     .0014354    .0056787 
       _cons |  -.2297263   .1108733    -2.07   0.038    -.4470339   -.0124187 
------------------------------------------------------------------------------ 

: end 


C.1.4 moptimize() methods d0, d1, and d2 


The type d* evaluators directly define the objective function $Q(\theta)$, 
rather than the individual contributions $q_i(\theta)$. The syntax is 


evaluator(M, todo, b, fv, S, H) 


where todo is a real scalar equal to 0, 1, or 2 (for, respectively, d0, d1, and 
d2); b is the $1 \times K$ coefficient vector; fv is the scalar $Q(b)$; S is a 
$1 \times K$ gradient vector; and H is a $K \times K$ Hessian matrix. 


The following example provides a d2 evaluator. Now the first derivatives 
are with respect to the entire objective function, yielding a $1 \times K$ 
gradient vector (consistent with the syntax above) that is constructed from 
individual observations using the moptimize_util_vecsum() function. The 
$K \times K$ matrix of second derivatives is calculated in the same way as 
for the lf2 evaluator. 


. * moptimize() method d2: Program poissond2 
. mata: 
------------------------- mata (type end to exit) ------------------------- 
: mata clear 

: function poissond2(transmorphic M, real scalar todo, 
>                    real rowvector b, fv, g, H) 
> { 
>     y1 = moptimize_util_depvar(M, 1)     // N x 1 vector y1 
>     Xb = moptimize_util_xb(M, b, 1)      // N x 1 vector Xb 
>     fv = moptimize_util_sum(M, -exp(Xb) :+ (y1:*Xb) :- lnfactorial(y1)) 
>     if (todo>=1) { 
>         s1 = (y1:-exp(Xb)) 
>         g1 = moptimize_util_vecsum(M, 1, s1, fv) 
>         g = g1 
>         if (todo==2) { 
>             h11 = -exp(Xb) 
>             H11 = moptimize_util_matsum(M, 1, 1, h11, fv) 
>             H = H11 
>         } 
>     } 
> } 

: end 


Optimization then yields 


. * moptimize() method d2: Can implement only with default standard errors 
. mata: 
------------------------- mata (type end to exit) ------------------------- 
: M = moptimize_init() 

: moptimize_init_evaluator(M, &poissond2()) 

: moptimize_init_evaluatortype(M, "d2") 

: moptimize_init_depvar(M, 1, "docvis") 

: moptimize_init_eq_indepvars(M, 1, "private chronic female income") 

: moptimize(M) 
initial:       f(p) = -33899.609 
alternative:   f(p) = -28031.767 
rescale:       f(p) = -24020.669 
Iteration 0:   f(p) = -24020.669 
Iteration 1:   f(p) = -18845.463 
Iteration 2:   f(p) = -18510.192 
Iteration 3:   f(p) = -18503.551 
Iteration 4:   f(p) = -18503.549 
Iteration 5:   f(p) = -18503.549 

: moptimize_result_display(M) 

                                                 Number of obs = 4,412 
------------------------------------------------------------------------------ 
      docvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986654    .027719    28.81   0.000     .7443372    .8529936 
     chronic |   1.091865   .0157985    69.11   0.000     1.060901     1.12283 
      female |   .4925481   .0160073    30.77   0.000     .4611744    .5239218 
      income |    .003557   .0002412    14.75   0.000     .0030844    .0040297 
       _cons |  -.2297263   .0287022    -8.00   0.000    -.2859816   -.1734711 
------------------------------------------------------------------------------ 

: end 


The resulting optimization can compute only default standard errors 
because robust standard errors require the first derivative for each 
observation, whereas program poissond2 provided only the first derivative 
of the entire objective function. 


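. * moptimize() method d2: Cannot obtain heteroskedastic-robust standard errors 
. mata: 
------------------------- mata (type end to exit) ------------------------- 
: moptimize_result_display(M, "robust") 

                                                 Number of obs = 4,412 
------------------------------------------------------------------------------ 
             |               Robust 
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986654 
     chronic |   1.091865 
      female |   .4925481 
      income |    .003557 
       _cons |  -.2297263 
------------------------------------------------------------------------------ 

: end 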


The following code manually computes heteroskedastic—robust standard 
errors given the preceding parameter estimates. 


. * moptimize() method d2: Compute heteroskedastic-robust standard errors manually 
. capture generate cons = 1     // We need to add a constant here 

. mata: 
------------------------- mata (type end to exit) ------------------------- 
: b = moptimize_result_coefs(M)     // Estimates from previous moptimize() 

: st_view(X=., ., tokens("private chronic female income cons")) 

: st_view(y=., ., tokens("docvis")) 

: N = rows(X) 

: Xb = X*b' 

: mu = exp(Xb) 

: XmuXinv = cholinv(cross(X,mu,X)) 

: residsq = (y-mu):*(y-mu) 

: Vrobust = (N/(N-1))*XmuXinv*(cross(X,residsq,X))*XmuXinv 

: st_matrix("b",b)                  // Pass results from Mata to Stata 

: st_matrix("V",Vrobust)            // Pass results from Mata to Stata 

: end 


. matrix colnames b = private chronic female income cons 
. matrix colnames V = private chronic female income cons 
. matrix rownames V = private chronic female income cons 
. ereturn post b V 

. ereturn display 
------------------------------------------------------------------------------ 
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986654   .1090015     7.33   0.000     .5850265    1.012304 
     chronic |   1.091865   .0559951    19.50   0.000     .9821167    1.201614 
      female |   .4925481   .0585365     8.41   0.000     .3778187    .6072775 
      income |    .003557   .0010825     3.29   0.001     .0014354    .0056787 
        cons |  -.2297263   .1108733    -2.07   0.038    -.4470339   -.0124187 
------------------------------------------------------------------------------ 
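
For reference, the variance matrix computed in the preceding Mata code corresponds to the usual sandwich form for the Poisson MLE with a degrees-of-freedom scaling of $N/(N-1)$. The symbols $\hat{A}$ and $\hat{B}$ are our own shorthand and do not appear in the code, where they are computed as cross(X,mu,X) and cross(X,residsq,X), respectively.

```latex
\begin{align*}
\hat{\mu}_i &= \exp(x_i'\hat{\beta}) \\
\hat{A} &= \sum_{i=1}^{N} \hat{\mu}_i\, x_i x_i' , \qquad
\hat{B} = \sum_{i=1}^{N} (y_i - \hat{\mu}_i)^2\, x_i x_i' \\
\hat{V}_{\text{robust}}(\hat{\beta}) &= \frac{N}{N-1}\, \hat{A}^{-1} \hat{B}\, \hat{A}^{-1}
\end{align*}
```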


C.1.5 moptimize() methods gf0, gf1, and gf2 


The type gf* evaluators are a variation of type lf* evaluators that permit the 
objective function to be of form $\sum_i q_i(\theta)$, where $q_i(\cdot)$ is 
an $L \times 1$ vector rather than a scalar. This is especially useful for 
fitting panel-data models. The syntax is 


evaluator(M, todo, b, fv, S, H) 


where todo is a real scalar equal to 0, 1, or 2 (for, respectively, gf0, gf1, and 
gf2); b is the $1 \times K$ coefficient vector; fv is an $L \times 1$ vector, 
where $L$ is the number of independent elements; S is an $L \times K$ 
gradient vector; and H is a $K \times K$ Hessian matrix. 


For the current cross-sectional example, there is no need to use a gf* 
evaluator because $q_i(\cdot)$ is scalar. 


C.1.6 moptimize() methods q0 and q1 


The q0 and q1 evaluators are for an objective function of the quadratic form 
$h(\theta)'Wh(\theta)$, as is the case for minimum chi-squared estimators. In 
this case, the objective function is minimized. 


The type q evaluators are used in the special case of a quadratic objective 
function $Q(b) = r(b)'Wr(b)$ and have syntax 


evaluator(M, todo, b, r, S) 


where todo is a real scalar equal to 0 or 1 (for, respectively, q0 or q1); b is 
the $1 \times K$ coefficient vector; r is an $L \times 1$ vector, where $L$ is 
the number of independent elements; and S is an $L \times K$ gradient vector. 


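As an example, we consider the Poisson maximum-likelihood estimator 
(MLE), which has first-order conditions 
$\sum_i \{y_i - \exp(x_i'\beta)\}x_i = 0$. Solving these conditions is 
equivalent to minimizing the quadratic form $h(\beta)'h(\beta)$, where 
$h(\beta) = \sum_i \{y_i - \exp(x_i'\beta)\}x_i$, because the quadratic form 
attains its minimum value of zero exactly when the first-order conditions 
hold. Here $L = K$ and $W$ is the $K \times K$ identity matrix. 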


The q0 evaluator function for this example is 


. * moptimize() poisson q0: Poisson example 
. capture generate cons = 1     // We need to add a constant here 
. mata: 
------------------------- mata (type end to exit) ------------------------- 
: mata clear 

: function poissonq0(transmorphic M, real scalar todo, 
>                    real rowvector b, r, S) 
> { 
>     y = moptimize_util_depvar(M, 1)      // N x 1 vector y 
>     Xb = moptimize_util_xb(M, b, 1)      // N x 1 vector Xb 
>     st_view(X=., ., tokens("private chronic female income cons")) 
>     mu = exp(Xb)                         // N x 1 vector mu 
>     r = X'(y-mu)                         // K x 1 vector r 
> } 
note: argument todo unused. 
note: argument S unused. 

: end 


The moptimize_init_gnweightmatrix(M, W) function specifies the 
$L \times L$ fixed-weighting matrix $W$. If this is not specified, then it is 
assumed that $W = I$, the case here. Optimization yields 


. * moptimize() method q0: Implement with heteroskedastic-robust 
>   standard errors 
. mata: 
------------------------- mata (type end to exit) ------------------------- 
: M = moptimize_init() 

: moptimize_init_evaluatortype(M, "q0") 

: moptimize_init_evaluator(M, &poissonq0()) 

: moptimize_init_vcetype(M, "robust") 

: moptimize_init_depvar(M, 1, "docvis") 

: moptimize_init_eq_indepvars(M, 1, "private chronic female income") 

: moptimize_init_technique(M, "gn") 

: moptimize_init_conv_maxiter(M, 10) 

: moptimize(M) 
initial:       f(p) =  2.814e+11 
alternative:   f(p) =  1.867e+11 
rescale:       f(p) =  7.285e+10 
Iteration 0:   f(p) =  7.285e+10 
Iteration 1:   f(p) =  6.841e+08 
Iteration 2:   f(p) =   91618347 
Iteration 3:   f(p) =  24112.076 
Iteration 4:   f(p) =  .06168098 
Iteration 5:   f(p) =  3.938e-12 
Iteration 6:   f(p) =  5.332e-19 

: b = moptimize_result_coefs(M) 

: b 
                  1              2              3              4              5 
  1     .7986653786    1.091865108    .4925480693    .0035570127   -.2297263379 

: end 


The estimated coefficients are the same as those from poisson. Note that 
function poissonq0() had to include a constant in X, while 
moptimize_init_eq_indepvars() automatically included the constant. 


C.2 Mata optimize() function 


The Mata optimize() function can handle more general models than the 
moptimize() function. A simple Poisson example is given here. A more 
complex overidentified nonlinear generalized method of moments example 
is presented in section 16.8. 


C.2.1 optimize() d and gf evaluators 


Because y and x are used to denote dependent variables and regressors, the 
Mata documentation uses the generic notation that we want to compute the 
real row vector p that maximizes the scalar function $f(p)$. Note that p is a 
row vector, whereas in this book, we usually define vectors (such as 
$\beta$) to be column vectors. 


An evaluator function calculates the value of the objective function at 
values of the parameter vector. It may optionally calculate the gradient and 
the Hessian. 


There are two distinct types of evaluator functions used by Mata. 


A type d evaluator returns the value of the objective as the scalar 
$v = f(p)$. The minimal syntax is 


void evaluator(todo, p, v, g, H) 


where todo is a scalar, p is the row vector of parameters, v is the scalar 
function value, g is the gradient row vector $\partial f(p)/\partial p$, and H 
is the Hessian matrix $\partial^2 f(p)/\partial p\,\partial p'$. If todo equals 
zero, then numerical derivatives are used (method d0), and g and H need not 
be provided. If todo equals one, then g must be provided (method d1), and if 
todo equals two, then both g and H must be provided (method d2). 
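
To make the type d syntax concrete, the following is a minimal self-contained sketch of a d1 evaluator. The function name mydeval() and the toy objective $f(p) = \exp(-p^2 + p - 3)$ are illustrative assumptions and are not part of the application in this appendix; the objective has a single maximum at $p = 0.5$.

```stata
mata:
// Hypothetical d1 evaluator: f(p) = exp(-p^2 + p - 3), maximized at p = 0.5
void mydeval(todo, p, v, g, H)
{
    v = exp(-(p^2) + p - 3)                     // scalar objective value
    if (todo >= 1) {
        g = (-2*p + 1)*exp(-(p^2) + p - 3)      // analytical gradient (1 x 1 here)
    }
}
S = optimize_init()
optimize_init_evaluator(S, &mydeval())
optimize_init_evaluatortype(S, "d1")
optimize_init_params(S, 0)                      // starting value
p = optimize(S)
p                                               // should be close to .5
end
```

Because only the gradient is supplied, the Hessian needed by the default optimization technique is obtained numerically.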


A type gf evaluator is more suited to m-estimation problems, where we 
maximize $Q(\theta) = \sum_{i=1}^{N} q_i(\theta)$. Then it may be more 
convenient to provide an $N \times 1$ vector with the ith entry 
$q_i(\theta)$ rather than the scalar $Q(\theta)$. A type gf evaluator returns 
the column vector v, and $f(p)$ equals the sum of the entries in v. The 
minimal syntax for a type gf evaluator is 


void evaluator(todo, p, v, g, H) 


where todo is a scalar, p is a row vector of parameters, v is a column vector, 
g is now the gradient matrix $\partial v/\partial p$, and H is the Hessian 
matrix. If todo equals zero, then numerical derivatives are used (method 
gf0), and g and H need not be provided. If todo equals one, then g must be 
provided (method gf1), and if todo equals two, then both g and H must be 
provided (method gf2). 


Up to nine additional arguments can be provided in these evaluators, 
appearing after p and before v. In that case, these arguments and their 
relative positions need to be declared by using the 
optimize_init_argument() function, illustrated below. For regression 
with data in y and X, the arguments will include y and X. 
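
The role of an additional argument can be sketched with a small self-contained gf0 example. The function name ssqgf0() and the four data values are made up for illustration. Each entry of v is minus a squared deviation, so the sum of the entries is maximized at the sample mean of y, here 3.

```stata
mata:
// Hypothetical gf0 evaluator with one extra argument y placed between p and v
void ssqgf0(todo, p, y, v, g, H)
{
    v = -((y :- p):^2)              // N x 1 vector of observation-level contributions
}
y = (1 \ 2 \ 3 \ 6)                 // made-up data with sample mean 3
S = optimize_init()
optimize_init_evaluator(S, &ssqgf0())
optimize_init_evaluatortype(S, "gf0")
optimize_init_argument(S, 1, y)     // declare y as the first additional argument
optimize_init_params(S, 0)          // starting value
p = optimize(S)
p                                   // should be close to 3
end
```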


C.2.2 optimize() functions 


The optimize() functions fall into four broad categories, similar to those 
already discussed for the moptimize() function. First, functions that define 
the optimization problem, such as the name of the evaluator and the iterative 
technique to be used, begin with optimize_init. Second, functions that lead 
to optimization are optimize() or optimize_evaluate(). Third, functions 
that return results begin with optimize_result. Fourth, the 
optimize_query() function lists optimization settings and results. 


A complete listing of these functions and their syntaxes is given in 
[M-5] optimize(). The following example essentially uses the minimal set of 
optimize() functions to perform a (nonlinear) regression of y on x and to 
obtain coefficient estimates and an estimate of the associated 
variance-covariance matrix of the estimator. 


C.2.3 Poisson example 


We implement the Poisson MLE, using the Mata optimize() function method 
gf2. 


Evaluator program for Poisson MLE 


The key ingredient is the evaluator program, named pmlegf2(). Because the 
gf2 method is used, the evaluator program needs to evaluate a vector of log 
densities, named lndensity, an associated gradient matrix, named g, and the 
Hessian, named H. We name the parameter vector b. The dependent variable 
and the regressor matrix, named y and X, respectively, are two additional 
program arguments that will need to be declared by using the 
optimize_init_argument() function. 


For the Poisson MLE, from section 16.2.2, the column vector of log 
densities has the ith entry 
$\ln f(y_i|x_i) = -\exp(x_i'\beta) + x_i'\beta y_i - \ln y_i!$; the 
associated gradient matrix has the ith row 
$\{y_i - \exp(x_i'\beta)\}x_i'$; and the Hessian is the matrix 
$\sum_i -\exp(x_i'\beta)x_i x_i'$. A listing of the evaluator 
program follows: 


. * optimize() method gf2: Evaluator function 
. mata 
------------------------- mata (type end to exit) ------------------------- 
: void pmlegf2(todo, b, y, X, lndensity, g, H) 
> { 
>     Xb = X*b' 
>     mu = exp(Xb) 
>     lndensity = -mu + y:*Xb - lnfactorial(y) 
>     if (todo == 0) return 
>     g = (y-mu):*X 
>     if (todo == 1) return 
>     H = - cross(X, mu, X) 
> } 

: end 


A better version of this evaluator function that declares the types of all 
program arguments and other variables used in the program is given in 
appendix B.3.5. 


The optimize() function for Poisson MLE 


The complete Mata code has four components. First, define the evaluator, a 
repeat of the preceding code listing. Second, associate matrices y and X 
with Stata variables by using the st_view() function. Third, optimize, which 
at a minimum requires the seven optimize() functions given below. Fourth, 
construct and list the key results. 


. * optimize() method gf2: Implement with cluster--robust standard errors 
. mata 
------------------------- mata (type end to exit) ------------------------- 
: mata clear 

: void pmlegf2(todo, b, y, X, lndensity, g, H) 
> { 
>     Xb = X*b' 
>     mu = exp(Xb) 
>     lndensity = -mu + y:*Xb - lnfactorial(y) 
>     if (todo == 0) return 
>     g = (y-mu):*X 
>     if (todo == 1) return 
>     H = - cross(X, mu, X) 
> } 

: st_view(y=., ., "docvis") 

: st_view(X=., ., tokens("private chronic female income cons")) 

: S = optimize_init() 

: optimize_init_evaluator(S, &pmlegf2()) 

: optimize_init_evaluatortype(S, "gf2") 

: optimize_init_argument(S, 1, y) 

: optimize_init_argument(S, 2, X) 

: optimize_init_cluster(S, "age")   // Define cluster for cluster--robust 
>                                      standard errors 

: k = cols(X) 

: optimize_init_params(S, J(1,k,0)) 

: b = optimize(S) 
Iteration 0:   f(p) = -33899.609 
Iteration 1:   f(p) = -19668.697 
Iteration 2:   f(p) = -18585.609 
Iteration 3:   f(p) = -18503.779 
Iteration 4:   f(p) = -18503.549 
Iteration 5:   f(p) = -18503.549 

: Vbrob = optimize_result_V_robust(S) 

: serob = (sqrt(diagonal(Vbrob)))' 

: b \ serob 
                  1              2              3              4              5 
  1     .7986653788    1.091865108    .4925480693    .0035570127   -.2297263376 
  2     .1496492194    .0603102053    .0686027716    .0011792332    .1453958671 

: end 


The S = optimize_init() function initiates the optimization, and 
because S is used, the remaining functions have the first argument S. The 
next two optimize() functions state that the evaluator is named pmlegf2 
and that optimize() method gf2 is being used. The subsequent two 
optimize() functions indicate that the first additional argument after b in 
program pmlegf2 is y and that the second is X. The optimize_init_params() 
function provides starting values and is necessary. The b = optimize(S) 
function initiates the optimization. The remaining functions compute robust 
standard errors and print the results. The optimize_init_cluster() function 
defines clustering on age. If this line is omitted, then 
heteroskedastic-robust standard errors are reported instead. 


The parameter estimates and standard errors are the same as those from 
the Stata poisson command with the vce(cluster age) option (see 
section 16.5.3). Nicely displayed results can be obtained by using the 
st_matrix() function to pass b' and Vbrob from Mata to Stata and then by 
using the ereturn display command in Stata, exactly as in the 
section 16.2.3 example. 
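
One way to do this is sketched below, following the same steps used for the manual robust standard errors in appendix C.1.4. The sketch assumes that b and Vbrob from the preceding optimize() run are still in Mata's memory. Here b is passed as returned by optimize(), that is, as a row vector, which is the shape that ereturn post expects.

```stata
* Assumes b and Vbrob from the preceding optimize() run are still in Mata
mata: st_matrix("b", b)
mata: st_matrix("V", Vbrob)
matrix colnames b = private chronic female income cons
matrix colnames V = private chronic female income cons
matrix rownames V = private chronic female income cons
ereturn post b V
ereturn display
```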


C.3 Additional resources 


See the [M] Stata Mata Reference Manual for the moptimize() and 
optimize() functions. Baum (2016, chap. 14) presents a detailed 
generalized method of moments example using moptimize(). 


Glossary of abbreviations 


2SLS – two-stage least squares 
3SLS – three-stage least squares 
AFT – accelerated failure time 
AIC – Akaike information criterion 
AICC – Akaike information corrected criterion 
AIPW – augmented inverse-probability weighting 
AME – average marginal effect 
AR – Anderson–Rubin 
AR – autoregressive 
ARMA – autoregressive moving average 
ARUM – additive random-utility model 
ATE – average treatment effect 
ATET – average treatment effect on the treated 
AUC – area under the curve 
BC – bias-corrected 
BCa – bias-corrected accelerated 
BIC – Bayesian information criterion 
BLP – Berry–Levinsohn–Pakes 
CCE – common correlated estimator 
c.d.f. – cumulative distribution function 
CIF – cumulative incidence function 
CL – conditional logit 
CLR – conditional likelihood ratio 
CQR – conditional quantile regression 
CRE – correlated random effects 
CV – cross-validation 
DGP – data-generating process 
DIC – deviance information criterion 
DID – difference in differences 
DV – dummy variable 
DWH – Durbin–Wu–Hausman 
ERM – extended regression model 
ET – endogenous treatment 
FAQ – frequently asked questions 
FD – first difference 
FDP – false discovery proportion 
FDR – false discovery rate 
FE – fixed effects 
FGLS – feasible generalized least squares 
FMM – finite-mixture model 
FPC – finite-population correction 
FRD – fuzzy regression discontinuity 
FWER – familywise error rate 
GAM – generalized additive models 
GLM – generalized linear models 
GLS – generalized least squares 
GMM – generalized method of moments 
GS2SLS – generalized spatial two-stage least squares 
GSEM – generalized structural equation model 
GUI – graphical user interface 
HAC – heteroskedasticity- and autocorrelation-consistent 
HRS – Health and Retirement Study 
IIA – independence of irrelevant alternatives 
i.i.d. – independent and identically distributed 
IM – information matrix 
IPW – inverse-probability weighting 
IPW-RA – inverse probability with regression adjustment 
ITT – intention to treat 
IV – instrumental variables 
JIVE – jackknife instrumental-variables estimator 
LATE – local average treatment effect 
LEF – linear exponential family 
LIML – limited-information maximum likelihood 
LM – Lagrange multiplier 
LOOCV – leave-one-out cross-validation 
LPM – linear probability model 
LR – likelihood ratio 
LS – least squares 
LSDV – least-squares dummy variable 
MA – moving average 
MAR – missing at random 
MCAR – missing completely at random 
MCMC – Markov chain Monte Carlo 
MD – minimum distance 
ME – marginal effect 
MEM – marginal effect at mean 
MER – marginal effect at representative value 
MG – mean group 
MH – Metropolis–Hastings 
ML – maximum likelihood 
MLE – maximum likelihood estimator 
MLT – multilevel treatment 
MM – method of moments 
MNAR – missing not at random 
MNL – multinomial logit 
MNP – multinomial probit 
MSE – mean squared error 
MSL – maximum simulated likelihood 
MSS – model sum of squares 
MTE – marginal treatment effect 
NB – negative binomial 
NB1 – negative binomial variance linear in mean 
NB2 – negative binomial variance quadratic in mean 
NL – nested logit 
NLS – nonlinear least squares 
NNM – nearest-neighbor matching 
NR – Newton–Raphson 
NSW – National Supported Work 
OHIE – Oregon Health Insurance Experiment 
OHP – Oregon Health Program 
OLS – ordinary least squares 
PA – population averaged 
PFGLS – pooled feasible generalized least squares 
PH – proportional hazards 
PM – predictive mean 
POM – potential-outcome mean 
PSID – Panel Study of Income Dynamics 
PSM – propensity-score matching 
PSU – primary sampling unit 
QCR – quantile count regression 
QR – quantile regression 
QTE – quantile treatment effect 
RA – regression adjustment 
RCT – randomized control trials 
RD – regression discontinuity 
RE – random effects 
RIF – recentered influence function 
RMSE – root mean squared error 
ROC – receiver operator characteristics 
RPL – random-parameters logit 
RSS – residual sum of squares 
SAR – spatial autoregressive 
SARAR – autoregressive spatial autoregressive 
SEM – structural equation model 
SJ – Stata Journal 
SRD – sharp regression discontinuity 
STB – Stata Technical Bulletin 
SUR – seemingly unrelated regressions 
TE – treatment effect 
TSS – total sum of squares 
VCE – variance–covariance matrix of the estimator 
WLS – weighted least squares 
ZINB – zero-inflated negative binomial 
ZIP – zero-inflated Poisson 
ZTNB – zero-truncated negative binomial 
ZTP – zero-truncated Poisson 
Wald confidence intervals, 11.3.9 , 11.3.11 


with desired precision, 11.8.5, 11.8.5 


constraint command, 3.5.4 , 6.8.6 


constraints() option, 3.5.4 


continue command, 1.8.4 
control function estimator, 7.4.7 
standard errors, 13.3.11 , 13.3.11 
correlate command, ADL ; 8.3.9 
correlation coefficient, 3.5.1 
autocorrelations for panel data, 8.3.9 
intraclass, 8.3.10 
intracluster, 6.4.3 
count command, 2.4.3 
count-data models, 13.2.2 , 13.3.2 
clustered data, 13.9 , 13.9.5 
Poisson introduction, 13.2.2 , 13.3.2 
creturn command, 14.6 
critical values, 11.2 , 11.2.5 
chi-squared versus F, 11.2.2 
computation, 11.2 , 11.2.5 
standard normal versus t, 11.2.1 , 11.2.5 
used in Stata commands, 11.2.5 
cubic splines, 14.5.2 


D 
data frames, 2.5.3 
data management, 2 , 2.5.7 
appending datasets, 2.5.7 
collapsing and expanding data, 2.5.4 
common pitfalls, 2.3.9 
data frames, 2.5.3 
data types, 2.2 , 2.2.4 
demeaned variables, 2.4.7 


dictionary file, 2.3.8 
exporting data, 2.4.8 
importing data, 2.3.4 
imputing missing values, 2.4.6 


indicator variables, 2.4.7 , 2.4.7 


inputting data, 2.3 , 2.3.10 


interacted variables, 1.3.4 , 1.3.4 , 2.4.7 
labeling variables, 2.4.2 

long form, 2.5.5 , 8.10 , 8.10.4 
manipulating datasets, 2.5 
merging datasets, 2.5.6 , 2.5.6 
missing values, 2.4.5 

naming variables, 2.4.2 

ordering data, 2.5.1 

outputting data, 2.3.10 , 2.4.8 
preserving and restoring data, 2.5.2 
PSID example, 2.4.1 

saving data, 2.4.8 

selecting sample, 2.4.9 

special data formats, 2.3.4 


transforming data, 2.4.7 , 2.4.7 


wide form, 2.5.5 , 8.10 , 8.10.4 


data summary Scaniples: 23 
for cross-sectional data, 3.2 , 3.2. 


for panel data, 8.3 , 8.3. 5 


data transformations, 2.4.7 , 2.4.7 , 3.3 , 3.3 , 3.4.8 , 3.4.8 
regression in logs, 3.4.8 
retransformation, 3.4.8 
data-generating process, 5.1 
dataset description 
bootstrap example, 12.3.3 
cigarette consumption panel in wide form, 8.10.1 
cigarette sales long panel example, 9.5.1 
CPS few clusters example, 12.6.3 
Medical Expenditure Panel Survey 
doctor-visits young, 10.2.1 
drug expenditures for IV, 7.4.2 
for SUR example, 6.8.3 
medical expenditures, 3.2.1 
Medicine in Australia: 
Balancing Employment and Life doctor earnings for FMM example, 14. 
Balancing Employment and Life earnings for regression decomposition, 4.6.2 
NSW Panel Survey of Income Dynamics earnings diff-in-diff, 4.8.1 
Panel Study of Income Dynamics short panel wage data, 8.3.1 
PSID earnings data management example, 2.4.1 
the second National Health and Nutrition Examination Survey survey design data, 6.9.1 
Vietnam clustered village data, 6.4.1 
date() function, 2.4.10 
decode command, 2.4.7 
delimit command, 1.4.5 
delta method, 11.3.5 , 11.3.12 
describe command, 2.4.1 , 3.2.2 


descriptive statistics, 3.2.3 , 3.2.4 
destring command, 2.2.3 
deviance residual, 13.8.3 
deviance statistic, 13.8.3 
didregress command, 4.8.4 
difference in differences, 4.8 , 4.8.4 
average treatment effect on the treated, 4.8.4 
before–after comparison, 4.8.2 
OLS regression computation, 4.8.4 
parallel trends assumption, 4.8 
treatment–control comparison, 4.8.3 
difference in means test, 3.5.12 
display command, 1.5.1 
do command, 1.4.2 
do-files, 1.4 , 1.4.6 
template do-file, 1.11 
drawnorm command, 5.4.5 
drop command, 2.4.9 
dyad-robust standard errors, 6.4.5 , 6.4.5 
E 

e-class commands, 1.6.2 , A.2.6 
egen command, 2.4.7 
elasticities 


in linear regression model, 4.5.7 

in nonlinear models, 13.7.14 , 13.7.15 
encode command, 2.4.7 
endogeneity test, see specification tests 


endogenous regressors, see also instrumental variables 


definition, 7.3.1 
dynamic panel model, 9.4 , 9.4.10 
dynamic panel systems, 9.4.10 
fixed-effects model, 8.2.2 , 8.5.1 
Hausman test, 11.9.5 , 12.4.6 , 12.4.6 
instrumental-variables regression, 7.3.3 
panel IV, 9.2 , 9.4.10 
simulation example, 5.6.5 
eregress command, 7 17149, 19.1 
ereturn 
command, A.1.7 
display command, 3.9 
list command, 1.6.2 
error messages, 1.3.7 , A.3.2 
estat 
abond command, 9.4.6 
bootstrap command, 12.3.7 
commands, 3.5.3 , 11.5.2 
endogenous command, 7.4.6 
firststage command, 7.6.5 
gof command, 11.9.3 
hettest command, 3.7.4 , 6.3.3 
ic command, 13.8.2 
imtest command, 3.7.5 , 11.9.2 
lcmean command, 14.3.3 
lcprob command, 14.3.6 
overid command, 7.4.8 , 11.9.4 
ovtest command, 3.7.2 
sargan command, 9.4.6 
vce command, 11.3.2 
estimates 
selected command, 3.5.6 
stats command, 3.5 
store command, 3.5 
table command, 3.5.4 
estimation commands 
linear panel summary, 8.2.5 
nonlinear cross-sectional summary, 13.1 
estout command, 3.5.7 
eststo command, 3.5.7 
esttab command, 3.5.7 
etable command, 3.5.7 
etregress command, 7.4.10 
expand command, 2.5.4 
export command, 2.3.10 
F 
factor variables, 1.3.4 , 1.3.4 , 2.4.7 , 2.4.7, 3.5.8 , 3.5.9, 13.7.11 , 13.7.13 


factor-variable notation, 3.5.9 
marginal effects, 4.5.4 
feasible generalized least squares, see generalized least squares 
finite mixture models, 14 , 14.3.10 
computational method, 14.2.6 
definition, 14.2.2 
linear regression example, 14.2.3 , 14.2.3 
log-linear regression example, 14.3 , 14.3.10 
marginal effects, 14.3.5 
mixture of normals, 14.2.1 , 14.2.1 , 14.2.5 
model selection, 14.3.8 , 14.3.8 
modeling considerations, 14.2.4 
multimodality, 14.3.1 
predict component means, 14.3.3 
predicted class posterior probabilities, 14.3.7 
predicted class probabilities, 14.3.6 
test of coefficient equality, 14.3.9 
varying mixture probabilities, 14.3.10 
fixed effects, see clustered data; panel data 
floating-point data, 2.2.2 
fmm prefix, 14.2.5 


foreach command, 1.8.1 , 13.7.10 
format command, 2.4.3 

formats to display data, 2.2.4 , 2.4.3 
forvalues command, 1.8.2 , 5.3.5 , 8.3.9 
fp prefix, 14.4.2 

frame command, 2.5.3 

frmttable command, 3.5.7 


fsum command, 3.2.4 


G 
gam command, 14.5.3 
Gauss–Hermite quadrature, 5.5.1 , 13.9.3 
generalized additive model, 14.7 
generalized estimating equations, 13.9.2 , 13.9.2 
generalized least squares, 6 , 6.11 
cluster-robust variance matrix, 6.2.3 
definition, 6.2.2 
efficient estimation, 6.2.2 
for heteroskedastic errors, 6.2.1 , 6.3.4 
for SUR model, 6.8.1 , 6.8.6 
leading examples, 6.2.5 
robust standard errors, 6.2.3 , 6.2.4 
robust variance matrix, 6.2.3 
seemingly unrelated regressions, 6.8 , 6.8.7 
WLS estimator, 6.2.3 , 6.3.5 
working matrix, 6.2.3 
generalized linear models, 13.3.7 , 13.3.8 
definition, 13.3.7 
link function, 13.3.8 
Poisson example, 13.3.8 
generalized method of moments, see also instrumental variables 
definition, 13.3.9 , 13.3.11 
for two-step estimators, 13.3.11 , 13.3.11 
nonlinear example, 13.3.10 
panel Arellano-Bond estimator, 9.4.3 
generate command, 2.4.1 
gettoken command, A.2.7 


Gibbs sampler simulation example, 5.4.6 
GLM, see generalized linear models 
glm command, 13.3.8 
global command, 1.7.1 
global macros, 1.7.1 , 1.7.1 
gmm command, 13.3.10 
goodness-of-fit measures, 13.8.1 , 13.8.1 
graph 
box command, 2.6.2 
combine command, 2.6.1 , 3.2.8 , 6.3.3 
export command, 2.6.1 


matrix command, 2.6.7 
save command, 2.6.1 
twoway command, 2.6.1 
graphs, 2.5.7 , 2.6.7 
box-and-whisker plot, 2.6.2 
combining graphs, 2.6.1 , 6.3.3 
exporting, 2.6.1 
graph formats, 2.6.1 
graphing commands, 2.6.1 
histogram, 2.6.3 
kernel density plot, 2.6.4 , 2.6.4 , 3.2.8 , 11.2.3 
line plot, 2.6.5 
multiple scatterplots, 2.6.7 
nonparametric regression, 2.6.6 , 2.6.6 , 2.6.6 , 14.6.4 
of known density function, 11.2.3 
of predictive margins, 4.4.6 
outlier detection, 3.6.2 
panel-data plots, 8.3.5 

quantile plot, 15.3.1 

residual plots, 3.6.2 , 6.3.3 
scatterplot, 2.6.5 

Stata Graph Editor, 2.6.1 


grqreg command, 15.3.10 
H 
halton() Mata function, 5.5.4 


Halton sequences, 5.5.4 
Hammersley sequences, 5.5.4 
hausman command, 7.4.6 , 8.8.5 , 11.9.5 
Hausman test, see specification tests 
Hausman-Taylor estimator, 9.3 , 9.3.3 
help 

command, 1.2.3 

contents command, 1.2.3 

mata command, B.1.5 
heteroskedastic errors, feasible GLS, 6.2.1 
heteroskedasticity test, 3.7.4 , 6.3.3 , 15.3.8 
heteroskedastic-robust, see variance-covariance matrix 
hierarchical models, 6.7.7 , 6.7.7 
histogram command, 2.6.3 
hypothesis tests, 11 , 11.12 

asymptotic refinement, 12.2.3 , 12.5 , 12.6.5 


auxiliary regression ooa on, L 5. 3L 
Benjamini–Hochberg procedure, 11.6.4 
Bonferroni correction, 11.6.1 

bootstrap methods, 12 , 12.11 
chi-squared versus F, 11.2.2 , 11.2.5 
cross-equation restrictions, 6.8.5 

delta method, 11.3.5 , 11.3.12 

effect size, 11.8 

false discovery proportion, 11.6.4 

false discovery rate, 11.6.4 

familywise error rate, 11.6.1 , 11.6.1 
few clusters, 6.4.6 , 6.4.6 , 11.3.2 
Hochberg’s step-up procedure, 11.6.1 
Holm’s step-down procedure, 11.6.1 
invariance under transformation, 11.4.1 
invert to give confidence interval, 7.7.3 , 11.3.13 , 11.3.13 
Lagrange multiplier tests, 11.5 , 11.5.3 
likelihood-ratio tests, 11.4 , 11.4.4 
linear hypotheses, 11.3.1 , 11.3.1 

linear regression model, 3.5.5 
minimum effect size, 11.8.1 


minimum sample size, 11.8.1 
multiple outcomes, 11.6.3 
multiple testing, 11.6 , 11.6.5 
nonlinear hypotheses, 11.3.5 , 11. 
on difference in means, 3.5.1 

on population mean, 3.2.7 , 3.5.12 
one-sided tests, 11.3.4 

percentile-t method, 12.5.1 , 12.6.5 
permutation tests, 11.10 , 11.10 


mo test, 11.7.3 
power in more than one direction, 3.7.6 , 3.7.6 
power with clustering, 11.8.4 , 11.8.4 
pretest bias, 11.3.8 

randomization tests, 11.10 , 12.6.1 

score tests, 11.5 , 11.5.3 

Sidak correction, 11.6.1 

size, 5.6.2 , 11.7.2 , 11.7.3 

standard normal versus t, 11.2.1 , 11.2.5 
subgroup analysis, 11.6.2 , 11.6.2 

tests at the boundary, 11.4.4 

Wald test examples, 11.3.3 , 11.3.3 , 11.3.6 
Wald tests, 11.3 , 11.3.13 

with weak instruments, 7.7 , 7.8 


I 

idcluster() option, 12.3.5 
if qualifier, 1.3.1 

import command, 2.3.4 

in qualifier, 1.3.1 

indicator variables, 2.4.7 , 2.4.7 


infile command, 2.3.6 

infix command, 2.3.7 

influential observations, 3.6.3 , 6.4.6 , 7.8 
information criteria, 13.8.2 , 13.8.2 , 14.3.8 
information matrix test, see specification tests 


input command, 2.3.3 


instrumental variables, 7 , 7.12 
2SLS bias, 7.5.7 
2SLS estimator, 7.3.3 , 7.4.1, 7.5 , 7.8, 9.4.3 
2SLS example, 7.4.4 
3SLS estimator, 7.10 , 7.10 
3SLS example, 7.10 
alternative estimators, 7.9 , 7.9.4 
Anderson–Rubin test, 7.7.2 
basic theory, 7.3.1 , 7.3.6 
binary endogenous regressor, 7.4.10 , 7.4.10 
bootstrap, 7.8 
concentration parameter, 7.5.7 
conditional LR tests, 7.7.4 , 7.7.4 
control function estimator, 7.4.7 , 13.3.11 , 13.3.11 
Cragg–Donald minimum eigenvalue statistic, 7.6.3 
effective F statistic, 7.6.4 
example, 7.4 , 7.4.11 
finite-sample properties, 7.5.2 , 7.5.8 , 7.8 , 7.8 
first-stage equation, 7.3.2 

first-stage F statistic, 7.5.6 , 7.5.6 , 7.6.6 
GMM estimator, 7.3.3 

Hansen test, 7.4.8 

inconsistency of OLS, 7.3.1 

IV estimator, 7.3.3 

jackknife IV estimator, 7.9.2 
just-identified model, 7.3.3 , 7.4.4 , 7.6.6 
k-class estimator, 7.9.1 

LATE, see LATE 

minimum distance-based tests, 7.7.5 , 7.8 
nonlinear IV, 13.3.10 

optimal GMM estimator, 7.3.3 , 7.4.1 


optimal GMM example, 7.4.5 
overidentified model, 7.3.3 , 7.4.5 , 7.6.8 
panel IV estimator, 9.2 , 9.4.10 

panel systems IV estimator, 9.4.10 
partial R2, 7.6.2 , 7.6.6 


reduced form, 7.2.2 
relative asymptotic bias test, 7.6.4 , 7.6.7 
relative bias test, 7.6.4 
robust standard errors, 7.3.6 
Sargan test, 7.4.8 , 9.4.6 
sensitivity to instrument choice, 7.6.11 
simple example, 7.3.1 
simultaneous equations model, 7.2 , 7.2.5 
structural equation, 7.3.2 
structural model approach, 7.2.1 , 7.2.5 
test of endogeneity, 7.4.6 , 7.4.7 
test of overidentifying restrictions, 7.4.8 , 7.4.8 
test of underidentification, 7.6.3 
test size distortion, 7.5.8 
two-sample 2SLS estimator, 7.9.4 
underidentified model, 7.3.3 
valid instruments, 7.3.4 
Wald size-distortion test, 7.6.4 
weak instruments, 7.3.5 , 7.3.5 , 7.5, 7.8 
asymptotics, 7.7.4 , 7.7.4 
diagnostics, 7.6 , 7.6.2 
inference, 7.7 , 7.8 
simulation, 7.5.3 , 7.5.5 
tests, 7.6.4 , 7.6.10 
wild bootstrap for IV, 12.6.5 , 12.6.5 
integral computation, 5.5 , 5.5.4 
interacted regressors, 3.5.8 , 3.5.9 , 13.7.1 


interacted variables, 1.3.4 , 1.3.4 , 2.4.7 
interactive use of Mata, B.1.4 
interactive use of Stata, ie ee al 
intraclass correlation, 8.3.10 
intracluster correlation coefficient, 6.4.3 
inverse-probability weighting, 3.8.3 
invnormal() function, 5.2.2 
ipolate command, 2.4.6 
iqreg command, 15.2.4 
iterative methods 

evaluator types, C.1.1 , C.2.1 
linear program simplex method, 15.2.2 
Mata optimization, C.1 
moptimize() Mata function, C.1 , C.1.6 
optimization in Mata, C.1 , C.3 
optimize() Mata function, C.2 , C.3 
Poisson example in Mata, C.2.3 , C.2.3 
ivreg2 command, 7.6.9 
ivregress command, 7.4.1 


J 

jackknife, see variance-covariance matrix 
jackknife IV estimator, 7.9.2 

jackknife prefix, 12.9.2 

javacall command, 1.9 


jive command, 7.9.2 


K 
k-class estimator, 7.9.1 
kdensity command, 2.6.4 , 3.2. 


keep command, 2.4.9 
kernel density plot, 2.6.4 , 2.6.4 


knnreg command, 14.6.2 


kurtosis measure, 3.2.4 


L 
label command, 2.4.2 
Lagrange multiplier test, see hypothesis tests 
LATE, 7.6.11 
estimator, 7.4.11 
latent class models, see finite mixture models 
least-absolute-deviations regression, 3.6.5 , 3.6.6 , 15.2.2 
least-squares dummy-variable estimator, 6.6.4, 8.5.4 , 8.5.4 
LEF, see linear exponential family 
leverage, 6.4.6 , 7.8 
likelihood-ratio test, see hypothesis tests 
LIML estimator, 7.9.1 
lincom command, 11.3.10 


linear exponential family, 13.3.1 
definition, 13.3.7 
examples, 13.3.1 
Poisson example, 13.3.7 
linear mixed models, 6.7 , 6.7.8 , 8.2.2 
cluster-robust standard errors, 6.7.5 , 6.7.6 
linear probability model, 10.4.7 
linear regression model, 3.4 , 3.11 , 4 , 4.10 
basic results, 3.4.1 , 3.4.7 
bootstrap variance ern 347. 35l SAL 
Box-Cox transformation, 3.7.3 
cluster FE example, 6.6 , 6.6.6 
cluster FGLS example, 6.5 , 6.5.5 
cluster OLS example, 6.4 , 6.4.6 
cluster PA example, 6.5.2 , 6.5.2 
cluster RE example, 6.5.3 , 6.5.5 
cluster-robust variance, 3.4.6 , 3.5.10 


constraints on parameters, 3.5.4 , 6.8.6 


decomposition analysis, 4.6 , 4.7.1 
default standard errors, 3.4.4 
difference in differences, 4.8 , 4.8.4 
elasticities, 4.5.7 

endogenous regressors, 5.6.5 , 7.1 , 7.12 
for one-sample t test, 3.5.12 

for two-sample t test, 3.5.12 
generalized least squares, 6 , 6.11 
heteroskedastic errors FGLS, 6.3.4 
heteroskedastic-robust variance, 3.4.5 
hypothesis tests, 3.5.5 

influential observations, 3.6.3 


instrumental variables, 7.1 , 7.12 


log versus linear regression, 3.4.8 
marginal effects, 4.5 , 4.5.8 
Mata code for OLS, 3.9 , 3.9 
measurement-error example, 5.6.4 
omnibus test, 3.7.5 

panel-data basics, 8 , 8.12 


panel-data extensions, 9 , 9.7 
prediction, 4.2 , 4.4.8 

prediction in logs, 4.2.3 

prediction of actual value, 4.3.1 
prediction of conditional mean, 4.3.1 
prediction with binary regressor, 4.2.4 
regression in logs, 3.4.8 

residual analysis, 3.6.2 , 3.6.2 


retransformation problem, 4.2.3 , 4.2.3 
sampling weights, 3.8 , 3.8.4 
simulations to verify properties, 5.6 , 5.6.5 
specification analysis, 3.6 , 3.6.6 
specification tests, 3.7 , 3.7.6 
standard error estimation, 3.4.4 , 3.4.7 
survey data, 6.9 , 6.9.3 
systems of equations, 6.8 , 6.8.7 
test against log-linear, 3.7.3 
test interpretation, 3.7.6 , 3.7.6 
test of heteroskedasticity, 3.7.4 , 6.3.3 
test of normality, 3.7.5 
test of omitted variables, 3.7.1 
transform dependent variable, 3.3 , 3.3 
two-stage least squares, 7.3.3 
variance matrix estimation, 3.4.4 , 3.4.7 
weighted marginal effects, 3.8.4 
weighted prediction, 3.8.4 
weighted regression, 3.8.3 , 6.9.3 

link function, 13.3.8 

linktest command, 3.7.2 


list command, 2.3.3 , 2.4. 


local command, 1.7.2 
local constant regression, 2.6.6 , 14.6.3 


local linear regression, 2.6.6 , 14.6.3 


local polynomial regression, 2.6.6 , 14.6.3 , 14.6.4 
log command, 1.4.3 
log file, 1.4.3 


logical operators, 1.3.6 
logit command, 10.5 
logit model 
brief summary, 10.2.2 
log-linear regression, 3.4.8 
retransformation problem, 4.2.3 
test against linear, 3.7.3 
lognormal data 
linear regression model, 3.4.8 
loneway command, 6.4.3 
long command lines, 1.4.5 
long-form data, 2.5.5 , 8.10 , 8.10.4 
longitudinal data, see panel data 
looping commands, 1.8 , 1.8.4 
foreach, 1.8.1 
forvalues, 1.8.2 
while, 1.8.3 
lowess command, 2.6.6 , 14.6.5 
lowess regression, 2.6.6 , 14.6.5 
lpoly command, 2.6.6 , 14.6.4 
lrtest command, 11.4.2 


M 

macros, 1.7 , 1.7.3 
compared with scalars, 1.7.3 
global, 1.7.1 , 1.7.1 


local, 1.7.2 , 1.7.2 
calculus method, 4.5.1 , 10.4.1 , 13.7.1 
calculus versus finite difference, 13.7.5 
elasticities, 4.5.7 , 13.7.14 , 13.7.15 
factor-variable notation, 3.5.9 , 4.5.4 , 13.7.11 , 13.7.13 


finite-difference method, 4.5.1 , 10.4.1 , 13.7.1 


in linear regression model, 4.5 , 4.5.8 
in log-linear model, 3.4.8 
in nonlinear models, 10.4 , 10.4.7 , 13.7, 13.7.16 
in treatment-effects models, 10.4.3 
interacted regressors, 3.5.9 , 3.5.9 , 4.5.4 
introduction for AME, 3.5.9 
manual computation, 13.7.10 , 13.7.10 
polynomial regressors, 13.7.11 
population average marginal effects, 13.7.9 
predictive margins, 4.4 , 4.4.8 , 13.6 , 13.6.3 
semielasticities, 4.5.8 
summary of types, 4.5.2 , 10.4.2 , 13.7.2 
weighted, 3.8.4 
which to use, 13.7.9 

marginal treatment effects, 13.7.16 
user-provided quantity, 13.7.8 


margins command, 4.4.2 , 10.4.4 , 13.6.1 , 13.7.5 
margins, 
contrast command, 4.4.8 


dydx() command, 3.5.9 , 4.5.3 , 13.7.4 , 13.7.5 
expression() command, 13.6.1 
pwcompare command, 4.4.7 
marginsplot command, 4.4.6 
Markov chain Monte Carlo methods 
draws example, 5.4.6 , 5.4.6 
Mata, 1.9 , B , B.3.5 , C , C.3 
ado-files using Mata, B.3.4 
colon operators, B.2.2 
combining matrices, B.2.5 
commands in Mata, B.1.1 
commands in Stata, B.1.2 
declaration of program arguments, B.3.5 
element-by-element operators, B.2.2 
Gibbs sampler example, 5.4.6 
help command, B.1.5 
identity matrix, B.2.1 
matrix 


cross products, B.2.4 
functions, B.2.3 
input, B.2.1 
inversion, B.2.3 
of constants, B.2.1 
operators, B.2.2 
subscripts, B.2.5 
OLS regression example, 3.9 , 3.9 
optimization functions, C.1 , C.3 
optimization in Mata, C.1 
overview, B 
program example, B.3.1 
programming in Mata, B.3 , B.3.5 
Stata commands in Mata, B.1.3 
Stata interface functions, B.2.1 
Stata matrix from Mata matrix, B.2.6 
Stata variables from Mata matrix, B.2.6 
matrices in Mata, see Mata 
matrices in Stata, A.1 , A.1.7 
combining matrices, A.1.3 
matrix 
cross products, A.1.6 
functions, A.1.5 
input, A.1.2 
operators, A.1.4 
subscripts, A.1.3 
OLS example, A.1.7 
overview, A.1.1 
matrix 
accum command, A.1.6 


command, 1.5.2 , A.1 , A.1.7 
define command, A.1.2 
rownames command, A.1.2 
vecaccum command, A.1.6 
matrix algebra definitions, 3.4.2 
maximum likelihood, 13.3.1 , 13.3.3 


definition of MLE, 13.3.1 


misspecified density, 13.3.1 
pseudo-MLE, 13.3.1 
quasi-MLE, 13.3.1 
mean group estimator, 9.5.8 
measurement-error example, 5.6.4 
median regression, 3.6.5 , 3.6.6 , 15.2.2 
MEM, see marginal effects, at the mean 
MER, see marginal effects, at representative value 
merge command, 2.5.6 
mhtexp command, 11.6.5 
missing data, 2.4.5 , 2.4.6 
missing-value codes, 2.4.5 
missing-values imputation, 2.4.6 
mixed command, 6.7.3 , 6.7.7 
mixed models 
linear, 6.7 , 6.7.8 
nonlinear, 13.9.3 
mkspline command, 14.5.1 , 14.5.2 
mlexp command, 13.3.3 
model approach versus weighted regression, 3.8.3 
model diagnostics, 13.8 , 13.8.4 
model selection, 11.3.7 , 13.8.2 , 13.8.2 
backward selection, 11.3.7 
based on statistical significance, 11.3.7 , 11.3.8 
finite mixture models, 14.3.8 
forward selection, 11.3.7 
information criteria, 13.8.2 , 13.8.2 , 14.3.8 
pretest bias, 11.3.8 
stepwise selection, 11.3.7 
moment-based tests, see specification tests 
Monte Carlo methods, see simulation 
moptimize() Mata function, C.1 , C.1.6 
method d2, C.1.4 
method gf*, C.1.5 
method q0, C.1.6 
method q1, C.1.6 
Poisson example, C.1.3 , C.1.6 


mtest option, 11.6.1 , 11.6.2 


multproc command, 11.6.1 , 11.6.2 
mvdecode command, 2.4.5 


N 

natural spline, 14.5.2 

nearest-neighbors regression, 14.6.2 

nl command, 10.6.1 , 13.3.6 

nlcom command, 11.3.11 

nonlinear IV, 13.3.10 

nonlinear least squares, 13.3.5 , 13.3.6 
estimator definition, 10.6 , 13.3.5 
example, 13.3.6 
introduction, 10.6 , 10.6.2 


nonlinear regression, 10 , 10.9 , 13 , 13.11 


cluster FE example, 13.9.4 , 13.9.5 


cluster PA example, 13.9.2 , 13.9.2 

cluster RE example, 13.9.3 , 13.9.3 

clustered data example, 13.9 , 13.9.5 
generalized linear models, 13.3.7 , 13.3.8 
generalized method of moments, 13.3.9 , 13.3.11 
introduction, 10 , 10.9 
logit as example, 10.5 , 10.5 


marginal effects, 10.4 , 10.4.7 , 13.7 , 13.7.16 
maximum likelihood, 13.3.1 , 13.3.3 
mixed models, 13.9.3 
nonlinear least squares, 10.6 , 10.6.2 , 13.3.5 , 13.3.6 
Poisson as example, 13.2 , 13.9.5 
prediction, 13.5 , 13.5.6 
predictive margins, 13.6 , 13.6.3 
probit as example, 10.3 , 10.4.7 
standard errors, 13.4 , 13.4.9 
Stata commands summary, 13.1 
summary of methods, 13.3 , 13.3.12 
variance estimate, 13.4 , 13.4.9 
nonparametric methods, 2.6.4 , 2.6.6 , 14.6, 14. 


bandwidth choice, 2.6.4 


Ei 


graphs, 2.6.6 , 2.6.6 , 14.6.4 
kernel aiy esate on, : 
kernel function, 2.6.4 , 2.6. 4 2.6. 
kernel regression, 2.6.6 6, 1 
local 

constant, 2.6.6 , 14.6.3 

linear, 2.6.6 , 14.6.3 

polynomial, 2.6.6 , 14.6.3 , 14.6.4 
lowess regression, 2.6.6 , 14.6.5 
nearest neighbors, 14.6.2 
plugin bandwidth, 2.6.6 , 14.6.4 
semiparametric regression, 14.7 , 14.7 


npregress kernel command, 2.6.6 , 14.6.4 


npregress series command, 1 46. 
O 
oaxaca command, 4.6.2 
odds ratio for logit model, 10.5 
one-sample t test, 3.5.12 

computed by OLS regression, 3.5.12 
operators, 1.3.6 , 1.3.6 
optimize() Mata function, C.2 , C.3 

method gf2, C.2.3 

Poisson example, C.2.3 , C.3 
order command, 2.5.1 
ordinary least squares, see linear regression model 
orthpoly command, 14.4.2 
outfile command, 2.4.8 
outreg command, 3.5.7 
outreg2 command, 3.5.7 
outsheet command, 2.4.8 
outsum command, 3.2.4 


overidentifying restrictions test, see specification tests 


P 
panel data, 8 , 8.12 , 9 , 9.7 


2SLS estimator, 9.4.3 


Arellano-Bond estimator, 9.4 , 9.4.10 
augmented mean group estimator, 9.5.8 
balanced panel, 8.2.1 
basics, 8 , 8.12 
between estimator, 8.6 , 8.6.2 
between variation, 8.3.4 , 8.8.2 
cluster-robust inference, 8.2.3 
cointegration, 9.5.9 , 9.5.9 , 9.5.9 
common correlated estimator, 9.5.8 
comparison of estimators, 8.8 , 8.8.6 
correlated random-effects model, 8.2.2 , 8.7.4 
data management, 8.10 , 8.10.4 
data summary, 8.3 , 8.3.10 
dynamic 

linear model, 9.4 , 9.4.10 

systems model, 9.4.10 
endogeneity, 8.2.2 
endogenous regressors, 9.2 , 9.4.10 


exogeneity, 8.9.2 


few clusters, 8.2.3 , 12.6.1 , 12.6.4 
first-difference 
estimator, 8.9 , 8.9.2 
IV estimator, 9.4.2 
model, 9.4.2 
fixed effects, 8.5.5 
fixed versus random effects, 8.8.4 


fixed-effects estimator, 8.5 , 8.5.5 , 9.5.5 
fixed-effects model, 8.2.2 , 8.5.1 

GMM estimator, 9.4.3 

Hausman test, 8.8.5 

Hausman-Taylor estimator, 9.3 , 9.3.3 
heterogeneous panels, 9.5.8 

individual-effects model, 8.2.2 , 8.5.1 , 8.7.1 , 9.5.5 
individual-invariant regressor, 8.2.1 , 8.3.4, 
instrumental-variables estimator, 9.2 , 9.4.10 
interactive-effects estimator, 9.5.6 
least-squares dummy-variable estimator, 8.5.4 , 8.5.4 


linear 

extensions, 9 , 9.7 

mixed models, 8.2.2 

model overview, 8.2 , 8.2.5 
long panel, 8.2.1 , 9.5 , 9.5.9 , 9.5.9 
long-form data, 8.10 , 8.10.4 
mean group estimator, 9.5.8 
Mundlak correction, 8.7.4 
pooled model, 8.2.2 , 8.4.1 
population-averaged estimator, 8.4 , 8.4.4 
population-averaged model, 8.2.2 
prediction, 8.8.6 
R2, 8.8.2 
random-coefficients model, 
random-effects estimator, 8.7 , 8.7.4 , 9.5.5 
random-effects model, 8.2.2 , 8.7.1 
separate regressions, 9.5.7 , 9.5.8 
short panel, 8.1 , 9.4 
spatial correlation in panel, 9.5.3 
Stata linear commands summary, 8.2.5 
systems IV estimator, 9.4.10 


time-series autocorrelations, 8.3.9 
time-series operators, 8.3.9 
time-series plots, 8.3.5 
two-way fixed effects, 8.5.5 , 8.7.4 
two-way-effects model, 8.2.2 , 9.5.2 
unbalanced panel, 8.2.1 , 8.3.3 , 8.8.1 
unit roots, 9.5.9 , 9.5.9 
variance components, 8.8.1 
wide-form data, 8.10 , 8.10.4 
within 
estimator, 8.5 , 8. 
scatterplot, 8.3.7 
variation, 8.3.4 , 8.8.2 


sie 


partial linear model, 14.7 
percentiles, 3.2.4 
permutation tests, see hypothesis tests 
piecewise linear regression, 14.5.1 
plots, see graphs 
plugin command, 1.9 
poisson command, |e ees 
polynomials 

fractional, 14.4.2 

global, 14.4.1 , 14.4.2 , 14.6.6 

local, 14.6.3 , 14.6.4 

orthogonal, 14.4.2 
post command, 5.3.4 
postclose command, 5.3.4 
postestimation commands summary, 3.5.3 , 13.3.4 
postfile command, 5.3.4 , 11.7.4 
power calculations, 11.8 , 11.8.4 
power commands, 11.8 , 11.8.4 
power onemean command, 11.8.1 
predict command, 4.2.1 , 8.8.6 , 13.5.1 


prediction, 4.2 , 4.3.2 , 13.5 , 13.5.6 
at a specified value, 13.5.4 
in levels from log model, 4.2.3 
in linear mixed models, 6.7.6 
in linear panel-data model, 8.8.6 
in linear regression model, 4.2 , 4.4 
in nonlinear models, 13.5 , 13.5.6 
in sample, 4.2 , 4.2.5 
of actual value, 4.3.1 , 13.5.2 
of conditional mean, 4.3.1 , 13.5.2 
of nonlinear quantity, 13.5.1 
out of sample, 4.3 , 4.3.2 , 13.5.3 
weighted, 3.8.4 
with binary regressor, 4.2.4 


predictive margins, 4.4 , 4.4.8 , 13.6, 13.6.3 


at a specified value, 13.6.2 
at means, 13.6.2 


average, 13.6.1 , 13.6.2 
contrasts of predictive margins, 4.4.8 
for a categorical variable, 4.4.4 , 13.6.3 
for a continuous variable, 4.4.5 
in nonlinear models, 13.6 , 13.6.3 
pairwise comparisons, 4.4.7 
plots of predictive margins, 4.4.6 
predictive means, 4.4 , 4.4.8 
predictnl command, 13.5.1 
prefix command, 1.3.3 
preserve command, 2.5.2 , 4.2.4 , 13.5.4 , 13.7.10 
probit command, 10.3.1 
probit model 
brief summary, 10.2.2 
program define command, A.2.1 
program drop command, A.2.2 
programs, A.2 , A.4 
bootstrap example, 12.4.4 , 12.5.2 , 12.7.2 
central limit theorem example, 5.3.1 , 5.3.5 
debugging, A.3 
in Mata, B.3 , B.3.5 
in Stata, 5.3.1 , A.2 , A.4 
Monte Carlo a ieee ton: oe 5 
named positional arguments, A.2.5 
OLS with chi-squared errors, 5.6.1 , 5.6.3 
overview, A.2 
parsing syntax, A.2.7 
positional arguments, A.2.3 
r-class example, A.2.6 
temporary variables, A.2.4 
prvalue command, 13.7.12 
pseudo-R2, 13.8.1 
p-values, 11.2 , 11.2.5 
bootstrap, 12.5.2 
chi-squared versus F, 11.2.2 
computation, 11.2 , 11.2.5 
standard normal versus t, 11.2.1 , 11.2.5 


used in Stata commands, 11.2.5 
pvar command, 9.4.10 
pwcorr command, LaL 
python command, 1.9 


Q 


qnorm command, 3.6.2 


qplot command, 15.3.1 


qqplot command, 3.6.2 
qreg command, 3.6.5 , 15.2.4 
qreg2 command, 15.3. 
quadrature methods, 5.5.1 

adaptive Gauss–Hermite, 5.5.1 
quantile regression, 15 , 15.7 

bootstrap robust variance, 15.2.4 

censored data, 15.3.11 

coefficient comparison across quantiles, 15.3.5 , 15.3.10 

coefficient interpretation, 15.3.3 

computation, 15.2.2 

conditional quantiles, 15.2 , 15.5 


data example, 15.3 , 15.3.10 
definition of quantiles, 15.1 
endogenous data, 15.3.11 
generated data example, 15.4 , 15.4.2 
loss functions, 15.2.1 
marginal effects, 15.3.4 
quantile estimator, 15.2.2 
retransformation, 15.3.4 
robust standard errors, 15.2.3 , 15.3.6 
test of coefficient equality, 15.3.9 
test of heteroskedasticity, 15.3.8 
treatment effect, 15.5 , 15.5 
variance matrix estimation, 15.2.3 

query command, 1.4.6 


R 
R2, 13.8.1 


for linear panel models, 8.8.2 

partial R2, 7.6.2 

pseudo- R2, 13.8.1 
random effects, see clustered data; panel data 
randomization tests, see hypothesis tests 
random-number generation, see simulation 
rbeta() function, 5.2.3 
rbinomial() function, 5.2.4 
rchi2() function, 5.2.3 
r-class commands, 1.6.1 , 5.3.1 , A.2.6 
recode command, 2.4.7 
recursive models, 7.2.3 
reg3 command, 7 E 
rego command, 4.7 
regress command. d 2 3, 1.6.2 , 3.5.2 , 6.6.4 
regression decomposition analysis 

Oaxaca—Blinder, 4.6.1 , 4.6.2 

Shapley relative importance, 4.7 , 4.7.1 
regression splines, 14.5 , 14.5.3 , 14.6.6 

basis expansions, 14.5.3 

cubic splines, 14.5.2 

knots, 14.5.1 

natural spline, 14.5.2 

piecewise linear regression, 14.5.1 

restricted spline, 14.5.2 

smoothing splines, 14.5.3 


relational operators, 1.3. 


rename command, 2 


replace command, 2.4.5 , 2.4.7 
report generation, 3.10 
resampling methods, see bootstrap methods 
reshape command, 23a ; 4.8.4 f 8.10 , 8.10.4 
residuals 
definitions of various residuals, 13.8.3 , 13.8.3 
residual plots, 3.6.2 
restore command, 23. 2. 24 5 13.5.4 3 13.7.10 


return code, 1.3.7 , A.3.2 


return command, 5.6.1 

return list command, 1.6.1 

rgamma() function, 5.2.3 

rivtest peed TIa 

rnbinomial() function, 5.2.4 

rnormal() function, 5.2.2 

robust regression, 3.6.1 , 3.6.1 , 3.6.4 , 3.6.6 
influential observations, 3.6.3 

robust standard errors, see variance-covariance matrix 

rpoisson() function, 5.2.4 

rreg command, 3.6.4 
) function, 5.2.3 

runiform() function, 5.2.1 


rvfplot command, 3.6.2 


S 
sample command, 2.4.9 
sampling weights, 3.8.1 
in linear regression model, 3.8 , 3.8.4 , 6.9 , 6.9.3 
weighted mean, 3.8.2 , 6.9.2 
save command, 2.4.8 
saveold command, 2.4.8 
scalars, 1.5.1 , 1.7.3 
compared with macros, 1.7.3 
scatterplot, 2.6.5 
score test, see hypothesis tests 
search command, 1.2.4 
seed for bootstrap, 12.2.4 
seed for random-number generation, 5.2.1 
seemingly unrelated estimations, 6.8.7 
seemingly unrelated regression equations, 6.8 , 6.8.7 
cross-equation restrictions test, 6.8.5 
feasible GLS estimation, 6.8.1 , 6.8.6 
FGLS example, 6.8.3 
imposing cross-equation restrictions, 6.8.6 
robust standard errors, 6.8.4 
SUR estimator, 6.8.1 


SUR model, 6.8.1 

test of error independence, 6.8.3 
selection models 

two-step estimation, 12.4.5 
semielasticities, 13.7.14 , 13.7.15 

in linear regression model, 4.5.8 
semiparametric methods, 14.7 , 14.7 
series regression, 14.6.6 , 14.6.6 


set 
command, 1.4.6 
dots command, 3.5.11 
iterlog command, 3.6.6 
matsize command, 6.6.4 
seed command, 5.2.1 , 12.2.4 , 15.2.4 
simulate prefix, 5.3.2 
simulation, 5 , 5.8 
asymptotic power, 11.7.5 
bias of estimator, 5.6.2 
bias of standard error estimator, 5.6.2 
bootstrap example, 12.7.2 
central limit theorem example, 5.3 , 5.3.5 
Cholesky decomposition, 5.4.5 
computing integrals, 5.5 , 5.5.4 
direct transformation, 5.4.2 
distribution of sample mean, 5.3 , 5.3.5 
draws from multivariate normal, 5.4.5 
draws from normal, 5.2.2 
draws from truncated normal, 5.4.4 
draws from uniform, 5.2.1 
draws of continuous variates, 5.2.3 
draws of discrete variates, 5.2.4 
draws using Markov chain Monte Carlo method, 5.4.6 , 5.4.6 
endogeneity example, 7.2.4 , 7.2.5 
endogenous regressors example, 5.6.5 
FGLS heteroskedastic errors example, 6.3.1 
Gibbs sampler example, 5.4.6 
Halton sequences, 5.5.4 


Hammersley sequences, 5.5.4 
inconsistency of estimator, 5.6.4 


interpreting simulation output, 5.6.2 
inverse-probability transformation, 5.4.1 
linear regression example, 5.6 , 5.6.5 
measurement-error example, 5.6.4 

mixture of distributions, 5.4.3 

Monte Carlo integration, 5.5.3 , 5.5.4 

OLS with chi-squared errors, 5.6.1 , 5.6.3 
pseudorandom numbers, 5.2 , 5.2.4 

seed, 5.2.1 

test power computation, 5.6.3 , 11.7.4 , 11.7.4 
test size computation, 5.6.2 , 11.7.2 , 11.7.2 
using actual data, 11.7.3 , 11.7.3 

using postfile command, 5.3.4 

using simulate prefix, 5.3.2 , 5.6 


weak-instruments example, 7.5.3 , 7.5.5 


simultaneous equations model, 7.2 , 7.2.5 
3SLS estimation, 7.10 , 7.10 
first-stage equation, 7.2.3 
recursive system, 7.2.3 
reduced form, 7.2.2 
structural equation, 7.2.3 
structural model, 7.2.1 

single-index models, 14.7 
coefficient interpretation, 10.4.6 , 13.7.3 
definition, 13.7.3 

skewness measure, 3.2.4 

sort command, 2.5.1 

specification analysis, 3.6 , 3.6.6 


chi-squared goodness-of-fit test, 11.9.3 
first-stage F statistic, 7.6.4 , 7.6.10 

for fixed effects, 8.8.5 , 8.8.5 

for serially uncorrelated panel errors, 9.4.6 


information matrix test, 11.9.2 


moment-based tests, 11.9.1 

of endogeneity, 7.4.6 , 7.4.6 , 12.4.6 , 12.4.6 

of heteroskedasticity, 3.7.4 , 6.3.3 , 15.3.8 

of omitted variables, 3.7.1 

of overidentifying restrictions, 7.4.8 , 7.4.8 , 9.4.6, 11.9.4 
of underidentification, 7.6.3 

of weak instruments, 7.6.4 , 7.6.10 

panel cointegration, 9.5.9 , 9.5.9 


panel unit roots, 9.5.9 , 9.5.9 
RESET test, 3.7.2 
robust tests, 7.7.5 
test of heteroskedasticity, 3.7.4 
spline regression, see regression splines 
spshape2dta command, 2.3.4 
sqreg command, 15.2.4 
st_addvar() Mata function, B.2.6 
standard error estimation, see variance-covariance matrix 
standardized regression, 3.11 
Stata documentation, 1.2.1 , 1.2.4 
Stata Journal, 1.2.2 
Stata Technical Bulletin, 1.2.2 
st_data() Mata function, B.1.3 
ta “7 prefix, 9.5.7 
) Mata function, B.2.1 
stepwise command, 11.3.7 
st_matrix() Mata function, 3.9 , B.2.1 , B.2.6 
string data, A Pe EE 
st_store() Mata function, B.2.6 
st_view() Mata function, 3.9 , B.2.1 
subsampling, see bootstrap methods 
suest command, 4.6.2 , 6.8.7 
summarize command, 1.3.2 , 1.6.1 , 3.2.3 
summary statistics, see data summary examples 
summcluster package, 6.4.6 
SUR model, see seemingly unrelated regression equations 
sureg command, 6.8.2 
survey data, 3.8 , 3.8.4, 6.9 , 6.9.3 
clustering, 6.9 
commands for survey data, 6.9.1 
complex survey data, 6.9 
stratification, 6.9 
weighted mean, 3.8.2 , 6.9.2 
weighted regression, 3.8.3 , 6.9.3 
weighting, 6.9 
svy prefix, 6.9 
svy: regress command, 6.9.3 
svydescribe command, 6.9.1 
svymean command, 6.9.2 
svyset command, 6.9.1 
syntax, 1.3 , 1.3.1 
basic command syntax, 1.3 
parsing syntax, A.2.7 
syntax command, A.2.7 
systems of equations 
linear regressions, 6.8 , 6.8.7 
linear simultaneous equations, 7.10 , 7.10 
sysuse command, 1.3.2 
T 
tab2 command, 3.2.5 
table command, 3.2.5 , 3.5. 


tables of frequencies, 3. 2, Me A 


tables of regression output, 3.5.6 
in Word or LaTeX, 3.10 
tables of summary statistics, 3.2.6 , 3.2.6 
tabstat command, 3.2.6 
tabulate command, 2.4.3 , 2.4.7, 3.2.3 
tempvar command, A.2.4 
test command, 3.5.5 , 11.3.2 
test of overidentifying restrictions, see specification tests 
testnl command, 11.3.6 
testparm command, 11.3.2 
tests, see hypothesis tests 


text data, 2.2.1 


three-stage least squares, 7.10 , 7.10 
time-series data, 2.4.10 , 2.4.10 
within panel, 8.3.9 
time-series operators, 8.3.9 
tokens() Mata function, 3.9 
tostring command, 2.2.3 
trace command, A.3.3 
treatment effects 
difference in differences, 4.8 , 4.8.4 
quantile regression, 15.5, 15.5 
treatment—control comparison, 4.8.3 
tsset command, 2.4.10 , 5.2.1 
ttest command, 3.2.7 , 3.5.12 
compared with regress, 3.5.12 
two-sample 2SLS estimator, 7.9.4 
two-sample t test, 3.5.12 
computed by OLS regression, 3.5.12 
two-stage least-squares estimator, see instrumental variables 
two-step estimator 
bootstrap standard error, 12.4.5 , 12.4.5 
stacked GMM standard error, 13.3.11 , 13.3.11 
two-way cluster-robust, 6.4.4 , 6.4.4 
twoway command, 2.6.5 
type command, 1.4.1 


U 
update command, 1.1 
use command, 2.3.2 


V 
variable names and labels, 2.4.2 , 2.4.2 
variance-covariance matrix, 3.4.7 , 3.4.7, 
default estimate, 3.4.4 , 13.4.4 
dyad-robust, 6.4.5 
for FGLS, 6.2.3 , 6.2.4 


for IV, 7.3.6 
for m-estimator, 13.4.1 
for nonlinear estimators, 13.4 , 13.4.9 
for OLS, 3.4.4 , 3.4.7 
for SUR, 6.8.4 
for two-step estimator, 12.4.5 , 12.4.5 , 13.3.11 , 13.3.11 
HAC estimate, 13.4.7 
heteroskedastic—robust, 3.4.5 , 13.4.5 
jackknife estimate, 12.9.1 , 12.9.2 , 13.4.8 
outer product of the gradient, 13.4.4 
robust when to use, 13.3.1 , 13.4.2 
sandwich form, 13.4.1 
two-way cluster-robust, 6.4.4 , 6.4.4 
unconditional, 4.5.3 , 13.7.4 , 13.7.9 
vce(bootstrap) option, 10.3.3 
VCE, see variance-covariance matrix 
vce() option, see variance-covariance matrix 
Chapter 16 
Nonlinear optimization methods 


16.1 Introduction 


Much of this book covers nonlinear regression models. In many cases, these 
can be fit using official Stata estimation commands such as logit and 
poisson, without knowledge of how the estimates are obtained. Then 
nonlinearity merely adds the complication of how to interpret the regression 
output. 


In this chapter, we explain the iterative methods used to obtain 
parameter estimates. The discussion can be relevant even when an official 
command is used. For example, it explains why an iteration log might 
include messages such as (backed up) or (not concave) and whether these 
are reason for concern. 


Additionally, we present methods to fit a model for which there is no 
official command. The ml command enables maximum likelihood (ML) 
estimation if, at a minimum, the log density formula is provided. This 
command is more generally applicable to other m-estimators, such as the 
nonlinear least-squares (NLS) estimator. 


Not all models can be fit using the ml command. A wider range of 
models can be fit using the Mata moptimize() and optimize() functions 
for optimization; these functions are part of the Mata matrix programming 
language and are detailed in appendix C. The optimization tools in order of 
complexity are ml, moptimize(), and optimize(). All lead ultimately to 
computation using optimize() and, for equivalently defined models, the 
same numerical results. 


We illustrate the ml command methods using ML estimation of Poisson 
and negative binomial regression models. Application is to the same 
Medical Expenditure Panel Survey data example described more fully in 
section 10.2 and also used in chapter 13. The chapter concludes with a 
nonlinear generalized method of moments (GMM) example fit using the 
Mata optimize() function. 


16.2 Newton–Raphson method 


Estimators that maximize an objective function, such as the log likelihood, 
are obtained by calculating a sequence of estimates 
$\hat\theta_1, \hat\theta_2, \ldots$ that move 
toward the top of the hill. Gradient methods do so by moving by an amount 
that is a suitable multiple of the gradient at the current estimate. A standard 
method is the Newton–Raphson (NR) method, which works especially well 
when the objective function is globally concave. 


16.2.1 NR method 


We consider the estimator $\hat\theta$ that is a local maximum of the objective 
function $Q(\theta)$, so $\hat\theta$ solves 

$$g(\hat\theta) = 0$$

where $g(\theta) = \partial Q(\theta)/\partial\theta$ is the gradient vector. Numerical solution methods 
are needed when these first-order conditions do not yield an explicit solution 
for $\hat\theta$. 


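Let $\hat\theta_s$ denote the $s$th round estimate of $\theta$. Then a second-order Taylor-series 
expansion around $\hat\theta_s$ approximates the objective function $Q(\theta)$ by 

$$Q^*(\theta) = Q(\hat\theta_s) + g(\hat\theta_s)'(\theta - \hat\theta_s) + \frac{1}{2}(\theta - \hat\theta_s)' H(\hat\theta_s)(\theta - \hat\theta_s)$$

where $H = \partial g(\theta)/\partial\theta' = \partial^2 Q(\theta)/\partial\theta\partial\theta'$ is the Hessian matrix. 
Maximizing this approximating function with respect to $\theta$ leads to 

$$\partial Q^*(\theta)/\partial\theta = g(\hat\theta_s) + H(\hat\theta_s)(\theta - \hat\theta_s) = 0$$

Solving for $\theta$ yields the NR algorithm 

$$\hat\theta_{s+1} = \hat\theta_s - H(\hat\theta_s)^{-1} g(\hat\theta_s) \tag{16.1}$$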


The parameter estimate is changed by a matrix multiple of the gradient 
vector, where the multiple is minus the inverse of the Hessian matrix. 


The final step in deriving (16.1) presumes that the inverse exists. If 
instead the Hessian is singular, then $\hat\theta_{s+1}$ is not uniquely defined. The 
Hessian may be singular for some iterations, and optimization routines have 
ways of still continuing the iterations. However, the Hessian must be 
nonsingular at the optimum. This complication does not arise if $Q(\theta)$ is 
globally concave because then the Hessian is negative definite at all points 
of evaluation. In that case, the NR method works well, with few iterations 
required to obtain the maximum. 
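
To make the update in (16.1) concrete, the following is a minimal Mata sketch for a scalar problem. The objective function $Q(\theta) = -\exp(\theta) + 2\theta$, the starting value, the iteration cap, and the tolerance are illustrative assumptions and are not part of the chapter's application; this function is globally concave with its maximum at $\theta = \ln 2$. 

```stata
* Scalar NR iterations for Q(theta) = -exp(theta) + 2*theta (illustrative)
mata:
theta = 0                          // starting value (arbitrary choice)
for (s = 1; s <= 25; s++) {
    g     = -exp(theta) + 2        // gradient at the current estimate
    H     = -exp(theta)            // Hessian (always negative here)
    step  = -g/H                   // NR update -H^{-1} g, as in (16.1)
    theta = theta + step
    printf("s = %g, theta = %10.8f\n", s, theta)
    if (abs(step) < 1e-10) break   // stop once the update is negligible
}
end
```

Because the Hessian is negative everywhere, the iterations should settle after a handful of steps; a different starting value would be expected to change mainly the number of steps taken. 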


16.2.2 NR method for Poisson 


The Poisson model is summarized in section 13.2.2. As noted at the start of 
section 13.2, the fact that we are modeling a discrete random variable places 
no restriction on the generality of the example. Exactly the same points 
could be illustrated using, for example, complete spell duration data modeled 
using the exponential or even a nonlinear model with normally distributed 
errors. 


For the Poisson model, the objective function, gradient, and Hessian are, 
respectively, 

$$Q(\beta) = \sum_{i=1}^N \{-\exp(x_i'\beta) + y_i x_i'\beta - \ln y_i!\}$$

$$g(\beta) = \sum_{i=1}^N \{y_i - \exp(x_i'\beta)\} x_i \tag{16.2}$$

$$H(\beta) = \sum_{i=1}^N -\exp(x_i'\beta) x_i x_i'$$

Note $H(\beta) = -X'DX$, where $X$ is the $N \times K$ regressor matrix and 
$D = \mathrm{Diag}\{\exp(x_i'\beta)\}$ is an $N \times N$ diagonal matrix with positive entries. It 
follows that if $X$ is of full rank, then $H(\beta)$ is negative definite for all $\beta$, and 
the objective function is globally concave. Combining (16.1) and (16.2), we 
see the NR iterations for the Poisson maximum-likelihood estimator (MLE) are 

$$\hat\beta_{s+1} = \hat\beta_s + \left[\sum_{i=1}^N \exp(x_i'\hat\beta_s) x_i x_i'\right]^{-1} \sum_{i=1}^N \{y_i - \exp(x_i'\hat\beta_s)\} x_i$$


16.2.3 Poisson NR example using Mata 


To present an iterative method in more detail, we manually code the NR 
algorithm for the Poisson model, using Mata functions that are explained in 
appendix B. The same example is used in subsequent sections to 
demonstrate use of the Stata ml command and the Mata optimize() and 
moptimize() functions. 


Core Mata code for Poisson NR iterations 


For expositional purposes, we begin with the Mata code for the core 
commands to implement the NR iterative method for the Poisson. 


We assume that the regressor matrix X and the dependent variable vector 
y have already been constructed. Iterations stop when 
$g(\hat\beta_s)'H(\hat\beta_s)^{-1}g(\hat\beta_s)$ is less than $10^{-8}$ in absolute value, and output 
provides an iteration log. 


* Core Mata code for Poisson MLE NR iterations 
mata 
    p = cols(X)                         // Number of regressors 
    b = J(p, 1, .02)                    // Start values are close to zero 
    gHinvg = 1                          // Initialize scaled gradient 
    iter = 0                            // Initialize number of iterations 
    do { 
        mu = exp(X*b) 
        grad = X'*(y-mu)                // k x 1 gradient vector 
        hes = cross(X, mu, X)           // Negative of the k x k Hessian matrix 
        diff = cholinv(hes)*grad        // Update amount 
        bold = b 
        b = bold + diff 
        iter = iter + 1 
        gHinvg = grad'*diff             // Equals grad'*cholinv(hes)*grad 
        printf("iter = %s, gHginv = %6.5g\n", strofreal(iter), gHinvg) 
    } while (gHinvg > 1e-8)             // End of iteration loops 
end 


The $N \times 1$ vector mu has the $i$th entry $\mu_i = \exp(x_i'\beta)$. The $K \times 1$ vector 
grad equals $\sum_{i=1}^N (y_i - \mu_i) x_i$, and hes=cross(X, mu, X) equals 
$\sum_{i=1}^N \mu_i x_i x_i'$. 


The cross() function ensures a numerical result that is a symmetric 
matrix. The quickest function for taking a matrix inverse is cholinv() for a 
symmetric positive-definite matrix. Thus, we set hes to $-H(\beta)$, which is 
positive definite, but then the NR update has a plus sign rather than the minus 
sign in (16.1). In some cases, though not here given use of the cross() 
function, hes may not be symmetric because of a rounding error. Then we 
would add the hes = makesymmetric(hes) command before calling 
cholinv(). 


Complete Stata and Mata code for Poisson NR iterations 


The complete program has the following sequence: 1) in Stata, obtain the 
data, and define any macros used subsequently; 2) in Mata, calculate the 
parameter estimates and the estimated variance—covariance matrix of the 
estimator (VCE), and pass these back to Stata; and 3) in Stata, output nicely 
formatted results. 


We begin by reading in the data and defining the global macro y for the 
dependent variable and the global macro xlist for the regressors. 


. * Set up data and global macros for dependent variable and regressors 
. qui use mus210mepsdocvisyoung 
. keep if year02 == 1 
(25,712 observations deleted) 
. generate cons = 1 
. global y docvis 
. global xlist private chronic female income cons 


The dependent variable is the number of office-based physician visits 
(docvis) by persons in the United States aged 25–64 years. The regressors 
used are health insurance status (private), health status (chronic), and 
socioeconomic characteristics (female and income). 


The subsequent Mata program reads in the relevant data and obtains the 
parameter estimates and the estimate of the VCE. The program first associates 
vector y and matrix X with the relevant Stata variables by using the 
st_view() function. The tokens() function is added to convert $xlist to 
a comma-separated list with each entry in double quotes, the necessary 
format for st_view(). The starting values are 0.02 for all parameters, as 
in the core code. The robust estimate of the VCE is 
obtained, and this and the parameter estimates are passed back to Stata by 
using the st_matrix() function. We have 


. * Complete Mata code for Poisson MLE NR iterations 
. mata: 
------------------------------------------------- mata (type end to exit) ----- 
:   st_view(y=., ., "$y")               // Read in stata data to y and X 
:   st_view(X=., ., tokens("$xlist")) 
:   p = cols(X)                         // Number of regressors 
:   n = rows(X) 
:   b = J(p, 1, .02)                    // Starting values are close to zero 
:   gHinvg = 1                          // Initialize scaled gradient 
:   iter = 0                            // Initialize number of iterations 
:   do { 
>       mu = exp(X*b) 
>       grad = X'*(y-mu)                // k x 1 gradient vector 
>       hes = cross(X, mu, X)           // Negative of the k x k Hessian matrix 
>       diff = cholinv(hes)*grad        // Update amount 
>       bold = b 
>       b = bold + diff 
>       iter = iter + 1 
>       gHinvg = grad'*diff             // Equals grad'*cholinv(hes)*grad 
>       printf("iter = %s, gHginv = %6.5g\n", strofreal(iter), gHinvg) 
>   } while (gHinvg > 1e-8)             // End of iteration loops 
iter = 1, gHginv = 2.2e+04 
iter = 2, gHginv = 1.3e+04 
iter = 3, gHginv = 1648 
iter = 4, gHginv = 52.48 
iter = 5, gHginv = .0808 
iter = 6, gHginv = 3.2e-07 
iter = 7, gHginv = 8.9e-18 
:   mu = exp(X*b) 
:   hes = cross(X, mu, X) 
:   vgrad = cross(X, (y-mu):^2, X) 
:   vb = cholinv(hes)*vgrad*cholinv(hes)*n/(n-cols(X)) 
:   st_matrix("b",b')                   // Pass results from Mata to Stata 
:   st_matrix("V",vb)                   // Pass results from Mata to Stata 
: end 


Once back in Stata, we use the ereturn command to display the results, 
first assigning names to the columns and rows of b and V. We have 


. * Present results, nicely formatted using Stata command ereturn 
. matrix colnames b = $xlist 
. matrix colnames V = $xlist 
. matrix rownames V = $xlist 
. ereturn post b V 
. ereturn display 
------------------------------------------------------------------------------ 
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986654   .1090509     7.32   0.000     .5849295    1.012401 
     chronic |   1.091865   .0560205    19.49   0.000     .9820669    1.201663 
      female |   .4925481    .058563     8.41   0.000     .3777666    .6073295 
      income |    .003557    .001083     3.28   0.001     .0014344    .0056796 
        cons |  -.2297263   .1109236    -2.07   0.038    -.4471325   -.0123202 
------------------------------------------------------------------------------ 


The coefficients are the same as those from the poisson command (see 
section 13.3.2), and the standard errors are the same to at least the first three 
significant digits. The program required seven iterations. 


The preceding NR algorithm can be adapted to use Stata matrix 
commands, but it is better to use Mata functions because these can be 
simpler. Also, Mata functions read more like algebraic matrix expressions, 
and Mata does not have the restrictions on matrix size that are present in 
Stata. 


16.3 Gradient methods 


In this section, we consider various gradient methods, stopping criteria, 
multiple optimums, and numerical derivatives. The discussion is relevant 
for official estimation commands, as well as for community-contributed 
commands. 


16.3.1 Maximization options 


Stata ML estimation commands, such as poisson, and the general-purpose 
ml command, presented in the next section, have various maximization 
options that are detailed in [R] Maximize. 


The default is to provide an iteration log that gives the value of the 
objective function at each step plus information on the iterative method 
being used. This can be suppressed using the nolog option. Additional 
information at each iteration can be given by using the trace (current 
parameter values), gradient (current gradient vector), hessian (current 
Hessian), and showstep (report steps within each iteration) options. 


The technique() option allows several maximization techniques other 
than NR. The nr, bhhh, dfp, bfgs, and gn options are briefly discussed in 
section 16.3.2. 


Three stopping criteria, which are the options tolerance(#), 
ltolerance(#), and nrtolerance(#), are discussed in section 16.3.4. The 
default is nrtolerance(1e-5). 


The difficult option uses an alternative method to determine steps 
when the estimates are in a region where the objective function is 
nonconcave. 


The from(init specs) option allows starting values to be set. 


The maximum number of iterations can be set by using the iterate(#) 
option or by the separate command set maxiter #. The default is 300, but 
this can be changed. 
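
As a hedged illustration of how these options are combined in practice, the sketch below applies several of them to the poisson command. The auto dataset and the choice of regressors are stand-ins chosen only so that the example is self-contained; they are not the chapter's data, and the particular option values are arbitrary. 

```stata
* Illustrative only: the auto dataset stands in for the chapter's data
sysuse auto, clear

* Default NR method with a more detailed iteration log
poisson rep78 mpg foreign, trace gradient

* BFGS instead of NR, at most 100 iterations, tighter tolerance, no log
poisson rep78 mpg foreign, technique(bfgs) iterate(100) nrtolerance(1e-7) nolog

* Starting values given by position (slopes first, then the constant)
poisson rep78 mpg foreign, from(0 0 1, copy) difficult
```

The difficult option is not needed for a globally concave problem such as this one; it is included only to show where it goes. 
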
16.3.2 Gradient methods 


Stata maximization commands use the iterative algorithm 

$$\hat\theta_{s+1} = \hat\theta_s + a_s W_s g_s, \quad s = 1, \ldots, S \tag{16.3}$$

where $a_s$ is a scalar step-size adjustment and $W_s$ is a $q \times q$ weighting 
matrix. A special case is the NR method given in (16.1), which uses $-H_s^{-1}$ 
in place of $a_s W_s$. 


If the matrix multiplier $W_s$ is too small, we will take a long time to 
reach the maximum, whereas if the multiplier is too large, we can overshoot the 
maximum. The step-size adjustment $a_s$ is used to evaluate $Q(\hat\theta_{s+1})$ at 
$\hat\theta_{s+1} = \hat\theta_s + a_s W_s g_s$ over a range of values of $a_s$ (such as 0.5, 1, and 2), 
and the value of $a_s$ that leads to the largest value for $Q(\hat\theta_{s+1})$ is chosen. 
This speeds up computation because calculation of $W_s g_s$ takes much more 
time than several subsequent evaluations of $Q(\hat\theta_{s+1})$. It also ensures that 
the algorithm refuses steps that would decrease $Q(\theta)$. Stata uses a 
sophisticated method to choose $a_s$ so that convergence occurs quickly, even 
for difficult problems. 
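
The idea of searching over step sizes can be sketched in a few lines of Mata. This is a deliberately crude illustration and not Stata's actual step-size rule: the scalar objective $Q(\theta) = -\exp(\theta) + 2\theta$, the current estimate, and the grid of candidate values for $a_s$ are all assumptions made for the example. 

```stata
* One step of (16.3) with a grid search over the step size (illustrative)
mata:
theta = 3                          // current estimate, far from the maximum
g     = -exp(theta) + 2            // gradient of Q at theta
W     = 1/exp(theta)               // W = -H^{-1}, the NR choice
dir   = W*g                        // direction W*g, computed only once
avals = (0.5, 1, 2)                // candidate step sizes
abest = 0                          // a = 0 means the step is refused
Qbest = -exp(theta) + 2*theta      // Q at the current estimate
for (j = 1; j <= cols(avals); j++) {
    cand  = theta + avals[j]*dir
    Qcand = -exp(cand) + 2*cand    // cheap reevaluation of Q only
    if (Qcand > Qbest) {
        Qbest = Qcand
        abest = avals[j]
    }
}
theta = theta + abest*dir          // accept the best step found
printf("chosen a = %g, new theta = %9.6f, Q = %9.6f\n", abest, theta, Qbest)
end
```

Initializing the search at $a_s = 0$ means a step is taken only if it raises the objective function, which mirrors the refusal of steps that would decrease $Q(\theta)$. 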


Different weighting matrices $W_s$ correspond to different gradient 
methods. Ideally, the NR method can be used, with $W_s = -H_s^{-1}$. If $H_s$ is 
nonnegative definite, noninvertible, or both, then $H_s$ is adjusted so that it is 
invertible. Stata also uses $W_s = -\{H_s + c\,\mathrm{Diag}(H_s)\}^{-1}$. If this fails, then 
Stata uses NR for the orthogonal subspace corresponding to nonproblematic 
eigenvalues of $H_s$ and steepest ascent ($W_s = I_s$) for the orthogonal 
subspace corresponding to problematic (negative or small positive) 
eigenvalues of $H_s$. 


Other optimization methods can also be used. These methods calculate 
alternatives to $H_s^{-1}$ that can be computationally faster and can be possible 
even in regions where $H_s$ is nonnegative definite, noninvertible, or both. 


The alternative methods available for the ml command are the 
Berndt–Hall–Hall–Hausman, Davidon–Fletcher–Powell, and 
Broyden–Fletcher–Goldfarb–Shanno algorithms. These methods can be selected 
by specifying the technique() option as, respectively, bhhh, dfp, and bfgs. 
Additionally, some estimators with a quadratic objective function, such as 
GMM, are obtained using technique(gn), which uses the Gauss–Newton 
algorithm. The methods are explained in, for example, Cameron and Trivedi 
(2005, chap. 10), Greene (2018, app. E), Wooldridge (2010, chap. 12.7), and 
Gould, Pitblado, and Poi (2010, chap. 1). 


Some of these algorithms can converge even if $H_s$ is still nonnegative 
definite. Then one can obtain parameter estimates but not standard errors 
because the latter requires inversion of the Hessian. The lack of standard 
errors is a clear signal of problems. 


16.3.3 Messages during iterations 


The iteration log can include comments on each iteration. 


The message (backed up) is given when the original step size $a_s$ in 
(16.3) resulted in a lower $Q(\hat\theta_{s+1})$. The message (not concave) means 
that $-H_s$ was not invertible. In both cases, the ultimate results are fine, 
provided that these messages are not being given at the last iteration. When 
(not concave) is displayed at every iteration, however, the model is almost 
certainly not identified; see section 16.3.5. 


16.3.4 Stopping criteria 


The iterative process continues until it is felt that $g(\hat\theta) \approx 0$ and that $Q(\hat\theta)$ is 
close to a maximum. 


Stata has three main stopping criteria: these are the small change in the 
coefficient vector (tolerance()); the small change in the objective function 
(ltolerance()); and the small gradient relative to the Hessian 
(nrtolerance()). The Stata default values for these criteria can be 
changed; see help maximize. 


The default and preferred stopping criterion is nrtolerance(), which is 
based on $g(\hat\theta)'H(\hat\theta)^{-1}g(\hat\theta)$. The default is to stop when this 
quantity is less than $10^{-5}$. 


In addition, the user should be aware that even if the iterative method 
has not converged, estimation will stop after maxiter iterations. If the 
maximum is reached without convergence, regression results, including 
parameters and standard errors, are still provided, along with a warning 
message that convergence is not achieved. 
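
Because results are reported even without convergence, it can be worth checking the stored convergence flag rather than relying on reading the log. The following sketch uses the auto dataset purely as a stand-in; whether the first fit is flagged as nonconverged depends on how quickly the optimizer settles for the data at hand. 

```stata
* Checking stored convergence results after estimation (illustrative data)
sysuse auto, clear

* Allow only one iteration; a nonconvergence warning may be shown
poisson rep78 mpg foreign, iterate(1)
display "converged = " e(converged) ", iterations = " e(ic)

* Default settings
poisson rep78 mpg foreign
display "converged = " e(converged) ", iterations = " e(ic)
```

Here e(converged) equals 1 when the stopping criteria were met and 0 otherwise, so a do-file can test it before any results are used. 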


16.3.5 Nonconvergence and possible lack of identification 


Parameter estimates should not be used if iterations fail to converge. They 
also should not be used if the final iteration has a warning that the objective 
function is nonconcave or that the Hessian is not negative definite, because 
this indicates that the model is not identified. Missing standard errors also 
indicate a problem. 


Failure of iterative methods to converge may indicate numerical 
problems, due to poor starting values or a highly nonlinear model that is 
challenging to fit, that are potentially solvable. For example, convergence 
of estimators for the parametric models presented in chapter 23 might 
require that the number of numerical integration points be increased from 
program defaults. 


But nonconvergence may also indicate that the model is not identified. 
With more complex models, fitting a model that is not identified becomes 
more likely. An example is if too many components are specified for a 
finite mixture model. 


16.3.6 Multiple maximums 


Complicated objective functions can have multiple optimums. The 
following provides an example: 


. * Objective function with multiple optima 
. graph twoway function 
>     y=100-0.0000001*(x-10)*(x-30)*(x-50)*(x-50)*(x-70)*(x-80), 
>     range(5 90) plotregion(style(none)) scale(1.2) 
>     title("Objective function Q({&theta}) as {&theta} varies") 
>     xtitle("{&theta}", size(medlarge)) xscale(titlegap(*5)) 
>     ytitle("Q({&theta})", size(medlarge)) yscale(titlegap(*5)) 

Figure 16.1. Objective function with multiple optimums 


From figure 16.1, there are three local maximums (at $\theta \approx 15$, at $\theta \approx 50$, 
and at $\theta \approx 75$) and two local minimums (at $\theta \approx 35$ and at $\theta \approx 65$). Most 
econometrics estimators are defined as a local maximum. It is assumed that 
the maximum is unique or that the optimizer is searching in the 
neighborhood of the largest of the local maximums. In this example, this 
approach applies to the largest of the local maximums, which is $\theta \approx 15$. 


What problems might a gradient method encounter? If we start at 
$\theta < 30$, we will eventually move to the desired optimum at $\theta \approx 15$. If 
instead we start at $\theta > 30$, then we will move to smaller local maximums at 
$\theta \approx 50$ or $\theta \approx 75$. Furthermore, the objective function is relatively flat for 
$30 < \theta < 80$, so it may take quite some time to move to a local maximum. 


Even if one obtains parameter estimates, they need not provide the 
largest local maximum. One method to check for multiple optimums is to 
use a range of starting values. This problem is more likely with community- 
contributed estimators because most official Stata commands apply to 
models where multiple optimums do not arise. 
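
The check based on a range of starting values can be sketched as follows, using the objective function plotted in figure 16.1. The hill-climbing routine is a crude derivative-free local search written only for this illustration (it is not one of Stata's optimizers), and the function names Qmulti() and climb_sketch() and the three starting values are hypothetical choices. 

```stata
* Local search from several starting values (illustrative sketch)
capture mata: mata drop Qmulti()
capture mata: mata drop climb_sketch()
mata:
real scalar Qmulti(real scalar x)
{
    return(100 - 1e-7*(x-10)*(x-30)*(x-50)^2*(x-70)*(x-80))
}
real scalar climb_sketch(real scalar start)
{
    real scalar t, step
    t    = start
    step = 1
    while (step > 1e-6) {
        // move uphill if either neighbor is higher; otherwise shrink the step
        if (Qmulti(t + step) > Qmulti(t))      t = t + step
        else if (Qmulti(t - step) > Qmulti(t)) t = t - step
        else                                   step = step/2
    }
    return(t)
}
starts = (20, 40, 72)              // one starting value per suspected basin
sol    = J(1, 3, .)
for (k = 1; k <= 3; k++) {
    sol[k] = climb_sketch(starts[k])
    printf("start = %g, local maximum = %7.3f, Q = %8.3f\n", starts[k], sol[k], Qmulti(sol[k]))
}
end
```

Each starting value leads to a different local maximum, and comparing the attained values of the objective function identifies the largest of them. 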


16.3.7 Numerical derivatives 


All gradient methods require first derivatives of the objective function, and 
most require second derivatives. For the $q \times 1$ vector $\theta$, there are $q$ first 
derivatives and $q(q + 1)/2$ unique second derivatives that need to be 
calculated for each observation at each round of the iterative process, so a 
key component is fast computation of derivatives. 


The derivatives can be computed analytically or numerically. Numerical 
derivatives have the attraction of simplicity but can lead to increased 
computation time compared with analytical derivatives. For the Poisson 
example in section 13.2.2, it was easy to obtain and provide analytical 
derivatives. We now consider numerical derivatives. 


The scalar derivative $df(x)/dx = \lim_{h \to 0}[\{f(x + h) - f(x - h)\}/2h]$, so 
one can approximate the derivative by $\{f(x + h) - f(x - h)\}/2h$ for a 
suitable small choice of $h$. Applying this to the optimization of $Q(\theta)$, where 
differentiation is now with respect to a vector, for the first derivative of 
$Q(\hat\theta_s)$ with respect to the $j$th component of the vector $\theta$, the numerical 
derivative is 

$$\frac{\Delta Q(\hat\theta_s)}{\Delta\theta_j} = \frac{Q(\hat\theta_s + h e_j) - Q(\hat\theta_s - h e_j)}{2h}, \quad j = 1, \ldots, q$$

where $h$ is small and $e_j = (0 \ldots 0\ 1\ 0 \ldots 0)'$ is a column vector with unity 
in the $j$th row and 0s elsewhere. Numerical second derivatives are 
calculated as the numerical first derivative of the numerical or analytical 
first derivative. In theory, $h$ should be very small because formally 
$\partial Q(\theta)/\partial\theta_j$ equals the limit of $\Delta Q(\theta)/\Delta\theta_j$ as $h \to 0$. But in practice, too 
small a value of $h$ leads to inaccuracy due to rounding error. Stata chooses 
$2h$ so that $f(x + h)$ and $f(x - h)$ differ in about half their digits, or 
roughly 8 out of 16 digits, because computations are in double precision. 
This computation of $h$ each time a derivative is taken increases accuracy at 
the expense of considerable increase in computation time. 
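
The tradeoff between approximation error and rounding error can be seen in a small Mata experiment. The function $f(x) = \exp(x)$, the evaluation point, and the three fixed values of $h$ are assumptions chosen for illustration; Stata's own adaptive choice of $h$ is not reproduced here. 

```stata
* Central-difference derivative of exp(x) at x = 1 for three values of h
mata:
x     = 1
hvals = (1e-1, 1e-5, 1e-13)        // too large, moderate, too small
err   = J(1, 3, .)
for (j = 1; j <= 3; j++) {
    h      = hvals[j]
    dnum   = (exp(x + h) - exp(x - h))/(2*h)
    err[j] = abs(dnum - exp(x))    // the true derivative is exp(x)
    printf("h = %9.2e, abs. error = %9.2e\n", h, err[j])
}
end
```

The moderate value of $h$ gives the smallest error: the largest value suffers from approximation error, and the smallest from rounding error in the difference $f(x + h) - f(x - h)$. 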


The number of derivatives is greatly reduced if the objective function is 
an index model with few indexes. In the simplest case of a single-index 
model, $Q(\theta) = N^{-1} \sum_i q(y_i, x_i'\theta)$ so that $\theta$ appears only via $x_i'\theta$. Then, 
by the chain rule, the gradient vector is 

$$\frac{\partial Q(\theta)}{\partial\theta} = \frac{1}{N} \sum_{i=1}^N \frac{\partial q(y_i, x_i'\theta)}{\partial x_i'\theta} \times x_i$$

The $q$ scalar derivatives $\partial q(y_i, x_i'\theta)/\partial\theta_j$ are the same scalar derivative 
$\partial q(y_i, x_i'\theta)/\partial x_i'\theta$ times $x_i$. Similarly, the $q(q + 1)/2$ unique second 
derivatives are simple multiples of a scalar second derivative. 
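
The chain-rule result can be checked numerically for the Poisson log likelihood, where the scalar derivative with respect to the index is $y_i - \exp(x_i'\beta)$. In the Mata sketch below, the simulated data, the seed, the evaluation point b, and the step h are all assumptions made for illustration; the objective is written as a sum rather than an average, and the term $\ln y_i!$ is dropped because it does not depend on $\beta$. 

```stata
* Gradient via one scalar derivative per observation versus central differences
mata:
rseed(12345)
N = 200
X = (rnormal(N, 1, 0, 1), J(N, 1, 1))          // one regressor plus constant
y = rpoisson(1, 1, exp(X*(0.3 \ 0.5)))         // simulated counts
b = (0.1 \ 0.2)                                // evaluation point (arbitrary)
// Chain rule: scalar derivative with respect to the index, times x_i
ganalytic = X'*(y - exp(X*b))
// Central differences applied to each parameter in turn
h = 1e-5
gnumeric = J(2, 1, .)
for (j = 1; j <= 2; j++) {
    ej    = J(2, 1, 0)
    ej[j] = 1
    Qplus  = sum(-exp(X*(b + h*ej)) + y:*(X*(b + h*ej)))
    Qminus = sum(-exp(X*(b - h*ej)) + y:*(X*(b - h*ej)))
    gnumeric[j] = (Qplus - Qminus)/(2*h)
}
ganalytic, gnumeric
end
```

The analytical gradient needs only one scalar derivative per observation, whereas the numerical version reevaluates the whole objective function twice for every parameter. 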


For a multi-index model with $J$ indexes (often $J \le 2$), there are $J$ first 
derivatives to be calculated. Because few derivatives need to be computed, 
computation is slowed down little when numerical derivatives are used 
rather than analytical derivatives if $J$ is small. 


16.3.8 Constraints on parameters 


Common constraints on parameters are that a variance should be positive 
and a correlation parameter should be between $-1$ and 1. These constraints 
can be accommodated by estimating $\ln\sigma$ rather than $\sigma$ and estimating atanh 
$\rho$ rather than $\rho$, where $\mathrm{atanh}\,\rho = (1/2) \ln\{(1 + \rho)/(1 - \rho)\}$. Then, 
retransform to yield estimates $\hat\sigma$ and $\hat\rho$. 
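
The retransformation step is a one-line calculation for each parameter. In the Mata sketch below, the values of lnsigma and athrho are hypothetical numbers standing in for unrestricted estimates; they do not come from any model in this chapter. 

```stata
* Retransforming unrestricted parameters to sigma and rho (illustrative)
mata:
lnsigma = -0.7                     // hypothetical estimate of ln(sigma)
athrho  =  1.9                     // hypothetical estimate of atanh(rho)
sigma   = exp(lnsigma)             // positive for any real lnsigma
rho     = tanh(athrho)             // strictly between -1 and 1
printf("sigma = %9.6f, rho = %9.6f\n", sigma, rho)
// atanh(rho) agrees with the logarithmic formula in the text
check = 0.5*ln((1 + rho)/(1 - rho))
printf("atanh(rho) = %9.6f, log formula = %9.6f\n", atanh(rho), check)
end
```

Whatever real values the optimizer proposes for the unrestricted parameters, the implied $\sigma$ and $\rho$ stay inside their admissible ranges. 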


16.4 Overview of ml, moptimize(), and optimize() 


Stata has three general optimization tools for user-defined objective 
functions. The ml command is the simplest and requires only use of Stata 
commands. The moptimize() and optimize() functions, by contrast, are 
coded in Mata and can be much faster than ml. 


The ml command and the moptimize() function are designed for single- 
and multiple-index models, and the ml command is a Stata command front-end 
to the Mata function moptimize(). The optimize() function is intended 
for quite general objective functions that need not be index functions. 
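
To give a flavor of the most general of these tools, the following is a minimal sketch of optimize() applied to a Poisson log likelihood with a type d0 evaluator, which returns only the value of the objective function. The evaluator name poisd0_sketch(), the simulated data, the seed, and the zero starting values are assumptions made for illustration. 

```stata
* Poisson ML with Mata optimize() and a d0 evaluator (illustrative sketch)
capture mata: mata drop poisd0_sketch()
mata:
void poisd0_sketch(todo, b, y, X, lnf, g, H)
{
    real colvector xb
    xb  = X*b'                                  // b is a row vector
    lnf = sum(-exp(xb) + y:*xb - lnfactorial(y))
}
rseed(10101)
N = 500
X = (rnormal(N, 1, 0, 1), J(N, 1, 1))           // one regressor plus constant
y = rpoisson(1, 1, exp(X*(0.3 \ 0.5)))          // simulated counts
S = optimize_init()
optimize_init_evaluator(S, &poisd0_sketch())
optimize_init_evaluatortype(S, "d0")            // no derivatives supplied
optimize_init_argument(S, 1, y)                 // passed to the evaluator after b
optimize_init_argument(S, 2, X)
optimize_init_params(S, J(1, cols(X), 0))       // starting values
b = optimize(S)
b
end
```

With a d0 evaluator, all derivatives are computed numerically, so this is the least coding effort but usually not the fastest option. 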


Unless computational time is a major consideration, it is best to use the 
simplest method possible. Note that for some objective functions, one can 
use the simpler mlexp command; see section 13.3.3. This has the advantage 
that postestimation commands such as margins and predict are then simple 
to implement. 


16.4.1 Evaluator types 


Each one of ml, moptimize(), and optimize() has several ways of defining 
the objective function: Stata refers to these as evaluator types. These vary 
according to whether 


1. the objective function sums over a distinct entry for each observation, 
   so $Q(\theta) = \sum_i q_i(\theta)$. In this case, an evaluator function can provide 
   individual observations $q_i(\theta)$. 
2. the parameters enter as indexes (single or multiple). 
3. analytical expressions for first and second derivatives are provided. 


When 1 holds, the evaluator can define the components $q_i(\theta)$, enabling 
computation of robust standard errors. When 2 holds, computation is sped 
up, as explained in the preceding subsection. And 3 also speeds up 
computation. For example, we use lf0 if no derivatives are provided, lf1 if 
first derivatives are provided, and lf2 if first and second derivatives are 
provided. 


The lf, lf0, lf1, and lf2 methods, or linear-form methods, are used 
when both 1 and 2 hold. The most general model is then a multi-index model 
with $J$ dependent variables and $m$ indexes that specifies 


$$Q(\theta) = \sum_{i=1}^{N} q(y_{1i}, \ldots, y_{Ji}, x_{1i}'\theta_1, \ldots, x_{mi}'\theta_m)$$


where Stata automatically includes a constant in each of $x_{1i}, \ldots, x_{mi}$. For 
these methods, the evaluator defines each individual contribution $q_i(\theta)$, and 
any derivatives provided are derivatives of $q_i(\theta)$. 


The d0, d1, and d2 methods, or derivative methods, are used when 2 
holds but 1 does not. This is the case, for example, for estimation of panel 
models or clustered-data models with fixed or random effects, so that with $G$ 
clusters, say, the objective function $Q(\theta)$ sums over $G$ terms rather than $N$ 
terms. Then the evaluator defines $Q(\theta)$, and any derivatives provided are 
derivatives of $Q(\theta)$. 


Robust standard errors can be directly obtained using the lf0-lf2 
methods but not using the d0-d2 methods. The gf0, gf1, and gf2 methods, 
or general-form methods, can yield robust standard errors even when 1 does 
not hold. 


The q0 and q1 methods, or quadratic-form methods, are for the special 
case where the objective function is a quadratic form $Q(\theta) = h(\theta)'Wh(\theta)$ 
as used in NLS and GMM estimation. 


Table 16.1 provides a summary of the various evaluators and the 
optimization tools for which they are available. 


Table 16.1. Evaluator types for ml, moptimize(), and optimize() 


Evaluator   ml   moptimize()   optimize()   Description 

lf          ✓    ✓                          linear form 
lf0         ✓    ✓                          linear form 
lf1         ✓    ✓                          lf0 with gradient provided 
lf2         ✓    ✓                          lf1 with Hessian provided 
d0          ✓    ✓             ✓            derivative 
d1          ✓    ✓             ✓            d0 with gradient provided 
d2          ✓    ✓             ✓            d1 with Hessian provided 
gf0         ✓    ✓             ✓            general form 
gf1              ✓             ✓            gf0 with gradient provided 
gf2              ✓             ✓            gf1 with Hessian provided 
q0               ✓                          quadratic form 
q1               ✓                          q0 with derivative provided 


Where it is relevant, the additional evaluator types lf1debug, lf2debug, 
d1debug, d2debug, gf1debug, gf2debug, and q1debug provide numerical 
checks that the relevant derivatives have been correctly provided. 


The use of survey commands and computation of robust and cluster-robust 
standard errors are possible with lf*, gf*, and q* evaluators but not 
with d* evaluators. 


16.4.2 Optimization techniques 


Most of the gradient methods discussed in section 16.3.2 are available. 


Table 16.2 provides a summary. 


Table 16.2. Optimization techniques for evaluator types for ml, 
moptimize(), and optimize() 


Technique   ml   moptimize()   optimize()   Description 

nr          ✓    ✓             ✓            Modified NR 
dfp         ✓    ✓             ✓            Davidon-Fletcher-Powell 
bfgs        ✓    ✓             ✓            Broyden-Fletcher-Goldfarb-Shanno 
bhhh        ✓    ✓             ✓            Berndt-Hall-Hall-Hausman 
nm               ✓             ✓            Nelder-Mead 
gn               ✓                          Gauss-Newton (quadratic optimization) 


The default is nr, except for the moptimize() evaluators q0 and q1, 
which use gn. Note that for a given optimization tool, not all optimization 
techniques will be available for all evaluators. For example, for the 
optimize() function and type d evaluators, the bhhh and gn techniques are 
not available. 
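

As an illustration of switching techniques, the sketch below fits the same Poisson model twice with ml, once with technique(bfgs) and once with the default nr, and compares the coefficient vectors. The program name lfpois_demo and the simulated data are assumptions made so that the example is self-contained; the method lf evaluator itself is explained in section 16.5.

```stata
* Sketch: same model, two optimization techniques (hypothetical names and data)
capture program drop lfpois_demo
program lfpois_demo
    version 17
    args lnf theta1                     // theta1 = x'b
    quietly replace `lnf' = -exp(`theta1') + $ML_y1*`theta1' - lnfactorial($ML_y1)
end

clear
set obs 1000
set seed 10101
generate x = rnormal(0, 0.5)
generate y = rpoisson(exp(2 + x))

ml model lf lfpois_demo (y = x), technique(bfgs)
ml maximize, nolog
matrix b_bfgs = e(b)

ml model lf lfpois_demo (y = x), technique(nr)   // nr is the default
ml maximize, nolog

* Largest relative difference between the two coefficient vectors
display mreldif(b_bfgs, e(b))
```

For a globally concave objective function such as this one, the two techniques should agree up to the convergence tolerance; larger discrepancies would suggest a convergence problem.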


16.5 The ml command: lf method 


The Stata optimization command ml focuses on multi-index models to speed 
up the computation of derivatives. The name ml is somewhat misleading 
because the command can be applied to any m estimator (see the NLS 
example in section 16.5.5), but in non-ML cases, one should always use a 
robust estimate of the VCE. 


The lf method is the simplest method. It requires the formula for a 
single observation's contribution to the objective function. For ML 
estimation, this is the log density. The more advanced methods lf0-lf2, 
d0-d2, and gf0 are deferred to section 16.7. 


16.5.1 The ml commands 


The key ml commands are the ml model command to define the model to be 
fit and the ml maximize command to perform the maximization. 


The syntax for ml model is 


ml model method progname eq1 [eq2 ...] [if] [in] [weight] [, options] 


For example, ml model lf lfpois (y=x1 x2) will use the lfpois program to 
estimate the parameters of a single-index model with the dependent variable 
y, the regressors x1 and x2, and an intercept. 


The lf, d0, and lf0 methods use only numerical derivatives, the d1 and 
lf1 methods use analytical first derivatives and numerical second 
derivatives, and the d2 and lf2 methods use only analytical derivatives. The 
user must provide the formulas for any analytical derivatives. Finally, for 
methods d2 and lf2 that provide expressions for second derivatives, the 
negh option is used if the negative Hessian rather than the Hessian is 
specified. 


The syntax for ml maximize is 


ml maximize [, options] 


where many of the options are the maximization options covered in 
section 16.3.1. 


There are several other ml commands. These include ml check to check 
that the objective function is valid; ml search to find better starting values; 
ml trace to trace maximization execution; and ml init to provide starting 
values. 


16.5.2 The lf method 


The simplest ml method is the lf method. This is intended for the special 
case where the objective function is an m estimator, simply a sum or average 
over the $N$ observations of a subfunction $q_i(\theta)$, with parameters that enter as 
a single-index or a multi-index form. Then, 


$$Q(\theta) = \sum_{i=1}^{N} q(y_i, x_{1i}'\theta_1, \ldots, x_{Ji}'\theta_J)$$


Usually, $J = 1$, in which case $q_i(\theta) = q(y_i, x_{1i}'\theta_1)$, or $J = 2$. Most 
cross-sectional likelihoods and official Stata commands fall into this class, which 
Stata documentation refers to as meeting the linear-form restrictions. 


The lf method requires that a program be written to give the formula for 
the subfunction $q_i(\theta)$. This program is subsequently called by the ml model 
lf command. 


The Stata documentation refers to the first index $x_{1i}'\theta_1$ as theta1, the 
second index $x_{2i}'\theta_2$ as theta2, and so on. This has the potential to cause 
confusion with the standard statistical terminology, where $\theta$ is the generic 
notation for the parameter vector. 


16.5.3 Poisson example: Single-index model 


For the Poisson MLE, $Q(\beta) = \sum_i q_i(\beta)$, where the log density 


$$q_i(\beta) = -\exp(x_i'\beta) + y_i x_i'\beta - \ln y_i!$$


is of single-index form. 


We first write the program, referenced in ml model, that evaluates $q_i(\beta)$. 
This program has two arguments: lnf, for the evaluated log density for an 
individual observation, and theta1, for the single index $x_i'\beta$. 


For all ml programs, the dependent variable $y_i$ is assigned by Stata to the 
global macro $ML_y1. A second dependent variable would be assigned 
$ML_y2, and so on. 


To improve program readability, we use the local macro y to substitute 
for $ML_y1, we define the temporary variable mu equal to exp(theta1), and 
we define the temporary variable lnyfact equal to $\ln y!$. The program 
argument lnf stores the result $q_i(\beta)$: 


. * Poisson ML program lfpois to be called by command ml method lf 
. program lfpois 
      version 17 
      args lnf theta1                    // theta1=x'b, lnf=lnf(y) 
      tempvar lnyfact mu 
      local y "$ML_y1"                   // Define y so program more readable 
      generate double `lnyfact' = lnfactorial(`y') 
      generate double `mu' = exp(`theta1') 
      quietly replace `lnf' = -`mu' + `y'*`theta1' - `lnyfact' 
  end 


We could have more directly defined lnf as 


`lnf' = -exp(`theta1') + $ML_y1*`theta1' - lnfactorial($ML_y1) 


The preceding code instead breaks this into pieces, which can be 
advantageous when lnf is complex. Stata computes lnf using double 
precision, so the intermediate variables should also be calculated in double 
precision. The lnfactorial() function is used rather than first computing $y!$ 
and then taking the natural logarithm because the latter method is not 
possible if $y$ is at all large. 
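

The point about lnfactorial() can be seen in a short numerical sketch. Here exp(lnfactorial(y)) stands in for "first computing $y!$", which is an assumption made for illustration; the values 170 and 171 are chosen because 171! is larger than the largest number a double can store.

```stata
* ln(y!) computed directly and by way of y! itself
display lnfactorial(170)
display ln(exp(lnfactorial(170)))    // still feasible: 170! fits in a double
display lnfactorial(171)
display ln(exp(lnfactorial(171)))    // missing: 171! overflows double precision
```

Counts well above 170 are not unusual in applied work, so an evaluator that forms $y!$ before taking logs would generate missing values for those observations.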


The essential commands are ml model and ml maximize. It is good 
practice to additionally use the ml check and ml search commands before 
ml maximize. 


For Poisson regression of docvis on several regressors, with a 
heteroskedastic—robust estimate of the VCE, we have 


. * Command ml model, including defining y and x, plus ml check 
. ml model lf lfpois (docvis = private chronic female income), vce(robust) 

. ml check 

Test 1: Calling lfpois to check if it computes log pseudolikelihood and 
        does not alter coefficient vector... 
        Passed. 

Test 2: Calling lfpois again to check if the same log pseudolikelihood value 
        is returned... 
        Passed. 

  (output omitted) 


We then type ml search to try to obtain better starting values. 


. * Search for better starting values 
. ml search 
initial:       log pseudolikelihood = -32853.851 
rescale:       log pseudolikelihood = -23375.641 


ML estimation then occurs by typing 


. * Command lf implemented for Poisson MLE 
. ml maximize 
initial:       log pseudolikelihood = -23375.641 
rescale:       log pseudolikelihood = -23375.641 
Iteration 0:   log pseudolikelihood = -23375.641 
Iteration 1:   log pseudolikelihood = -21094.397 
Iteration 2:   log pseudolikelihood = -18511.276 
Iteration 3:   log pseudolikelihood = -18503.552 
Iteration 4:   log pseudolikelihood = -18503.549 
Iteration 5:   log pseudolikelihood = -18503.549 

                                                  Number of obs =  4,412 
                                                  Wald chi2(4)  = 594.72 
Log pseudolikelihood = -18503.549                 Prob > chi2   = 0.0000 

             |               Robust 
      docvis | Coefficient  std. err.    P>|z|     [95% conf. interval] 
-------------+--------------------------------------------------------- 
     private |   .7986654   .1090015     0.000     .5850265    1.012304 
     chronic |   1.091865   .0559951     0.000     .9821167    1.201614 
      female |   .4925481   .0585365     0.000     .3778187    .6072775 
      income |    .003557   .0010825     0.001     .0014354    .0056787 
       _cons |  -.2297263   .1108733     0.038    -.4470339   -.0124188 


Note that an intercept was automatically added. The results are the same as 
those given in section 13.3.2 using poisson and as those obtained in 
section 16.2.3 using Mata. 


The following command yields cluster-robust standard errors, with 
clustering on age. 


. * Same model but with cluster-robust standard errors 
. ml model lf lfpois (docvis = private chronic female income), vce(cluster age) 

. ml maximize 
  (output omitted) 


16.5.4 Negative binomial example: Two-index model 


A richer model for counts is the negative binomial. The log density for this 
model, which is explained in section 20.2.2, is 


$$q_i(\beta, \alpha) = \ln \Gamma(y_i + \alpha^{-1}) - \ln \Gamma(\alpha^{-1}) - \ln y_i! - (y_i + \alpha^{-1}) \ln\{1 + \alpha \exp(x_i'\beta)\} + y_i \ln \alpha + y_i x_i'\beta$$


This introduces an additional parameter, $\alpha$, so that the model is now a 
two-index model, with indexes $x_i'\beta$ and $\alpha$. 


The following program computes $q_i(\beta, \alpha)$ for the negative binomial, 
where the two indexes are referred to as theta1 (equals $x_i'\beta$) and a (equals 
$\alpha$). 


. * Negbin ML two-index program lfnb to be called by command ml method lf 
. program lfnb 
      version 17 
      args lnf theta1 a                  // theta1=x'b, a=alpha, lnf=lnf(y) 
      tempvar mu 
      local y $ML_y1                     // Define y so program more readable 
      generate double `mu' = exp(`theta1') 
      quietly replace `lnf' = lngamma(`y'+(1/`a')) - lngamma((1/`a')) 
>         - lnfactorial(`y') - (`y'+(1/`a'))*ln(1+`a'*`mu') 
>         + `y'*ln(`a') + `y'*ln(`mu') 
  end 


The program has an additional argument, a, so the call to the program 
using ml model includes an additional argument () indicating that a is a 
constant that does not depend on regressors, unlike theta1. We have 


. * Command lf implemented for negative binomial MLE 
. ml model lf lfnb (docvis = private chronic female income) (), vce(robust) 

. ml maximize, nolog 
initial:       log pseudolikelihood =     -<inf>  (could not be evaluated) 
feasible:      log pseudolikelihood = -14722.779 
rescale:       log pseudolikelihood = -10743.548 
rescale eq:    log pseudolikelihood = -10570.445 

                                                  Number of obs =  4,412 
                                                  Wald chi2(4)  = 384.30 
Log pseudolikelihood = -9855.1389                 Prob > chi2   = 0.0000 

             |               Robust 
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
eq1          | 
     private |   .8876559   .1284168     6.91   0.000     .6359636    1.139348 
     chronic |   1.143545   .0614412    18.61   0.000     1.023122    1.263967 
      female |   .5613027   .0663056     8.47   0.000     .4313461    .6912593 
      income |   .0045785   .0010666     4.29   0.000      .002488     .006669 
       _cons |  -.4062135   .1493729    -2.72   0.007     -.698979   -.1134479 
-------------+---------------------------------------------------------------- 
eq2          | 
       _cons |   1.726868   .0796025    21.69   0.000      1.57085    1.882886 


The estimates are the same as those obtained by using the nbreg command 
in section 11.4.1; the standard errors differ because heteroskedastic-robust 
standard errors were used here. 


16.5.5 NLS example: Nonlikelihood model 


The preceding examples were likelihood based, but other m estimators can 
be considered. 


In particular, consider NLS estimation with an exponential conditional 
mean. Then $Q_N(\beta) = 1/N \sum_i \{y_i - \exp(x_i'\beta)\}^2$. This is easily 
estimated by typing 


. * NLS program lfnls to be called by command ml method lf 
. program lfnls 
      version 17 
      args lnf theta1                    // theta1=x'b, lnf=squared residual 
      local y "$ML_y1"                   // Define y so program more readable 
      quietly replace `lnf' = -(`y'-exp(`theta1'))^2 
  end 


Note the minus sign in the definition of lnf because the program is designed 
to maximize, rather than minimize, the objective function. 


Running this program, we obtain 


. * Command lf implemented for NLS estimator 
. ml model lf lfnls (docvis = private chronic female income), vce(robust) 

. ml maximize 
  (output omitted) 


The results, omitted here, give the same coefficient estimates as those 
obtained from the nl command, which are given in section 13.3.6. The 
corresponding robust standard errors differ, however, by as much as 5%. The 
reason is that nl uses the expected Hessian in forming the robust estimate of 
the VCE (see section 13.4.5), exploiting additional information about the NLS 
estimator. The ml method instead uses the empirical Hessian. For NLS, these 
two differ, whereas for Poisson MLE, they do not. 


For this example, the default estimate of the VCE, the inverse of the 
negative Hessian matrix, will always be wrong. To see this, consider 
ordinary least squares (OLS) in the linear model. Then ml maximizes 
$Q_N(\beta) = -(y - X\beta)'(y - X\beta)$, which has a Hessian of $-2 \times X'X$. Even if the 
errors are homoskedastic, ml would give an estimate of $(1/2)(X'X)^{-1}$ 
rather than $s^2(X'X)^{-1}$. Whenever ml is used to optimize models that are not 
likelihood based, a robust estimate of the VCE must be used. 
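

A short derivation shows why the robust estimate repairs the scaling problem in the OLS case. It is a sketch that writes the maximized objective as $Q_N(\beta) = -\sum_i (y_i - x_i'\beta)^2$, denotes the residual by $u_i = y_i - x_i'\beta$, and ignores any finite-sample adjustment that software may apply.

```latex
\begin{align*}
g_i &= \frac{\partial\{-(y_i - x_i'\beta)^2\}}{\partial \beta} = 2 u_i x_i,
\qquad
H = \sum_{i=1}^{N} \frac{\partial g_i}{\partial \beta'} = -2 X'X, \\
(-H)^{-1} &= \tfrac{1}{2}(X'X)^{-1}
\quad \text{(default estimate: wrong scale)}, \\
H^{-1}\Big(\sum_{i=1}^{N} g_i g_i'\Big) H^{-1}
 &= (2X'X)^{-1}\Big(4\sum_{i=1}^{N} u_i^2 x_i x_i'\Big)(2X'X)^{-1}
  = (X'X)^{-1}\Big(\sum_{i=1}^{N} u_i^2 x_i x_i'\Big)(X'X)^{-1}.
\end{align*}
```

The arbitrary factor of 2 cancels in the sandwich form, which is the usual heteroskedasticity-robust variance matrix for OLS.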


16.6 Checking the program 


The initial challenge is to debug a program and get it to successfully run, 
meaning that iterations converge and plausible regression output is obtained. 
There is a great art to this, and there is no replacement for experience. There 
are many ways to make errors, especially given the complexity of program 
syntax. 


The next challenge is to ensure that computations are done correctly to 
verify that plausible output is indeed correct output. This is feasible if one 
can generate simulated data that satisfy model assumptions. 


We focus on community-contributed programs for ml, but many of the 
following points apply to evaluating any estimator. 


16.6.1 Program debugging using ml check and ml trace 


The ml check command provides a check of the code to ensure that one can 
evaluate lnf, though this does not ensure that the evaluation is correct. 


This command is most useful for checking program syntax because it 
provides much more detailed information than if we instead proceed directly 
to ml maximize. 


For example, suppose in the lfpois program we typed the line 


generate double `mu' = exp(`heta1') 


The mistake is that `heta1' was typed rather than `theta1'. The ml 
maximize command leads to failure and the following error message: 


invalid syntax 
r(198); 


This message is not particularly helpful. If instead we type 


. ml search 


before ml maximize, the program again fails, but the output now includes 


- generate double `mu' = exp(`heta1') 
= generate double __000006 = exp( 
invalid syntax 


which indicates that the error is due to a problem with `heta1'. 


More complete information is given by the ml trace command. If we 
type 


. ml trace on 


before ml maximize, the program fails, and we get essentially the same output 
as when using ml search. Once the program is corrected and runs 
successfully, ml trace provides extensive details on the execution of the 
program. In this example, 980 lines of detail are given. 


The trace facility can also be used for commands other than ml by typing 


. set trace on 


A drawback of using trace is that it can produce copious output. 


A more manual-targeted method to determine where problems may arise 
in a program is to include messages in the program. For example, suppose we 
place in the lfpois program the line 


display "I made it to here" 


If the program fails after this line is displayed, then we know that the 
problem arose beyond the line where the display statement was given. 


16.6.2 Getting the program to run 


The ml check command essentially checks program syntax. This does not 
protect against other coding errors such as misspecification of the log density. 
Suppose, for example, that in the lfpois program we typed 


quietly replace `lnf' = `mu' + `y'*ln(`mu') - `lnyfact' 


The error here is that we have `mu' rather than -`mu'. Then we obtain 


. ml maximize 
initial:       log pseudolikelihood = -25075.609 
alternative:   log pseudolikelihood = -13483.451 
(4412 missing values generated) 
rescale:       log pseudolikelihood =  1.01e+226 
Iteration 0:   log pseudolikelihood =  1.01e+226  (not concave) 
(1 missing value generated) 
Iteration 1:   log pseudolikelihood =  1.76e+266  (not concave) 
(762 missing values generated) 
Hessian has become unstable or asymmetric (NC) 
r(504); 


Here the error has occurred quite early. One possibility is that poor starting 
values were given. But using ml search leads to far worse starting values in 
this example. In this case, the most likely explanation is an error in the 
objective function. 


Poor starting values can lead to problems if the objective function is not 
globally concave. For index models, a good approach is to set all parameters 
to zero aside from the constant, which is set to that value appropriate for an 
intercept-only model. For example, for the Poisson intercept-only model with 
the parameter $\alpha$, we have $\hat{\alpha} = \ln \bar{y}$ because then $\exp(\hat{\alpha}) = \bar{y}$. Thus, the initial 
value is $(0, \ldots, \ln \bar{y})$. It can be useful to try the model with few regressors, 
such as with an intercept and a single regressor. 
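

The starting-value rule just described can be passed to ml with ml init. The sketch below is illustrative only: the program name lfpois_demo and the simulated data are assumptions, and the copy option is used so that the values are assigned in the order of the coefficient vector (slope first, then the constant).

```stata
* Sketch: start at (0, ln ybar) for a Poisson model (hypothetical names and data)
capture program drop lfpois_demo
program lfpois_demo
    version 17
    args lnf theta1
    quietly replace `lnf' = -exp(`theta1') + $ML_y1*`theta1' - lnfactorial($ML_y1)
end

clear
set obs 1000
set seed 10101
generate x = rnormal(0, 0.5)
generate y = rpoisson(exp(2 + x))

quietly summarize y
local lnybar = ln(r(mean))            // intercept-only Poisson MLE of the constant

ml model lf lfpois_demo (y = x)
ml init 0 `lnybar', copy              // coefficient order: x, _cons
ml maximize, nolog
```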


The ml methods d1, d2, lf1, and lf2, presented in section 16.7, require 
analytical expressions for the first and second derivatives. The ml model 
d1debug and ml model d2debug commands check these by comparing these 
expressions with numerical first and second derivatives. Any substantial 
difference indicates error in coding the derivatives or, possibly, the original 
objective function. There are similar methods lf1debug and lf2debug. 
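

A minimal sketch of a derivative check follows, using method lf1debug with a Poisson evaluator that supplies the score with respect to the index. The program name lf1pois_demo and the simulated data are assumptions; the argument list (todo, b, lnfj, g1) follows the lf1 conventions described in section 16.7.1.

```stata
* Sketch: check an analytical score against numerical derivatives
capture program drop lf1pois_demo
program lf1pois_demo
    version 17
    args todo b lnfj g1
    tempvar theta1
    mleval `theta1' = `b', eq(1)                 // index x'b from parameter vector
    local y $ML_y1
    quietly replace `lnfj' = -exp(`theta1') + `y'*`theta1' - lnfactorial(`y')
    if (`todo'==0) exit
    quietly replace `g1' = `y' - exp(`theta1')   // d lnfj / d theta1
end

clear
set obs 1000
set seed 10101
generate x = rnormal(0, 0.5)
generate y = rpoisson(exp(2 + x))

ml model lf1debug lf1pois_demo (y = x)
ml maximize
```

The output from ml maximize then includes a report comparing the analytical and numerical derivatives at each iteration. Once the differences are small, lf1debug can be replaced with lf1.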


16.6.3 Checking the data 


A common reason for program failure is that the program is fine but that the 
data passed to the program are not. 


For example, the lfpois program includes the lnfactorial(`y') 
function, which requires that `y' be a nonnegative integer. Consider the 
impact of the following: 


. replace docvis = 0.5 if docvis == 
. ml model lf lfpois (docvis = private chronic female income), vce(robust) 

. ml maximize 


The resulting output after ml maximize includes many lines with 


(700 missing values generated) 


followed by 


could not find feasible values 
r(491); 


One should always use summarize to obtain summary statistics of the 
dependent variable and regressors ahead of estimation as a check. Indications 
of problems include an unexpected range, zero standard deviation, and 
missing values. In this particular example, summarize will not detect the 
problem, but tabulate docvis will. 
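

Such checks can also be automated before estimation. The following sketch uses simulated data, with ten observations deliberately corrupted, purely as an assumption for illustration; it counts and then asserts on observations that are not nonnegative integers.

```stata
* Sketch: verify that a count outcome is a nonnegative integer before using ml
clear
set obs 1000
set seed 10101
generate x = rnormal(0, 0.5)
generate y = rpoisson(exp(2 + x))
replace y = 0.5 in 1/10               // deliberately corrupt ten observations

summarize y x                         // the range looks plausible, so this misses it
count if y < 0 | y != floor(y) | missing(y)
display "observations that are not nonnegative integers: " r(N)

* assert stops a do-file on failure; capture noisily reports and continues
capture noisily assert y >= 0 & y == floor(y)
```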


16.6.4 Multicollinearity and near collinearity 


If variables are perfectly collinear, then Stata estimation commands, 
including ml, detect multicollinearity and drop the regressors as needed. 


If variables are close to perfectly collinear, then numerical instability may 
cause problems. We illustrate this by adding the additional regressor extra, 
which equals income plus au, where u is a draw from the uniform 
distribution and a = 0.001 or 0.01. 


First, add the regressor extra, which equals income plus 0.001u. We have 


. * Example with high collinearity interpreted as perfect collinearity 
. generate extra = income + 0.001*runiform() 

. ml model lf lfpois (docvis = private chronic female income extra), vce(robust) 
note: extra omitted because of collinearity. 


Here ml interprets this as perfect collinearity and drops extra 
before maximization. 


Next, instead add the regressor extra2, which equals income plus the 
larger amount 0.01u. We have 


. * Example with high collinearity not interpreted as perfect collinearity 
. generate extra2 = income + 0.01*runiform() 

. ml model lf lfpois (docvis = private chronic female income extra2), vce(robust) 

. ml maximize, nolog 

                                                  Number of obs =  4,412 
                                                  Wald chi2(5)  = 593.92 
Log pseudolikelihood = -18501.885                 Prob > chi2   = 0.0000 

             |               Robust 
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7982791   .1091528     7.31   0.000     .5843436    1.012215 
     chronic |   1.092004   .0559761    19.51   0.000     .9822934    1.201715 
      female |   .4922122   .0586045     8.40   0.000     .3773495     .607075 
      income |   -4.75283    10.0911    -0.47   0.638    -24.53102    15.02536 
      extra2 |     4.7564   10.09136     0.47   0.637     -15.0223     24.5351 
       _cons |  -.2535819   .1154495    -2.20   0.028    -.4798587   -.0273051 


Now this is no longer interpreted as perfect collinearity, so estimation 
proceeds. The coefficients of income and extra2 are very imprecisely 
estimated, while the remaining coefficients and standard errors are close to 
those given in section 13.3.2. 


Pairwise collinearity can be detected by using the correlate command, 
and multicollinearity can be detected with the _rmcoll command. For 
example, 


. * Detect multicollinearity by using _rmcoll 
. _rmcoll income extra 
note: extra omitted because of collinearity. 

. _rmcoll income extra2 


Another simple data check is to see whether the parameters of the model 
can be estimated by using a closely related Stata estimation command. For 
the lfpois program, the obvious test is regression using poisson. If a 
command to compute Poisson regression was not available, we could at least 
try OLS regression, using regress. 
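

A sketch of this comparison follows. The program name lfpois_demo and the simulated data are assumptions made so that the example runs on its own; with real data, one would substitute the evaluator and variables actually in use.

```stata
* Sketch: compare a user-written ml evaluator with the official poisson command
capture program drop lfpois_demo
program lfpois_demo
    version 17
    args lnf theta1
    quietly replace `lnf' = -exp(`theta1') + $ML_y1*`theta1' - lnfactorial($ML_y1)
end

clear
set obs 1000
set seed 10101
generate x = rnormal(0, 0.5)
generate y = rpoisson(exp(2 + x))

ml model lf lfpois_demo (y = x)
ml maximize, nolog
matrix b_ml = e(b)

poisson y x, nolog

* Largest relative difference between the two sets of estimates
display mreldif(b_ml, e(b))
```

A relative difference far above the convergence tolerance would point to an error in the evaluator rather than in the data.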


16.6.5 Checking parameter estimation 


Once a program runs, we need to check that it is correct. The following 
approach is applicable to estimation with any method, not just with the ml 
command. 


To check parameter estimation, we generate data from the same 
data-generating process (DGP) as that justifying the estimator for a large sample 
size $N$. Because the desired estimator is consistent as $N \to \infty$, we expect the 
estimated parameters to be very close to those of the DGP. A similar exercise 
was done in section 5.6. 


To check a Poisson model-estimation program, we generate data from the 
following DGP: 


$$y_i \sim \text{Poisson}\{\exp(\beta_1 + \beta_2 x_i)\}; \quad (\beta_1, \beta_2) = (2, 1); \quad x_i \sim N(0, 0.5^2); \quad i = 1, \ldots, 10000$$


The following code generates data from this DGP: 


. * Generate dataset from Poisson DGP for large N 
. clear 

. set obs 10000 
Number of observations (_N) was 0, now 10,000. 

. set seed 10101 

. generate x = rnormal(0,0.5) 

. generate mu = exp(2 + x) 

. generate y = rpoisson(mu) 

. summarize mu x y 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
          mu |     10,000    8.384263    4.430457   .9421767   44.62779 
           x |     10,000    .0006259    .5038249  -2.059562   1.798357 
           y |     10,000      8.3369    5.271165          0         53 


The normal regressor has the desired mean and variance, and the count 
outcome has a mean of 8.34 and ranges from 0 to 53. 


We then run the previously defined lfpois program. Because the DGP is 
known to be Poisson here, we simply use default standard errors. 


. * Consistency check: Run program lfpois and compare beta to DGP value 
. ml model lf lfpois (y = x) 

. ml maximize, nolog 

                                                  Number of obs =   10,000 
                                                  Wald chi2(1)  = 20456.00 
Log likelihood = -23995.411                       Prob > chi2   =   0.0000 

           y | Coefficient  Std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
           x |   .9978772    .006977   143.02   0.000     .9842026    1.011552 
       _cons |   1.994867   .0038788   514.30   0.000     1.987265    2.002469 


The estimates are $\hat{\beta}_2 = 0.998$ and $\hat{\beta}_1 = 1.995$, quite close to the DGP values 
of 1 and 2. We expect the 95% confidence interval for $\beta_2$ to include the DGP 
value of 1 in 95% of such simulations. The 95% confidence interval of this 
simulation is [0.984, 1.012], which includes 1. 


If N = 1000000, for example, estimation is more precise, and we expect 
estimates to be very close to the DGP values. 


This DGP is quite simple. More challenging tests would consider a DGP 
with additional regressors from other distributions. 
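

One possible richer DGP is sketched below, with a normal, a binary, and a skewed chi-squared regressor. The coefficient values, variable names, and the program name lfpois_demo are illustrative assumptions rather than choices made elsewhere in this chapter.

```stata
* Sketch: consistency check with regressors from several distributions
capture program drop lfpois_demo
program lfpois_demo
    version 17
    args lnf theta1
    quietly replace `lnf' = -exp(`theta1') + $ML_y1*`theta1' - lnfactorial($ML_y1)
end

clear
set obs 10000
set seed 10101
generate x1 = rnormal(0, 0.5)            // continuous, symmetric
generate x2 = runiform() < 0.4           // binary indicator
generate x3 = rchi2(1)                   // continuous, right-skewed
generate y  = rpoisson(exp(1 + 0.5*x1 + 0.3*x2 - 0.1*x3))

ml model lf lfpois_demo (y = x1 x2 x3)
ml maximize, nolog                       // compare with (0.5, 0.3, -0.1, 1)
```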


16.6.6 Checking standard error estimation 


To check that standard errors for an estimator $\hat{\beta}$ are computed correctly, we 
can perform, say, $S = 2000$ simulations that yield $S$ estimates $\hat{\beta}$ and $S$ 
computed standard errors $s_{\hat{\beta}}$. If the standard errors are correctly estimated, 
then the average of the $S$ computed standard errors, $\bar{s}_{\hat{\beta}} = S^{-1} \sum_{s=1}^{S} s_{\hat{\beta}}$, 
should equal the standard deviation of the $S$ estimates $\hat{\beta}$, which is 
$\{(S - 1)^{-1} \sum_{s=1}^{S} (\hat{\beta} - \bar{\hat{\beta}})^2\}^{1/2}$, where $\bar{\hat{\beta}} = S^{-1} \sum_{s=1}^{S} \hat{\beta}$. The sample size needs 
to be large enough that we believe that asymptotic theory provides a good 
guide for computing the standard errors. We set $N = 500$. 


We first write the secheck program, which draws one sample from the 
same DGP as was used in the previous section. Again, default standard errors 
are used. 


. * Program to generate dataset, obtain estimate, and return beta and SEs 
. program secheck, rclass 
      version 17 
      drop _all 
      set obs 500 
      generate x = rnormal(0,0.5) 
      generate mu = exp(2 + x) 
      generate y = rpoisson(mu) 
      ml model lf lfpois (y = x) 
      ml maximize 
      return scalar b1 = _b[_cons] 
      return scalar se1 = _se[_cons] 
      return scalar b2 = _b[x] 
      return scalar se2 = _se[x] 
  end 


We then run this program 2,000 times, using the simulate command. (The 
postfile command could alternatively be used.) We have 


. * Standard errors check: Run program secheck 
. set seed 10101 

. simulate "secheck" bcons=r(b1) se_bcons=r(se1) bx=r(b2) se_bx=r(se2), reps(2000) 

      Command: secheck 
   Statistics: bcons    = r(b1) 
               se_bcons = r(se1) 
               bx       = r(b2) 
               se_bx    = r(se2) 

. summarize 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
       bcons |      2,000    2.000752    .0170348   1.930763   2.061243 
    se_bcons |      2,000    .0172855    .0002088   .0166568   .0180361 
          bx |      2,000    .9998756    .0310063   .9022716   1.113206 
       se_bx |      2,000    .0310439    .0015591   .0255744   .0361679 


The column Obs in the summary statistics here refers to the number of 
simulations ($S = 2000$). The actual sample size, set inside the secheck 
program, is $N = 500$. 


For the intercept, we have $\bar{s}_{\hat{\beta}_1} = 0.0173$ compared with 0.0170 for the 
standard deviation of the 2,000 estimates for $\hat{\beta}_1$ (bcons). For the slope, we 
have $\bar{s}_{\hat{\beta}_2} = 0.0310$ compared with 0.0310 for the standard deviation of the 
2,000 estimates for $\hat{\beta}_2$ (bx). The standard errors are correctly estimated. 
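

The comparison can be reduced to a single ratio, as in the following sketch. To keep it self-contained and quick, it assumes the official poisson command in place of a user-written evaluator, only 200 replications, and the hypothetical program name secheck_demo; the prefix form of simulate is used.

```stata
* Sketch: ratio of average computed SE to simulation SD of the estimates
capture program drop secheck_demo
program secheck_demo, rclass
    version 17
    drop _all
    set obs 500
    generate x = rnormal(0, 0.5)
    generate y = rpoisson(exp(2 + x))
    poisson y x                          // stand-in for the estimator being checked
    return scalar b2  = _b[x]
    return scalar se2 = _se[x]
end

set seed 10101
simulate bx=r(b2) se_bx=r(se2), reps(200) nodots: secheck_demo

quietly summarize bx
scalar sd_bx = r(sd)                     // SD of the estimates across simulations
quietly summarize se_bx
scalar ratio = r(mean)/sd_bx             // average computed SE relative to that SD
display "mean SE / SD of estimates = " ratio
```

A ratio close to 1 supports the computed standard errors. With only 200 replications, the simulation standard deviation is itself estimated with noticeable noise, so modest departures from 1 are to be expected.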


16.7 The ml command: lf0-lf2, d0-d2, and gf0 methods 


In the following discussion, the terms “log density” and “log likelihood” are 
used, but more generally, the methods apply to m estimators. Then the log- 
likelihood function is the objective function, and log density is the 
contribution of a single observation. In the nonlikelihood case, however, 
robust standard errors should be used. 


The lf method is fast and simple when the objective function is of the 
form given in section 16.5.2, a form that Stata manuals refer to as the linear 
form; then parameters appear as a single index. The lf method does not 
require specification of derivatives. 


The lf0, lf1, and lf2 methods are like method lf in that they are used 
when the objective function meets the linear-form restrictions and is a sum 
of individual terms for each observation. These methods require you to write 
evaluator programs that are more complicated than those for method lf, in 
exchange for the ability to specify first and second derivatives of the log 
density with methods lf1 and lf2. This can speed up computation compared 
with the lf and lf0 methods that rely on numerical derivatives. If you do not 
plan to provide derivatives, there is no point in using method lf0, because 
method lf provides the same functionality and is easier to use. 


The d0, d1, and d2 methods are more general than lf. Again, parameters 
enter through indexes, so the linear-form restrictions are used, but now one 
can accommodate situations where there are multiple observations or 
equations for each individual in the sample. This can arise with fixed- and 
random-effects models for panel data and clustered data, with conditional 
multinomial logit models, where regressor values vary over each of the 
potential outcomes, in systems of equations, and in the Cox proportional 
hazards model, where a risk set is formed at each failure. The evaluator 
program for method d0 defines the log likelihood, whereas method d1 allows 
one to provide analytical expressions for the gradient, and method d2 
additionally allows an analytical expression for the Hessian. 


The evaluators for the lf* methods provide the log density and, where 
relevant, its derivatives, whereas the d* evaluators provide the objective 
function and, where relevant, its derivatives. In both cases, the objective 
functions and their derivatives are with respect to the indexes $x_{ji}'\theta_j$ rather 
than $\theta_j$. 


16.7.1 ml evaluator functions 


For the do—d2 and 1£0—1f£2 methods, the syntax is 
ml model method progname eq1 | eq2 | lif | [ in | [ weight | E options | 


where method is d0, d1, d2, 1£0, 1f1, or 1£2; progname is the name of an 
evaluator program; eq1 defines the dependent variables and the regressors 
involved in the first index; eq2 defines the regressors involved in the second 
index; and so on. 


The evaluator program, progname, has five arguments for do, di, and d2 
evaluators: todo, b, Inf, g, and H. The mı command uses the todo argument 
to request no derivative, the gradient, or the gradient and the Hessian. The b 
argument is the row vector of parameters 9. The 1nf argument is the scalar 
objective function Q(@). The g argument is a row vector for the gradient 
JQ (0) /d0', which needs to be provided only for the a1 and a2 methods. The 
H argument is the Hessian matrix 0?Q(0) /0000’, which needs to be 
provided for the a2 method. 


The evaluators for the ao—a2 methods first need to link the parameters @ 
to the indexes x},01, .... This is done with the mleval command, which has 
the syntax 


mleval newvar = vecname Ez eq( #) | 


For example, mleval ‘thetal’=‘b’, eq(1) labels the first index x4;01 as 
thetal. The variables in x1; will be listed in eg/ in m1 model. 


Next, the evaluator needs to compute the objective function $Q(\boldsymbol{\theta})$, unlike 
the lf method, where the ith entry $q_i(\boldsymbol{\theta})$ in the objective function is 
computed. The mlsum command sums $q_i(\boldsymbol{\theta})$ to yield $Q(\boldsymbol{\theta})$. The syntax is 


mlsum scalarname_lnf = exp [if] [, noweight] 


For example, mlsum `lnf'=(`y'-`theta1')^2 computes the sum of squared 
residuals $\sum_i (y_i - \mathbf{x}_{1i}'\boldsymbol{\theta}_1)^2$. 


The d1 and d2 methods require specification of the gradient of the 
overall log likelihood $Q(\boldsymbol{\theta})$. For linear-form models, this computation can be 
simplified with the mlvecsum command, which has the syntax 


mlvecsum scalarname_lnf rowvecname = exp [if] [, eq(#)] 


For example, mlvecsum `lnf' `d1'=`y'-`theta1' computes the gradient 
for the subset of parameters that appears in the first index as the row vector 
$\sum_i (y_i - \mathbf{x}_{1i}'\boldsymbol{\theta}_1)\mathbf{x}_{1i}'$. Note that mlvecsum automatically multiplies 
`y'-`theta1' by the regressors $\mathbf{x}_{1i}$ in the index theta1 because equation 
one is the default when eq() is not specified. 
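The multiplication by the regressors is just the chain rule. As a short worked equation, in notation assumed here to match the text, let the objective function depend on $\boldsymbol{\theta}_1$ only through the first index:

```latex
Q(\boldsymbol{\theta}) = \sum_i q_i(\mathbf{x}_{1i}'\boldsymbol{\theta}_1)
\quad\Longrightarrow\quad
\frac{\partial Q(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}_1'}
 = \sum_i \frac{\partial q_i}{\partial(\mathbf{x}_{1i}'\boldsymbol{\theta}_1)}\,\mathbf{x}_{1i}'
```

So the expression passed to mlvecsum is the derivative of $q_i$ with respect to the index, and the command supplies the factor $\mathbf{x}_{1i}'$ and the sum over observations.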


The lf0-lf2 methods, like the d0-d2 methods, require you to first use 
mleval to obtain the indexes $\mathbf{x}_{1i}'\boldsymbol{\theta}_1$, .... Method lf0 evaluators receive three 
arguments: todo, b, and lnfj. The variable lnfj is to be filled in with the 
observation-level log likelihood $q_i(\boldsymbol{\theta})$. The variable is named lnfj, rather 
than lnfi, because official Stata documentation for the ml command uses j, 
rather than i, to denote the typical observation. The key difference between 
methods lf and lf0 is that the former passes to progname the indexes as 
theta1, theta2, etc., while the latter passes to your program the parameter 
vector b from which you must obtain the indexes yourself. 


The lf1 and lf2 methods require specification of the observation-level 
scores associated with the log-likelihood function, that is, the derivatives of 
the log-likelihood function with respect to the indexes $\mathbf{x}_{1i}'\boldsymbol{\theta}_1$, $\mathbf{x}_{2i}'\boldsymbol{\theta}_2$, .... 
Whereas you specify the derivatives $\partial Q(\boldsymbol{\theta})/\partial\boldsymbol{\theta}_j$ with methods d1 and d2, 
methods lf1 and lf2 require you to fill in variables containing 
$\partial q_i(\boldsymbol{\theta})/\partial \mathbf{x}_{ji}'\boldsymbol{\theta}_j$. The predominant advantage of these methods is speed: 
evaluating analytic derivatives is much faster than computing them 
numerically. Moreover, with observation-level first derivatives, ml can 
compute the robust estimate of the VCE, which is not possible with methods 
d0-d2. 


With method lf1, progname receives 3 + J arguments, where J is the 
number of indexes. For a single-index model, progname will receive four 
arguments: todo, b, lnfj, and g1. When todo==1, in addition to filling in 
lnfj with the observation-level log likelihood, you are to fill in the variable 
g1 with $\partial q_i(\boldsymbol{\theta})/\partial \mathbf{x}_{1i}'\boldsymbol{\theta}_1$. Method lf2 receives 4 + J arguments; the final 
argument, H, is a matrix to be filled in with the Hessian of the overall 
log-likelihood function when todo==2. 


The d2 and lf2 methods require specification of the Hessian matrix, the 
default, or the negative Hessian matrix, in which case the negh option is 
added to the ml model command. For linear-form models, this computation 
can be simplified with the mlmatsum command, which has the syntax 


mlmatsum scalarname_lnf matrixname = exp [if] [, eq(#[,#])] 


For example, mlmatsum `lnf' `d1'=`theta1' computes the negative 
Hessian matrix for the subset of parameters that appears in the first index as 
$\sum_i \mathbf{x}_{1i}'\boldsymbol{\theta}_1\,\mathbf{x}_{1i}\mathbf{x}_{1i}'$. The mlmatsum command automatically multiplies `theta1' by 
$\mathbf{x}_{1i}\mathbf{x}_{1i}'$, the outer product of the regressors $\mathbf{x}_{1i}$ in the index theta1. 


16.7.2 ml methods lf0, lf1, and lf2 


We consider the cross-sectional Poisson model, a single-index model. For 
multi-index models such as the Weibull and panel data, Cox proportional 
hazards, and conditional logit models, see the Weibull example in [R] ml and 
Gould, Pitblado, and Poi (2010), who also consider complications such as 
how to make ado-files and how to incorporate sample weights. 


An lf0 method evaluator program for the Poisson MLE is the following: 


. * ml method lf0: Program lf0pois gives lnf(y_i) in terms of index x_i'b 
. program lf0pois 
      version 17 
      args todo b lnfi 
      tempvar theta1                // theta1 = x'b where x given in eq(1) 
      mleval `theta1' = `b', eq(1) 
      local y $ML_y1                // Define y so program more readable 
      qui replace `lnfi' = -exp(`theta1') + `y'*`theta1' - lnfactorial(`y') 
  end 


The code is similar to that given earlier for the lf method. We add the 
additional argument todo, which is not used in an lf0 program but is used in 
lf1 and lf2 programs. And we add mleval to form $\mathbf{x}'\boldsymbol{\theta}$. Here there is one 
dependent variable and only one index. 


We specify the dependent variable to be docvis and the regressors to be 
private, chronic, female, and income, plus an intercept. The following 
code requests heteroskedastic-robust standard errors. 


. qui use mus210mepsdocvisyoung, clear 
. qui keep if year02 == 
. generate cons = 1 


. * ml method lf0: Obtain Poisson MLE with heteroskedastic-robust standard errors 
. ml model lf0 lf0pois (docvis = private chronic female income), vce(robust) 


. ml maximize 


initial: log pseudolikelihood = -33899.609 
alternative: log pseudolikelihood = -28031.767 
rescale: log pseudolikelihood = -24020.669 
Iteration 0: log pseudolikelihood = -24020.669 
Iteration 1: log pseudolikelihood = -23995.423 
Iteration 2: log pseudolikelihood = -18539.168 
Iteration 3: log pseudolikelihood = -18503.596 
Iteration 4: log pseudolikelihood = -18503.549 
Iteration 5: log pseudolikelihood = -18503.549 

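                                                        Number of obs =  4,412 
                                                        Wald chi2(4)  = 594.72 
Log pseudolikelihood = -18503.549                       Prob > chi2   = 0.0000 

------------------------------------------------------------------------------ 
             |               Robust 
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986654   .1090015     7.33   0.000     .5850265    1.012304 
     chronic |   1.091865   .0559951    19.50   0.000     .9821167    1.201614 
      female |   .4925481   .0585365     8.41   0.000     .3778187    .6072775 
      income |    .003557   .0010825     3.29   0.001     .0014354    .0056787 
       _cons |  -.2297264   .1108733    -2.07   0.038    -.4470339   -.0124188 
------------------------------------------------------------------------------ 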


The resulting coefficient estimates are the same as those from both the 
poisson command and the lf method given in section 16.5.3. The 
robust estimate of the VCE is that given in section 13.4.5, with the 
derivatives it requires computed numerically. The standard errors are exactly the same as 
those obtained using the poisson command with the vce(robust) option. 
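This comparison can be reproduced with the built-in command. The line below is a minimal sketch that assumes the dataset and sample restriction used in this section are already in memory; no output is shown because it has not been run here.

```stata
* Benchmark for the lf0pois results: built-in Poisson MLE with robust VCE
poisson docvis private chronic female income, vce(robust)
```

The coefficient and robust standard-error columns are the ones to compare with the ml output above.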


The lf1 method evaluator program additionally provides an analytical 
expression for the derivative of the log density of the ith observation with 
respect to the index $\mathbf{x}_{i}'\boldsymbol{\theta}$. We add the argument gi to the program. We have 


. * ml method lf1: Program lf1pois adds analytical first derivatives 
. program lf1pois 
      version 17 
      args todo b lnfi gi 
      tempvar theta1                // theta1 = x'b where x given in eq(1) 
      mleval `theta1' = `b', eq(1) 
      local y $ML_y1                // Define y so program more readable 
      qui replace `lnfi' = -exp(`theta1') + `y'*`theta1' - lnfactorial(`y') 
      if (`todo'==0) exit 
      qui replace `gi' = `y' - exp(`theta1')   // Extra code for robust 
  end 


The model is run in the same way, with lf0 replaced by lf1 and the 
evaluator function lf0pois replaced by lf1pois. Applying this program to 
the same data, but instead obtaining cluster-robust standard errors with 
clustering on age, we obtain 


. * ml method lf1: Implement Poisson MLE with cluster-robust standard errors 
. ml model lf1 lf1pois (docvis = private chronic female income), vce(cluster age) 


. ml maximize, nolog 


                                                        Number of obs =  4,412 
                                                        Wald chi2(4)  = 657.72 
Log pseudolikelihood = -18503.549                       Prob > chi2   = 0.0000 

                                   (Std. err. adjusted for 40 clusters in age) 
------------------------------------------------------------------------------ 
             |               Robust 
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986654   .1496492     5.34   0.000     .5053583    1.091972 
     chronic |   1.091865   .0603102    18.10   0.000     .9736593    1.210071 
      female |   .4925481   .0686028     7.18   0.000     .3580891     .627007 
      income |    .003557   .0011792     3.02   0.003     .0012458    .0058683 
       _cons |  -.2297263   .1453959    -1.58   0.114     -.514697    .0552443 
------------------------------------------------------------------------------ 


The lf2 method evaluator program additionally provides an analytical 
expression for the matrix of second derivatives of the log density of the ith 
observation with respect to the index. This adds three lines of code, using the 
mlmatsum command, that are identical to those given in the d2 method 
evaluator example given in the next subsection. 
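A minimal sketch of such an lf2 evaluator follows, obtained by extending the lf1 program with the Hessian step. The program name lf2pois and the temporary name d11 are illustrative choices rather than names from the text, and the program has not been run against the data here.

```stata
* Sketch: lf2 evaluator for the Poisson MLE (names lf2pois and d11 are illustrative)
program lf2pois
    version 17
    args todo b lnfi gi H
    tempvar theta1                     // theta1 = x'b where x given in eq(1)
    mleval `theta1' = `b', eq(1)
    local y $ML_y1                     // dependent variable
    quietly replace `lnfi' = -exp(`theta1') + `y'*`theta1' - lnfactorial(`y')
    if (`todo'==0) exit
    quietly replace `gi' = `y' - exp(`theta1')   // observation-level score
    if (`todo'==1) exit
    tempname d11
    mlmatsum `lnfi' `d11' = exp(`theta1')        // sum of exp(x'b) x x'
    matrix `H' = -`d11'                          // Hessian is minus that sum
end
```

It would be called in the same way as the earlier programs, for example with ml model lf2 lf2pois (docvis = private chronic female income), vce(robust) followed by ml maximize.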


16.7.3 ml methods d0, d1, and d2 


A d0 method evaluator program for the Poisson MLE is the following: 


. * ml method d0: Program d0pois gives lnf(y) in terms of index x_i'b 
. program d0pois 
      version 17 
      args todo b lnf               // todo is not used, b=b, lnf=lnL 
      tempvar theta1                // theta1=x'b given in eq(1) 
      mleval `theta1' = `b', eq(1) 
      local y $ML_y1                // Define y so program more readable 
      mlsum `lnf' = -exp(`theta1') + `y'*`theta1' - lnfactorial(`y') 
  end 


The code is similar to that given earlier for the lf methods. The major 
difference is that the program calculates the log likelihood for the sample, 
rather than the log density for each observation. The mlsum command forms 
the objective function as the sum of log densities for each observation. 


Applying this program to the same data, we obtain 


. * ml method d0: Poisson MLE with default standard errors only possible 
. ml model d0 d0pois (docvis = private chronic female income) 


. ml maximize, nolog 


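                                                        Number of obs =  4,412 
                                                        Wald chi2(4)  = 8052.34 
Log likelihood = -18503.549                             Prob > chi2   = 0.0000 

------------------------------------------------------------------------------ 
      docvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986653    .027719    28.81   0.000     .7443371    .8529936 
     chronic |   1.091865   .0157985    69.11   0.000     1.060901     1.12283 
      female |   .4925481   .0160073    30.77   0.000     .4611744    .5239218 
      income |    .003557   .0002412    14.75   0.000     .0030844    .0040297 
       _cons |  -.2297263   .0287022    -8.00   0.000    -.2859815    -.173471 
------------------------------------------------------------------------------ 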


The resulting coefficient estimates are the same as those from the poisson 
command and those using the lf method given in section 16.5.3. 


Note that only default standard errors are reported. This is a consequence 
of the d0, d1, and d2 programs computing the derivative of the log 
likelihood, rather than the derivative of the log density for each observation. 
This does not provide enough information to compute the middle term in the 
formula for the robust estimate of the VCE that is given in section 13.4.5. 
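To see why, it helps to write the robust estimate in its usual sandwich form. The notation below is the standard one and is assumed here; it may differ in scaling and symbols from that of section 13.4.5.

```latex
\widehat{V}(\widehat{\boldsymbol{\theta}})
 = \widehat{\mathbf{A}}^{-1}\,\widehat{\mathbf{B}}\,\widehat{\mathbf{A}}^{-1},
\qquad
\widehat{\mathbf{A}} = \sum_i \left.\frac{\partial^2 q_i(\boldsymbol{\theta})}
   {\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right|_{\widehat{\boldsymbol{\theta}}},
\qquad
\widehat{\mathbf{B}} = \sum_i \mathbf{s}_i\mathbf{s}_i',
\quad
\mathbf{s}_i = \left.\frac{\partial q_i(\boldsymbol{\theta})}
   {\partial\boldsymbol{\theta}}\right|_{\widehat{\boldsymbol{\theta}}}
```

A d-type evaluator returns only the summed gradient $\sum_i \mathbf{s}_i$ and the Hessian, so the individual scores $\mathbf{s}_i$ needed for the middle term $\widehat{\mathbf{B}}$ are never available to ml.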


We next consider an example where analytical first and second 
derivatives of the log likelihood are provided. The d2 method evaluator 
program must provide analytical expressions for the gradient and the Hessian. 


. * ml method d2: Program d2pois adds analytical first and second derivatives 
. program d2pois 
      version 17 
      args todo b lnf g H           // Add g and H to the arguments list 
      tempvar theta1                // theta1 = x'b where x given in eq(1) 
      mleval `theta1' = `b', eq(1) 
      local y $ML_y1                // Define y so program more readable 
      mlsum `lnf' = -exp(`theta1') + `y'*`theta1' - lnfactorial(`y') 
      if (`todo'==0 | `lnf'>=.) exit     // d1 extra code from here 
      tempname d1 
      mlvecsum `lnf' `d1' = `y' - exp(`theta1') 
      matrix `g' = (`d1') 
      if (`todo'==1 | `lnf'>=.) exit     // d2 extra code from here 
      tempname d11 
      mlmatsum `lnf' `d11' = exp(`theta1') 
      matrix `H' = -`d11' 
  end 


The mlvecsum command forms the gradient row vector 
$\sum_i \{y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\}\mathbf{x}_i'$, where $\mathbf{x}_i$ are the first-equation regressors. The 
mlmatsum command forms minus the Hessian matrix, $\sum_i \exp(\mathbf{x}_i'\boldsymbol{\beta})\mathbf{x}_i\mathbf{x}_i'$, 
where $\mathbf{x}_i$ are the first-equation regressors. 
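These two expressions follow directly from differentiating the Poisson log likelihood. As a short worked derivation, using $\boldsymbol{\beta}$ for the parameters of the single index:

```latex
\begin{aligned}
Q(\boldsymbol{\beta}) &= \sum_i \left\{ -\exp(\mathbf{x}_i'\boldsymbol{\beta})
   + y_i\,\mathbf{x}_i'\boldsymbol{\beta} - \ln y_i! \right\} \\
\frac{\partial Q(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}'}
  &= \sum_i \left\{ y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta}) \right\}\mathbf{x}_i' \\
\frac{\partial^2 Q(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\,\partial \boldsymbol{\beta}'}
  &= -\sum_i \exp(\mathbf{x}_i'\boldsymbol{\beta})\,\mathbf{x}_i\mathbf{x}_i'
\end{aligned}
```

The first line is the expression given to mlsum, the term in braces in the second line is the expression given to mlvecsum, and $\exp(\mathbf{x}_i'\boldsymbol{\beta})$ in the third line is the expression given to mlmatsum, with the sign applied afterward when the matrix H is formed.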


We obtain 


. * ml method d2: Implement Poisson MLE with default standard errors only possible 
. ml model d2 d2pois (docvis = private chronic female income) 


. ml maximize, nolog 


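                                                        Number of obs =  4,412 
                                                        Wald chi2(4)  = 8052.34 
Log likelihood = -18503.549                             Prob > chi2   = 0.0000 

------------------------------------------------------------------------------ 
      docvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986654    .027719    28.81   0.000     .7443372    .8529936 
     chronic |   1.091865   .0157985    69.11   0.000     1.060901     1.12283 
      female |   .4925481   .0160073    30.77   0.000     .4611744    .5239218 
      income |    .003557   .0002412    14.75   0.000     .0030844    .0040297 
       _cons |  -.2297263   .0287022    -8.00   0.000    -.2859816   -.1734711 
------------------------------------------------------------------------------ 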


Again, only default standard errors can be obtained. 


With more than one index, it will be necessary to compute 
cross-derivatives such as `d12'. The mlmatbysum command is an extension that 
can be applied when the log likelihood for the jth observation involves a 
grouped sum, such as for panel data. See [R] ml for a two-index example, the 
Weibull MLE. 


16.7.4 ml method gf0 
The gf* methods, like the lf* methods, provide the log-likelihood 
contribution of each observation. The gf0 program is 

. * ml method gf0: Program gf0pois gives lnf(y) in terms of index x_i'b 
. program gf0pois 
      version 17 
      args todo b lnf               // todo is not used, b=b, lnf=lnL 
      tempvar theta1                // theta1=x'b given in eq(1) 
      mleval `theta1' = `b', eq(1) 
      local y $ML_y1                // Define y so program more readable 
      qui replace `lnf' = -exp(`theta1') + `y'*`theta1' - lnfactorial(`y') 
  end 


And we obtain 


. * ml method gf0: Implement Poisson MLE with heteroskedastic-robust standard 
> errors 
. ml model gf0 gf0pois (docvis = private chronic female income), vce(robust) 


. ml maximize, nolog 


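                                                        Number of obs =  4,412 
                                                        Wald chi2(4)  = 594.72 
Log pseudolikelihood = -18503.549                       Prob > chi2   = 0.0000 

------------------------------------------------------------------------------ 
             |               Robust 
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .7986653   .1090015     7.33   0.000     .5850263    1.012304 
     chronic |   1.091865   .0559951    19.50   0.000     .9821166    1.201614 
      female |   .4925481   .0585365     8.41   0.000     .3778187    .6072774 
      income |    .003557   .0010825     3.29   0.001     .0014354    .0056787 
       _cons |  -.2297263   .1108733    -2.07   0.038    -.4470339   -.0124186 
------------------------------------------------------------------------------ 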


16.8 Nonlinear instrumental-variables (GMM) example 


As an example of a nonlinear estimation problem that cannot be solved using 
the ml command, we consider GMM estimation, or more specifically 
nonlinear instrumental-variables (IV) estimation, of a Poisson model with 
endogenous regressors. 


The two-stage least-squares interpretation of linear IV does not extend to 
nonlinear models, so we cannot simply do Poisson regression with the 
endogenous regressor replaced by fitted values from a first-stage regression. 
There are several possible methods to control for endogeneity; see 
section 20.7. We consider use of the nonlinear IV estimator, introduced in 
section 13.3.9, where estimation used the gmm command. The model can also 
be fit using the ivpoisson gmm command; see section 20.7.3. 


Here we instead illustrate one of the more general optimization 
commands. The objective function is a quadratic form in sums rather than a 
simple sum, so it is not suited for the Stata ml command. So we use the Mata 
optimize() function. 


16.8.1 Nonlinear IV example 
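The Poisson regression model specifies that $E\{y - \exp(\mathbf{x}'\boldsymbol{\beta})|\mathbf{x}\} = 0$ 
because $E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})$. Suppose instead that $E\{y - \exp(\mathbf{x}'\boldsymbol{\beta})|\mathbf{x}\} \neq 0$ 
because of endogeneity of one or more regressors but that there are 
instruments $\mathbf{z}$ such that 


$$ E[\mathbf{z}_i\{y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\}] = \mathbf{0} $$


Then the nonlinear IV estimator, a GMM estimator, minimizes 


$$ Q(\boldsymbol{\beta}) = \mathbf{h}(\boldsymbol{\beta})'\mathbf{W}\mathbf{h}(\boldsymbol{\beta}) \tag{16.4} $$


where the $r \times 1$ vector $\mathbf{h}(\boldsymbol{\beta}) = \sum_i \mathbf{z}_i\{y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\}$. This is a special 
case of (13.5) with $\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}) = \mathbf{z}_i\{y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\}$. 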


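Define the $r \times K$ matrix $\mathbf{G}(\boldsymbol{\beta}) = -\sum_i \exp(\mathbf{x}_i'\boldsymbol{\beta})\mathbf{z}_i\mathbf{x}_i'$. Then the $K \times 1$ 
gradient vector 


$$ \mathbf{g}(\boldsymbol{\beta}) = \mathbf{G}(\boldsymbol{\beta})'\mathbf{W}\mathbf{h}(\boldsymbol{\beta}) \tag{16.5} $$


and the $K \times K$ expected Hessian is 

$$ \mathbf{H}(\boldsymbol{\beta}) = \mathbf{G}(\boldsymbol{\beta})'\mathbf{W}\mathbf{G}(\boldsymbol{\beta}) $$


where simplification has occurred by using $E\{\mathbf{h}(\boldsymbol{\beta})\} = \mathbf{0}$. 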
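As a worked step, here is how these expressions arise from differentiating the quadratic form, assuming $\mathbf{W}$ is symmetric and does not depend on $\boldsymbol{\beta}$:

```latex
\begin{aligned}
\frac{\partial \mathbf{h}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}'}
  &= -\sum_i \exp(\mathbf{x}_i'\boldsymbol{\beta})\,\mathbf{z}_i\mathbf{x}_i'
   = \mathbf{G}(\boldsymbol{\beta}) \\
\frac{\partial Q(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}
  &= 2\,\mathbf{G}(\boldsymbol{\beta})'\mathbf{W}\mathbf{h}(\boldsymbol{\beta}) \\
\frac{\partial^2 Q(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\,\partial \boldsymbol{\beta}'}
  &= 2\,\mathbf{G}(\boldsymbol{\beta})'\mathbf{W}\mathbf{G}(\boldsymbol{\beta})
   + 2\sum_{j=1}^{r} \{\mathbf{W}\mathbf{h}(\boldsymbol{\beta})\}_j\,
     \frac{\partial^2 h_j(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\,\partial \boldsymbol{\beta}'}
\end{aligned}
```

The second term of the Hessian is the one dropped on the grounds that $E\{\mathbf{h}(\boldsymbol{\beta})\} = \mathbf{0}$. The expressions for $\mathbf{g}(\boldsymbol{\beta})$ and $\mathbf{H}(\boldsymbol{\beta})$ in the text omit the common factor of 2; this leaves the first-order conditions and the Newton-Raphson step $\mathbf{H}^{-1}\mathbf{g}$ unchanged.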


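The estimate of the VCE is that in (13.6) with $\widehat{\mathbf{G}} = \mathbf{G}(\widehat{\boldsymbol{\beta}})$ and 
$\widehat{\mathbf{S}} = \sum_i \{y_i - \exp(\mathbf{x}_i'\widehat{\boldsymbol{\beta}})\}^2\mathbf{z}_i\mathbf{z}_i'$. 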


16.8.2 GMM using the Mata optimize() function 


The first-order conditions $\mathbf{g}(\boldsymbol{\beta}) = \mathbf{0}$, where $\mathbf{g}(\boldsymbol{\beta})$ is given in (16.5), have no 
closed-form solution for $\boldsymbol{\beta}$, so we need to use an iterative method. The ml command is 
not well suited to this optimization because $Q(\boldsymbol{\beta})$ given in (16.4) is a 
quadratic form. 


Instead, we use the Mata optimize() function. Necessary background 
information on Mata and optimize() is provided in appendixes B and C, 
and it is helpful to also see the Mata OLS example in section 3.9. 


We let $\mathbf{W} = (\sum_i \mathbf{z}_i\mathbf{z}_i')^{-1}$, as for linear two-stage least squares. The 
following Mata expressions form the desired quantities, where we express 
the parameter vector b and gradient vector g as row vectors because the 
optimize() function requires row vectors. We have 


: Xb = X*b'                        // b for optimize is 1 x k row vector 
: mu = exp(Xb) 
: h = Z'(y-mu)                     // h is r x 1 column vector 
: W = cholinv(Z'Z)                 // W is r x r matrix 
: G = -(mu:*Z)'X                   // G is r x k matrix 
: S = ((y-mu):*Z)'((y-mu):*Z)      // S is r x r matrix 
: Qb = h'W*h                       // Q(b) is scalar 
: g = (G'W*h)'                     // Gradient for optimize is a 1 x k row vector 
: H = G'W*G                        // Hessian for optimize is k x k matrix 
: V = luinv(G'W*G)*G'W*S*W*G*luinv(G'W*G) 


We fit a model for docvis, where private is endogenous and firmsize 
is used as an instrument, so the model is just identified. We use optimize() 
method d2, where the objective function is given as a scalar and both a 
gradient vector and Hessian matrix are provided. The 
optimize_result_V_robust(S) command does not apply to d evaluators, so 
we need to compute the robust estimate of the VCE after optimization. 


The structure of the Mata code is similar to that for the Poisson example 
explained in section C.2.3. We have 


. * optimize() method d2: Evaluator and implement GMM estimator for Poisson 
. mata 
------------------------------------------------- mata (type end to exit) ----- 
: mata clear 
: void pgmmd2(todo, b, y, X, Z, Qb, g, H) 
> { 
>   Xb = X*b' 
>   mu = exp(Xb) 
>   h = Z'(y-mu) 
>   W = cholinv(cross(Z,Z)) 
>   Qb = h'W*h 
>   if (todo == 0) return 
>   G = -(mu:*Z)'X 
>   g = (G'W*h)' 
>   if (todo == 1) return 
>   H = G'W*G 
>   _makesymmetric(H) 
> } 
: st_view(y=., ., "$y") 
: st_view(X=., ., tokens("$xlist")) 
: st_view(Z=., ., tokens("$zlist")) 
: S = optimize_init() 
: optimize_init_which(S,"min") 
: optimize_init_evaluator(S, &pgmmd2()) 
: optimize_init_evaluatortype(S, "d2") 
: optimize_init_argument(S, 1, y) 
: optimize_init_argument(S, 2, X) 
: optimize_init_argument(S, 3, Z) 
: optimize_init_params(S, J(1,cols(X),0)) 
: optimize_init_technique(S, "nr") 
: b = optimize(S) 
Iteration 0:  f(p) = 71995.212 
Iteration 1:  f(p) = 9259.0408 
Iteration 2:  f(p) = 1186.8103 
Iteration 3:  f(p) = 3.4395408 
Iteration 4:  f(p) = .00006905 
Iteration 5:  f(p) = 5.672e-14 
Iteration 6:  f(p) = 1.447e-26 
: // Compute robust estimate of VCE and SEs 
: Xb = X*b' 
: mu = exp(Xb) 
: h = Z'(y-mu) 
: W = cholinv(cross(Z,Z)) 
: G = -(mu:*Z)'X 
: n = rows(X) 
: k = cols(X) 
: Shat = ((y-mu):*Z)'((y-mu):*Z)*rows(n)/(n-k) 
: Vb = luinv(G'W*G)*G'W*Shat*W*G*luinv(G'W*G) 
: seb = (sqrt(diagonal(Vb)))' 
: b \ seb 
                  1              2              3              4              5 
  1     1.340291853    1.072907529     .477817773    .0027832801   -.6832461817 
  2     .0234843641    .0011488761    .0010399796    .0000330189    .0203299038 

: end 


The results are the same as those from the gmm command given in the 
section 13.3.10 example. 
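When adapting the listing, one detail is worth checking. Mata's rows() function applied to the scalar n returns 1, so the expression rows(n)/(n-k) scales the outer product by 1/(n-k) rather than by n/(n-k). If the intended degrees-of-freedom adjustment is n/(n-k), a minimal sketch of the post-optimization step is the following. It assumes that b, y, X, and Z are still defined in the Mata session, and it has not been run on the data here.

```stata
mata:
// Sketch: robust VCE for the nonlinear IV estimator with an n/(n-k) adjustment
Xb   = X*b'
mu   = exp(Xb)
W    = cholinv(cross(Z,Z))
G    = -(mu:*Z)'*X
n    = rows(X)
k    = cols(X)
Shat = ((y-mu):*Z)'*((y-mu):*Z)*n/(n-k)     // n, not rows(n)
Vb   = luinv(G'*W*G)*G'*W*Shat*W*G*luinv(G'*W*G)
seb  = (sqrt(diagonal(Vb)))'
b \ seb
end
```

Because Shat enters Vb linearly, the standard errors from this version are $\sqrt{n}$ times those obtained with the factor 1/(n-k), so this is the version to use when comparing standard errors with those reported by the gmm command.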


More generally, we could include additional instruments, which requires 
changing only the global macro zlist. The model becomes 
overidentified, and GMM estimates vary with choice of weighting matrix $\mathbf{W}$. 
The one-step GMM estimator is $\widehat{\boldsymbol{\beta}}$, given above. The two-step (or optimal) 
GMM estimator recalculates $\widehat{\boldsymbol{\beta}}$ by using the weighting matrix $\mathbf{W} = \widehat{\mathbf{S}}^{-1}$. This 
is illustrated in section 20.7.3 with the gmm command. 


The Mata code is easily adapted to other cases where 
$E\{y - m(\mathbf{x}'\boldsymbol{\beta})|\mathbf{z}\} = 0$ for the specified function $m(\cdot)$, so it can be used, for 
example, for logit and probit models. 
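For the logit case, only the lines that form the conditional mean and its derivative change. The evaluator below is a minimal sketch under that assumption; the function name lgmmd2 is hypothetical, and the function has not been run on the data here.

```stata
mata:
// Sketch: d2 evaluator for nonlinear IV with a logit conditional mean
// m(x'b) = invlogit(x'b), whose derivative is m*(1-m)
void lgmmd2(todo, b, y, X, Z, Qb, g, H)
{
    Xb = X*b'
    mu = invlogit(Xb)                  // replaces exp(Xb)
    h  = Z'*(y-mu)
    W  = cholinv(cross(Z,Z))
    Qb = h'*W*h
    if (todo == 0) return
    G  = -((mu:*(1:-mu)):*Z)'*X        // replaces -(mu:*Z)'X
    g  = (G'*W*h)'
    if (todo == 1) return
    H  = G'*W*G
    _makesymmetric(H)
}
end
```

The optimize() setup is unchanged apart from passing &lgmmd2() to optimize_init_evaluator(). For a probit mean, the same two lines would instead use normal(Xb) and normalden(Xb).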


16.9 Additional resources 


The key Stata references are [R] ml and [R] Maximize. Gould, Pitblado, and 
Poi (2010) provide a succinct yet quite comprehensive overview of the ml 
method and Mata-based likelihood evaluators using moptimize() functions. 
Drukker (2016) provides many details on programming in Stata. 
Baum (2016) includes the ml method and an example using the Mata 
optimize() function. For Mata, see appendixes B and C. 


Nonlinear optimization is covered in Cameron and Trivedi (2005, 
chap. 10), Greene (2018, app. E), Wooldridge (2010, chap. 12.7), and 
Gould, Pitblado, and Poi (2010, chap. 1). GMM is covered in Cameron and 
Trivedi (2005, chap. 5), Greene (2018, chap. 13), and Wooldridge (2010, 
chap. 14). 


16.10 Exercises 


1. Consider estimation of the logit model covered in section 10.5 and 
chapter 17. Then $Q(\boldsymbol{\beta}) = \sum_i \{y_i \ln \Lambda_i + (1 - y_i)\ln(1 - \Lambda_i)\}$, where 
$\Lambda_i = \Lambda(\mathbf{x}_i'\boldsymbol{\beta}) = \exp(\mathbf{x}_i'\boldsymbol{\beta})/\{1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})\}$. Show that $\mathbf{g}(\boldsymbol{\beta}) = \sum_i (y_i - \Lambda_i)\mathbf{x}_i$ 
and $\mathbf{H}(\boldsymbol{\beta}) = \sum_i -\Lambda_i(1 - \Lambda_i)\mathbf{x}_i\mathbf{x}_i'$. Hint: $\partial\Lambda(z)/\partial z = \Lambda(z)\{1 - \Lambda(z)\}$. Use 
the data on docvis to generate the binary variable d_dv for whether there are 
any doctor visits. Using just 2002 data, as in this chapter, use logit to 
perform logistic regression of the binary variable d_dv on private, chronic, 
female, income, and an intercept. Obtain robust estimates of the standard 
errors. You should find that the coefficient of private, for example, equals 
1.27266, with a robust standard error of 0.0896928. 


2. Adapt the code of section 16.2.3 to fit the logit model of exercise 1 using NR 
iterations coded in Mata. Hint: In defining an $n \times 1$ column vector with 
entries $\Lambda_i$, you may find it helpful to use the fact that J(n,1,1) creates an 
$n \times 1$ vector of 1s. 


3. Adapt the code of section 16.5.3 to fit the logit model of exercise 1 using the 
ml command method lf. 


4. Generate 100,000 observations from the following logit model DGP: 
$y_i = 1$ if $\beta_1 + \beta_2 x_i + u_i > 0$ and $y_i = 0$ otherwise; $(\beta_1, \beta_2) = (0, 1)$; $x_i \sim N(0, 1)$, 
where $u_i$ is logistically distributed. Using the inverse transformation method, 
you can compute a draw $u$ from the logistic distribution as 
$u = -\ln\{(1 - r)/r\}$, where $r$ is a draw from the uniform distribution. Use 
data from this DGP to check the consistency of your estimation method in 
exercise 3 or, more simply, of the logit command. 


5. Consider the NLS example in section 16.5.5 with an exponential conditional 
mean. Fit the model using the ml command and the lfnls program. Also, fit 
the model using the nl command, given in section 13.3.6. Verify that these 
two methods give the same parameter estimates but, as noted in the text, the 
robust standard errors differ. 


6. Continue the preceding exercise. Fit the model using the ml command and the 
lfnls program with default standard errors. These implicitly assume that the 
NLS model error has a variance of $\sigma^2 = 1$. Obtain an estimate of 
$s^2 = \{1/(N - K)\}\sum_i \{y_i - \exp(\mathbf{x}_i'\widehat{\boldsymbol{\beta}})\}^2$, using the predictnl postestimation 
command to obtain $\exp(\mathbf{x}_i'\widehat{\boldsymbol{\beta}})$. Then, obtain an estimate of the VCE by 
multiplying the stored result e(V) by $s^2$. Obtain the standard error of $\widehat{\boldsymbol{\beta}}$ 
and compare this with the standard error obtained when the NLS model is fit 
using the nl command with a default estimate of the VCE. 


7. Consider a Poisson regression of docvis on the regressors private, chronic, 
female, and income and the programs given in section 16.7. Run the ml 
model d0 d0pois command, and confirm that you get the same output as 
produced by the code in section 16.7.3. Confirm that the nonrobust standard 
errors are the same as those obtained using poisson with default standard 
errors. Run ml model d1 d1pois, and confirm that you get the same output as 
produced by the code in section 16.7.3. Run ml model d2 d2pois, and confirm 
that you get the same output as that given in section 16.7.3. 


8. Adapt the code of section 16.7.3 to fit the logit model of exercise 1 by using 
ml command method d0. 


9. Adapt the code of section 16.7.2 to fit the logit model of exercise 1 by using 
ml command method lf1 with robust standard errors reported. 


10. Adapt the code of section 16.7.3 to fit the logit model of exercise 1 by using 
ml command method d2. 


11. Consider the negative binomial example given in section 16.5.4. Fit this same 
model by using the ml command method d0. Hint: See the Weibull example 
in [R] ml. 


Chapter 17 
Binary outcome models 


17.1 Introduction 


Regression analysis of a qualitative binary or dichotomous variable is a 
commonplace problem in applied statistics. Models for mutually exclusive 
binary outcomes focus on the determinants of the probability p of the 
occurrence of one outcome rather than an alternative outcome that occurs 
with a probability of 1 — p. An example where the binary variable is of 
direct interest is modeling whether an individual has insurance. In 
regression analysis, we want to measure how the probability p varies across 
individuals as a function of regressors. A different type of example is 
predicting the propensity score p, the conditional probability of 
participation (rather than nonparticipation) of an individual in a treatment 
program. In the treatment-effects literature, this prediction given observable 
variables is an important intermediate step, even though ultimate interest 
lies in outcomes of that treatment. 


For regression with a continuous dependent variable, the standard 
model is the linear model. For binary outcome data, there are two standard 
parametric models—the logit model and the probit model. These specify 
different functional forms for p as a function of regressors, and the models 
are fit by maximum likelihood (ML). These models are nonlinear, making 
direct interpretation of parameters more difficult. The resulting marginal 
effects (MEs) and predicted probabilities, however, are in practice similar 
across the two models. The logit model is the most commonly used model 
for binary outcomes in applied statistics, while economists additionally use 
the probit model. The linear probability model (LPM), fit by ordinary least 
squares (OLS), is also used at times. 


The logit and probit models were introduced in chapter 10 as examples 
of nonlinear regression. This chapter repeats some material from chapter 10 
while also providing further detail on the estimation and interpretation of 
cross-sectional binary outcome models. Because of their nonlinear 
functional forms, focus in interpretation is on MEs of regressors rather than 
their coefficients. Extensions of the standard model include endogenous 
regressors and clustered and grouped data. 


17.2 Some parametric models 


Different binary outcome models have a common structure. The dependent 
variable, $y_i$, takes only two values, so its distribution is unambiguously the 
Bernoulli, or binomial with one trial, with a probability of $p_i$. Logit and 
probit models correspond to different regression models for $p_i$. 


17.2.1 Basic model 


Suppose the outcome variable, $y$, takes one of two values: 


$$ y = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases} \tag{17.1} $$


Given our interest in modeling $p$ as a function of regressors $\mathbf{x}$, there is no 
loss of generality in setting the outcome values to 1 and 0. The probability 
mass function for the observed outcome, $y$, is $p^y(1 - p)^{1-y}$, with $E(y) = p$ 
and $\mathrm{Var}(y) = p(1 - p)$. 


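A regression model is formed by parameterizing $p$ to depend on an index 
function $\mathbf{x}'\boldsymbol{\beta}$, where $\mathbf{x}$ is a $K \times 1$ regressor vector and $\boldsymbol{\beta}$ is a vector of 
unknown parameters. In standard binary outcome models, the conditional 
probability has the form 


$$ p_i \equiv \Pr(y_i = 1|\mathbf{x}_i) = F(\mathbf{x}_i'\boldsymbol{\beta}) $$


where $F(\cdot)$ is a specified parametric function of $\mathbf{x}'\boldsymbol{\beta}$, usually a cumulative 
distribution function (c.d.f.) on $(-\infty, \infty)$ because this ensures that the 
bounds $0 < p < 1$ are satisfied. Note that in this model, 


$$ E(y_i|\mathbf{x}_i) = F(\mathbf{x}_i'\boldsymbol{\beta}) \tag{17.2} $$
$$ \mathrm{Var}(y_i|\mathbf{x}_i) = F(\mathbf{x}_i'\boldsymbol{\beta})\{1 - F(\mathbf{x}_i'\boldsymbol{\beta})\} $$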


17.2.2 Logit, probit, linear probability, and complementary log-log 
models 


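Models differ in the choice of function, $F(\cdot)$. Four commonly used 
functional forms for $F(\mathbf{x}'\boldsymbol{\beta})$, shown in table 17.1, are the logit, probit, linear 
probability, and complementary log-log forms. 


Table 17.1. Four commonly used binary outcome models 


Model                   Probability $p = \Pr(y = 1|\mathbf{x})$                                                         ME $\partial p/\partial x_j$ 
Logit                   $\Lambda(\mathbf{x}'\boldsymbol{\beta}) = e^{\mathbf{x}'\boldsymbol{\beta}}/(1 + e^{\mathbf{x}'\boldsymbol{\beta}})$         $\Lambda(\mathbf{x}'\boldsymbol{\beta})\{1 - \Lambda(\mathbf{x}'\boldsymbol{\beta})\}\beta_j$ 
Probit                  $\Phi(\mathbf{x}'\boldsymbol{\beta}) = \int_{-\infty}^{\mathbf{x}'\boldsymbol{\beta}} \phi(z)\,dz$               $\phi(\mathbf{x}'\boldsymbol{\beta})\beta_j$ 
Complementary log-log   $C(\mathbf{x}'\boldsymbol{\beta}) = 1 - \exp\{-\exp(\mathbf{x}'\boldsymbol{\beta})\}$                      $\exp\{-\exp(\mathbf{x}'\boldsymbol{\beta})\}\exp(\mathbf{x}'\boldsymbol{\beta})\beta_j$ 
Linear probability      $F(\mathbf{x}'\boldsymbol{\beta}) = \mathbf{x}'\boldsymbol{\beta}$                                              $\beta_j$ 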


The logit model specifies that $F(\cdot) = \Lambda(\cdot)$, the c.d.f. of the logistic 
distribution. The probit model specifies that $F(\cdot) = \Phi(\cdot)$, the standard 
normal c.d.f. Logit and probit functions are symmetric around zero and are 
widely used in microeconometrics. The complementary log-log model is 
asymmetric around zero. Its use is sometimes recommended when the 
distribution of $y$ is skewed such that there is a high proportion of either zeros 
or ones in the dataset. The LPM corresponds to linear regression and does not 
impose the restriction that $0 < p < 1$. 


The last column in the table gives expressions for the corresponding MEs, 
used in section 17.6, where $\phi(\cdot)$ denotes the standard normal density. 
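The ME expressions all come from one application of the chain rule to $p = F(\mathbf{x}'\boldsymbol{\beta})$. As a short worked derivation, with $z = \mathbf{x}'\boldsymbol{\beta}$:

```latex
\begin{aligned}
\frac{\partial p}{\partial x_j} &= F'(\mathbf{x}'\boldsymbol{\beta})\,\beta_j \\
\text{Logit:}\quad \frac{d\Lambda(z)}{dz}
  &= \frac{e^{z}}{(1+e^{z})^{2}} = \Lambda(z)\{1-\Lambda(z)\} \\
\text{Probit:}\quad \frac{d\Phi(z)}{dz} &= \phi(z) \\
\text{Complementary log-log:}\quad \frac{d}{dz}\left[1-\exp\{-\exp(z)\}\right]
  &= \exp\{-\exp(z)\}\exp(z) \\
\text{Linear probability:}\quad \frac{dz}{dz} &= 1
\end{aligned}
```

In each nonlinear model the derivative $F'(\cdot)$ is positive, so the sign of the ME is the sign of $\beta_j$, while its magnitude varies with the evaluation point $\mathbf{x}$.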


17.2.3 Latent-variable interpretation and identification 


Binary outcome models can be given a latent-variable interpretation, though 
this is not necessary. The advantage of doing so is that it provides a link with 
the linear regression model, explains more deeply the difference between 
logit and probit models, and provides the basis for extension to some 
multinomial models given in chapter 18. 


We distinguish between the observed binary outcome, $y$, and an 
underlying continuous unobservable (or latent) variable, $y^*$, that satisfies the 
single-index model 


$$ y^* = \mathbf{x}'\boldsymbol{\beta} + u \tag{17.3} $$


Although $y^*$ is not observed, we do observe 


$$ y = \begin{cases} 1 & \text{if } y^* > 0 \\ 0 & \text{if } y^* \le 0 \end{cases} \tag{17.4} $$


where the zero threshold is a normalization that is of no consequence if $\mathbf{x}$ 
includes an intercept. 


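Given the latent-variable models (17.3) and (17.4), we have 


$$ \begin{aligned} \Pr(y = 1) &= \Pr(\mathbf{x}'\boldsymbol{\beta} + u > 0) \\ &= \Pr(u > -\mathbf{x}'\boldsymbol{\beta}) \\ &= 1 - \Pr(u < -\mathbf{x}'\boldsymbol{\beta}) \\ &= \Pr(u < \mathbf{x}'\boldsymbol{\beta}) \\ &= F(\mathbf{x}'\boldsymbol{\beta}) \end{aligned} $$


where the second to last equality assumes that $u$ is symmetrically 
distributed about zero and $F(\cdot)$ is the c.d.f. of $u$. This yields the probit model 
if $u$ is standard normally distributed and the logit model if $u$ is logistically 
distributed. 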


Identification of the latent-variable model requires that we fix its scale 
by placing a restriction on the variance of $u$, because the single-index model 
can identify $\boldsymbol{\beta}$ only up to scale. An explanation for this is that we observe 
only whether $y^* = \mathbf{x}'\boldsymbol{\beta} + u > 0$. But this is not distinguishable from the 
outcome $\mathbf{x}'\boldsymbol{\beta}^{+} + u^{+} > 0$, where $\boldsymbol{\beta}^{+} = a\boldsymbol{\beta}$ and $u^{+} = au$ for any $a > 0$. We 
can identify only $\boldsymbol{\beta}/\sigma$, where $\sigma$ is the standard deviation (scale parameter) of 
$u$. 


To uniquely define the scale of $\boldsymbol{\beta}$, we set $\sigma = 1$ in the probit model and 
$\sigma = \pi/\sqrt{3}$ in the logit model. Thus, $\boldsymbol{\beta}$ is scaled differently in the two models; see 
section 17.4.4. 


17.3 Estimation 


For parametric models with exogenous covariates, the maximum likelihood 
estimator (MLE) is the natural estimator because the density is 
unambiguously the Bernoulli. Stata provides ML procedures for logit, probit, 
and complementary log-log models and for several variants of these 
models. For models with endogenous covariates, instrumental-variables (IV) 
methods can instead be used; see section 17.9. 


17.3.1 ML estimation 


For binary models other than the LPM, estimation is by ML. This ML 
estimation is straightforward. The density for a single observation can be 
compactly written as $p_i^{y_i}(1 - p_i)^{1-y_i}$, where $p_i = F(\mathbf{x}_i'\boldsymbol{\beta})$. For a sample of 
$N$ independent observations, the MLE, $\widehat{\boldsymbol{\beta}}$, maximizes the associated 
log-likelihood function 


$$ Q(\boldsymbol{\beta}) = \sum_{i=1}^{N} \left[ y_i \ln F(\mathbf{x}_i'\boldsymbol{\beta}) + (1 - y_i)\ln\{1 - F(\mathbf{x}_i'\boldsymbol{\beta})\} \right] \tag{17.5} $$


The MLE is obtained by iterative methods and is asymptotically normally 
distributed. 
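Iteration is needed because the first-order conditions are nonlinear in $\boldsymbol{\beta}$. As a worked step, differentiating the log-likelihood function with $F_i = F(\mathbf{x}_i'\boldsymbol{\beta})$ and $F_i' = F'(\mathbf{x}_i'\boldsymbol{\beta})$ gives

```latex
\frac{\partial Q(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}
 = \sum_{i=1}^{N} \left\{ \frac{y_i}{F_i} - \frac{1-y_i}{1-F_i} \right\} F_i'\,\mathbf{x}_i
 = \sum_{i=1}^{N} \frac{y_i - F_i}{F_i(1-F_i)}\,F_i'\,\mathbf{x}_i = \mathbf{0}
```

For the logit model, $F_i' = \Lambda_i(1 - \Lambda_i)$, so the weights cancel and the conditions reduce to $\sum_i (y_i - \Lambda_i)\mathbf{x}_i = \mathbf{0}$.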


Consistent estimates of $\boldsymbol{\beta}$ are obtained if $F(\mathbf{x}'\boldsymbol{\beta})$ is correctly specified. 
When instead it is misspecified, because the functional form $F(\cdot)$ is 
misspecified or because $\mathbf{x}'\boldsymbol{\beta}$ is misspecified, quasi-ML theory applies. Thus, 
we use robust standard errors that nonetheless provide consistent estimates 
of the variance of $\widehat{\boldsymbol{\beta}}$. 


17.3.2 The logit and probit commands 


The syntax for the logit command is 


logit depvar [indepvars] [if] [in] [weight] [, options] 

The syntaxes for the probit and cloglog commands are similar. 


Like the regress command, available options include vce(cluster 
clustvar) and vce(robust) for variance estimation. The constant is 
included by default but can be suppressed by using the noconstant option. 


The or option of logit presents exponentiated coefficients. The 
rationale is that for the logit model, the log of the odds ratio $\ln\{p/(1 - p)\}$ 
can be shown to be linear in $\mathbf{x}$ and $\boldsymbol{\beta}$. It follows that the odds ratio 
$p/(1 - p) = \exp(\mathbf{x}'\boldsymbol{\beta})$, so that $e^{\beta_j}$ measures the multiplicative effect of a 
unit change in regressor $x_j$ on the odds ratio. Thus, many researchers prefer 
logit coefficients to be reported after exponentiation, that is, as $e^{\widehat{\beta}}$ rather 
than $\widehat{\beta}$. Alternatively, the logistic command estimates the parameters of 
the logit model and directly reports the exponentiated coefficients. 


17.3.3 Robust estimate of the variance-covariance matrix of the
estimator


Binary outcome models are unusual in that in theory there is no advantage
in using the robust sandwich form for the variance-covariance matrix of the
estimator (VCE) of the MLE if data are independent over $i$ and $F(\mathbf{x}'\boldsymbol{\beta})$ is
correctly specified. The reason is that the ML default standard errors are
obtained by imposing the restriction $\mathrm{Var}(y|\mathbf{x}) = F(\mathbf{x}'\boldsymbol{\beta})\{1 - F(\mathbf{x}'\boldsymbol{\beta})\}$,
and this must necessarily hold because the variance of a binary variable is
always $p(1 - p)$; see Cameron and Trivedi (2005, 469) for further
explanation. If $F(\mathbf{x}'\boldsymbol{\beta})$ is correctly specified, the vce(robust) option is not
required.


In practice, $F(\mathbf{x}'\boldsymbol{\beta})$ is likely to be misspecified because the functional
form $F(\cdot)$ is misspecified or because $\mathbf{x}'\boldsymbol{\beta}$ is misspecified. Then quasi-ML
theory applies (see section 13.3.1), and we need to use robust standard
errors that nonetheless provide consistent estimates of the variance of $\widehat{\boldsymbol{\beta}}$. So
we generally obtain robust standard errors.


Note also that for binary outcomes in the special case of independent
observations we may infer that the functional form $F(\mathbf{x}'\boldsymbol{\beta})$ is misspecified
if the vce(robust) option produces substantially different variances from
the default.


When observations are dependent because of clustering, the appropriate
option is to use vce(cluster clustvar), even if $F(\mathbf{x}'\boldsymbol{\beta})$ is correctly
specified.


17.3.4 OLS estimation of linear probability model 


If $F(\cdot)$ is assumed to be linear, that is, $p = \mathbf{x}'\boldsymbol{\beta}$, then the linear conditional
mean function defines the LPM. The LPM can be consistently fit by OLS
regression of $y$ on $\mathbf{x}$ using regress. A major limitation of the method,
however, is that the fitted values $\mathbf{x}'\widehat{\boldsymbol{\beta}}$ will not necessarily be in the [0, 1]
interval. And because $\mathrm{Var}(y|\mathbf{x}) = (\mathbf{x}'\boldsymbol{\beta})(1 - \mathbf{x}'\boldsymbol{\beta})$ for the LPM, the
regression is inherently heteroskedastic, so a robust estimate of the VCE
should be used.
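

As a minimal sketch of this approach, the following do-file fragment fits the LPM with robust standard errors and counts fitted values outside the unit interval. It assumes the mus217hrs dataset and the regressor list introduced in section 17.4 are available in the working directory; the variable name plpm is ours and purely illustrative.

```stata
* LPM by OLS with heteroskedasticity-robust standard errors
use mus217hrs, clear
global xlist age hstatusg hhincome educyear married hisp
regress ins retire $xlist, vce(robust)

* Fitted values x'b are the LPM's predicted probabilities
predict plpm, xb

* Count fitted values that fall outside the [0, 1] interval
count if (plpm < 0 | plpm > 1) & !missing(plpm)
```

A nonzero count signals observations for which the LPM prediction cannot be interpreted as a probability.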


17.3.5 Estimation with proportions or fractional response data 


Proportions data or fractional response data are continuous data that lie 
between zero and one, such as county-level data on the proportion of people 
with private insurance or on the fraction of income spent on food. 


Because $y$ lies between zero and one, its conditional mean should lie
between zero and one. A natural model is then $E(y_i|\mathbf{x}_i) = F(\mathbf{x}_i'\boldsymbol{\beta})$, where
$F(\cdot)$ is the function corresponding to a binary outcome model such as the
logit or the probit model. This model can be fit by nonlinear least squares
(NLS); an nl command for probit is given in section 10.6.2. Inference should
be based on robust standard errors because the error in the NLS model is
unlikely to be homoskedastic.


Efficiency gains may be possible by additionally modeling the
heteroskedasticity. A natural starting point is
$\mathrm{Var}(y_i|\mathbf{x}_i) = F(\mathbf{x}_i'\boldsymbol{\beta}) \times \{1 - F(\mathbf{x}_i'\boldsymbol{\beta})\}$, as in (17.2). One might use
feasible generalized NLS. Simpler is to use the fracreg command, which
uses the same objective function (17.5) as that for a binary outcome MLE. In
either case, robust standard errors need to be used to guard against
misspecification of the model for heteroskedasticity.
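

The following sketch illustrates the fracreg syntax on simulated data. The data-generating process, the seed, and the variable names share and x1 are all hypothetical and serve only to make the example runnable; they are not taken from the text, and the simulated conditional mean is not exactly of logit form.

```stata
* Simulated fractional outcome (hypothetical data, for syntax illustration only)
clear
set seed 10101
set obs 1000
generate x1 = rnormal()
generate share = invlogit(0.5*x1 + rnormal())   // proportion strictly in (0,1)

* Fractional logit: quasi-ML using the Bernoulli objective function (17.5)
fracreg logit share x1, vce(robust)

* Average marginal effect of x1 on the conditional mean of share
margins, dydx(x1)
```

Replacing logit with probit in the fracreg command gives the fractional probit counterpart.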


Further discussion and examples are given in section 17.10. 


17.4 Example 


We analyze data on supplementary health insurance coverage. Initial 
analysis estimates the parameters of the models of section 17.2. 


17.4.1 Data description 


The data come from wave 5 (2002) of the Health and Retirement Study 
(HRS), a panel survey sponsored by the National Institute of Aging. The 
sample is restricted to Medicare beneficiaries. The HRS contains information 
on a variety of medical service uses. The elderly can obtain supplementary 
insurance coverage either by purchasing it themselves or by joining 
employer-sponsored plans. We use the data to analyze the purchase of 
private insurance (ins) from any source, including private markets or 
associations. The insurance coverage broadly measures both individually 
purchased and employer-sponsored private supplementary insurance and 
includes Medigap plans and other policies. 


The explanatory variables include health status, socioeconomic
characteristics, and spouse-related information. Self-assessed health-status
information is used to generate a dummy variable (hstatusg) that measures
whether health status is good, very good, or excellent. Other measures of
health status are the number of limitations (up to five) on activities of daily
living (adl) and the total number of chronic conditions (chronic).
Socioeconomic variables used are age, gender, race, ethnicity, marital status,
years of education, and retirement status (respectively, age, female, white,
hisp, married, educyear, retire); household income (hhincome); and log
household income if positive (linc). Spouse retirement status (sretire) is
an indicator variable equal to 1 if a retired spouse is present.


For conciseness, we use global macros to create variable lists, presenting
the variables used in sections 17.4-17.6 followed by additional variables
used in section 17.9. We have


. * Read in data, define globals, and summarize key variables 
. qui use mus217hrs 


. global xlist age hstatusg hhincome educyear married hisp 
. global extralist female white chronic adl sretire 


. summarize ins retire $xlist $extralist

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         ins |      3,206    .3870867    .4871597          0          1
      retire |      3,206    .6247661    .4842588          0          1
         age |      3,206    66.91391    3.675794         52         86
    hstatusg |      3,206    .7046163    .4562862          0          1
    hhincome |      3,206    45.26391    64.33936          0   1312.124
    educyear |      3,206    11.89863    3.304611          0         17
     married |      3,206    .7330006     .442461          0          1
        hisp |      3,206    .0726762    .2596448          0          1
      female |      3,206     .477854    .4995872          0          1
       white |      3,206    .8206488     .383706          0          1
     chronic |      3,206    2.063319    1.416434          0          8
         adl |      3,206     .301622    .8253646          0          5
     sretire |      3,206    .3883344    .4874473          0          1


17.4.2 Logit regression 


We begin with ML estimation of the logit model, with heteroskedastic-robust
standard errors computed.


. * Logit regression
. logit ins retire $xlist, vce(robust)

Iteration 0:  log pseudolikelihood = -2139.7712
Iteration 1:  log pseudolikelihood = -1996.7434
Iteration 2:  log pseudolikelihood = -1994.8864
Iteration 3:  log pseudolikelihood = -1994.8784
Iteration 4:  log pseudolikelihood = -1994.8784

Logistic regression                                     Number of obs =  3,206
                                                        Wald chi2(7)  = 256.49
                                                        Prob > chi2   = 0.0000
Log pseudolikelihood = -1994.8784                       Pseudo R2     = 0.0677

------------------------------------------------------------------------------
             |               Robust
         ins | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      retire |   .1969297   .0849524     2.32   0.020     .0304261    .3634332
         age |  -.0145955   .0110189    -1.32   0.185    -.0361921     .007001
    hstatusg |   .3122654   .0918378     3.40   0.001     .1322667    .4922641
    hhincome |   .0023036   .0011485     2.01   0.045     .0000526    .0045546
    educyear |   .1142626   .0143616     7.96   0.000     .0861143    .1424108
     married |    .578636   .0941579     6.15   0.000       .39409    .7631821
        hisp |  -.8103059   .1936793    -4.18   0.000     -1.18991   -.4307016
       _cons |  -1.715578   .7279873    -2.36   0.018    -3.142407   -.2887495
------------------------------------------------------------------------------


All regressors other than age are statistically significantly different from 
zero at the 0.05 level. For the logit model, the sign of the coefficient is also 
the sign of the ME. Further discussion of these results is deferred to the next 
section, where we compare logit parameter estimates with those from other 
models. 


The iteration log shows fast convergence in four iterations. Later output 
suppresses the iteration log to save space. In actual empirical work, it is best 
to keep the log. For example, many iterations may signal a high degree of 
multicollinearity. 


17.4.3 Coefficient interpretation 


The logit (and probit) models are single-index models (see section 13.7.3),
so if one coefficient is twice as large as another, then the effect of a unit
change in the corresponding regressor is also twice as large.
Furthermore, the sign of the coefficient gives the sign of the effect because
$F'(\cdot) > 0$.


Thus, one more year of education is associated with an increase in the 
probability of having private insurance. And being married also has a 
positive effect that is roughly equivalent to the effect of 5 more years of 
schooling because 0.5786/0.1143 = 5.06. 


MEs are presented in section 17.6. A rough rule of thumb for the logit
model is that the average marginal effect (AME) of a change in the $j$th
regressor equals $\bar{y}(1 - \bar{y})\widehat{\beta}_j$. So one more year of education is associated
with an increase of $0.387 \times 0.613 \times 0.114 = 0.027$ in the probability of having
insurance. For the probit model, a similar approximation is that the ME is at
most $0.4 \times \widehat{\beta}_j$.
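

The rule of thumb can be checked against the AME computed by margins. The following is a sketch under the assumption that mus217hrs is in the working directory; it uses only the variables and the logit specification already introduced in this section.

```stata
* Compare the logit rule-of-thumb AME with the AME from margins
use mus217hrs, clear
global xlist age hstatusg hhincome educyear married hisp
quietly logit ins retire $xlist, vce(robust)

* Rule of thumb: ybar*(1-ybar)*b_j, using the sample mean of ins
quietly summarize ins
display "Rule-of-thumb AME of educyear: " r(mean)*(1-r(mean))*_b[educyear]

* AME obtained by averaging the marginal effect over the sample
margins, dydx(educyear)
```

The rule of thumb is only an approximation, so the two numbers need not agree exactly.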


Finally, from section 17.3.2, the logit estimates imply that one more year
of education is associated with an $e^{0.114}$ times higher odds ratio (of having
private insurance versus not having private insurance).
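

As a brief illustration, the exponentiated coefficient can be computed directly or obtained by replaying the logit results with the or option. With the estimate of 0.114 reported above, the factor is roughly 1.12. The sketch assumes mus217hrs is in the working directory.

```stata
* Odds-ratio interpretation of the educyear coefficient
use mus217hrs, clear
global xlist age hstatusg hhincome educyear married hisp
quietly logit ins retire $xlist, vce(robust)

* Multiplicative effect of one more year of education on the odds
display exp(_b[educyear])

* Replay the fitted model with exponentiated coefficients
logit, or
```

The logistic command with the same variable list and vce(robust) reports the same exponentiated coefficients.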


17.4.4 Comparison of binary models and parameter estimates 


It is well known that logit and probit models have similar shapes for central
values of $F(\cdot)$ but differ in the tails as $F(\cdot)$ approaches 0 or 1. At the same
time, the corresponding coefficient estimates from the two models are scaled
quite differently. It is an elementary mistake to suppose that the different
models have different implications simply because the estimated coefficients
across models are different. This difference is mainly due to the
different functional forms for the probabilities. The MEs and predicted
probabilities, presented in sections 17.5 and 17.6, are much more similar
across models.


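Coefficients can be compared across models, using the following rough
conversion factors (Amemiya 1981, 1488):


$$\widehat{\boldsymbol{\beta}}_{\text{Logit}} \simeq 4\,\widehat{\boldsymbol{\beta}}_{\text{OLS}}$$

$$\widehat{\boldsymbol{\beta}}_{\text{Probit}} \simeq 2.5\,\widehat{\boldsymbol{\beta}}_{\text{OLS}}$$

$$\widehat{\boldsymbol{\beta}}_{\text{Logit}} \simeq 1.6\,\widehat{\boldsymbol{\beta}}_{\text{Probit}}$$


The motivation is that it is better to compare the ME, $\partial p/\partial x_j$, across
models, and it can be shown that $\partial p/\partial x_j \le 0.25\beta_j$ for logit,
$\partial p/\partial x_j \le 0.4\beta_j$ for probit, and $\partial p/\partial x_j = \beta_j$ for OLS. The greatest
departures across the models occur in the tails.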


We estimate the parameters of the logit and probit models by ML and the
LPM by OLS, computing standard errors and z statistics based on both default
and robust estimates of the VCE. The following code saves results for each
model with the estimates store command.


. * Comparison of estimates for logit, probit and LPM models
. qui logit ins retire $xlist
. estimates store blogit
. qui probit ins retire $xlist
. estimates store bprobit
. qui regress ins retire $xlist
. estimates store bols
. qui logit ins retire $xlist, vce(robust)
. estimates store blogitr
. qui probit ins retire $xlist, vce(robust)
. estimates store bprobitr
. qui regress ins retire $xlist, vce(robust)
. estimates store bolsr


This leads to the following output table of parameter estimates across the 
models: 


. * Table for comparing models
. estimates table blogit blogitr bprobit bprobitr bols bolsr,
> t stats(N ll) b(%7.3f) stfmt(%8.2f) eq(1)

--------------------------------------------------------------------------------
    Variable |   blogit    blogitr    bprobit   bprobitr     bols      bolsr
-------------+------------------------------------------------------------------
      retire |    0.197      0.197      0.118      0.118     0.041      0.041
             |     2.34       2.32       2.31       2.30      2.24       2.24
         age |   -0.015     -0.015     -0.009     -0.009    -0.003     -0.003
             |    -1.29      -1.32      -1.29      -1.32     -1.20      -1.25
    hstatusg |    0.312      0.312      0.198      0.198     0.066      0.066
             |     3.41       3.40       3.56       3.57      3.37       3.45
    hhincome |    0.002      0.002      0.001      0.001     0.000      0.000
             |     3.02       2.01       3.19       2.21      3.58       2.63
    educyear |    0.114      0.114      0.071      0.071     0.023      0.023
             |     8.05       7.96       8.34       8.33      8.15       8.63
     married |    0.579      0.579      0.362      0.362     0.123      0.123
             |     6.20       6.15       6.47       6.46      6.38       6.62
        hisp |   -0.810     -0.810     -0.473     -0.473    -0.121     -0.121
             |    -4.14      -4.18      -4.28      -4.36     -3.59      -4.49
       _cons |   -1.716     -1.716     -1.069     -1.069     0.127      0.127
             |    -2.29      -2.36      -2.33      -2.40      0.79       0.83
-------------+------------------------------------------------------------------
           N |     3206       3206       3206       3206      3206       3206
          ll | -1994.88   -1994.88   -1993.62   -1993.62  -2104.75   -2104.75
--------------------------------------------------------------------------------
                                                                    Legend: b/t


The coefficients across the models tell a qualitatively similar story about the
impact of a regressor on Pr(ins = 1). The rough rules for parameter
conversion also stand up reasonably well because the logit estimates are
roughly five times the OLS estimates and the probit estimates are roughly
three times the OLS coefficients. The standard errors are similarly rescaled, so
that the reported z statistics for the coefficients are similar across the three
models. For the logit and probit coefficients, the robust and default z
statistics are quite similar, aside from those for the hhincome variable. For
OLS, which does not allow for the intrinsic heteroskedasticity of binary data,
there is a bigger difference.


In section 17.5, we will see that the fitted probabilities are similar for the 
logit and probit specifications. The linear functional form does not constrain 
the fitted values to the [0, 1] interval, however, and we find differences in the 
fitted-tail values between the LPM and the logit and probit models. 


17.4.5 Wald tests 


Tests on coefficients of variables are most easily performed by using the 
test command, which implements a Wald test. For example, we may test for 
the presence of interaction effects with age, with all regressors aside from 
retire interacted with age. The null hypothesis is that the coefficients of the 
additional regressors are all zero because then there are no interaction 
effects. 


To avoid the need to create new variables, we use the factor-variable 
binary operator # to define the interaction variables. To perform the 
subsequent Wald test, we use the testparm command rather than test. We 
obtain 


. * Wald test for no interactions 
. global intlist c.age#c.age c.age#i.hstatusg c.age#c.hhincome
> c.age#c.educyear c.age#i.married c.age#i.hisp


. qui logit ins retire $xlist $intlist, vce(robust) nolog 


. testparm $intlist 


( 1) [ins]c.age#c.age = 0 
( 2) [ins]1.hstatusg#c.age = 0 
( 3) [ins]c.age#c.hhincome = 0 
( 4) [ins]c.age#c.educyear = 0 
( 5) [ins]1.married#c.age = 0 
( 6) [ins]1.hisp#c.age = 0 
chi2( 6) = 17.16 
Prob > chi2 = 0.0087 


The p-value is 0.0087, so the null hypothesis is rejected at the 0.01 level.


17.4.6 Likelihood-ratio tests


A likelihood-ratio (LR) test (see section 11.4) provides an alternative method 
for testing hypotheses, provided default standard errors are used. It is 
asymptotically equivalent to the Wald test based on default standard errors if 
the model is correctly specified. 


To implement the LR test of the preceding hypothesis, we estimate
parameters of both the general and the restricted models with default
estimates of the VCE and then use the lrtest command. We obtain


. * LR test for no interactions
. qui logit ins retire $xlist $intlist


. estimates store B 


. qui logit ins retire $xlist 


. lrtest B 

Likelihood-ratio test 
Assumption: . nested within B 
LR chi2(6) = 16.10 

Prob > chi2 = 0.0132 


This test has a p-value of 0.0132, so the null hypothesis is rejected at the 
0.05 level but not at the 0.01 level. Wald and LR tests do not yield identical 
results. 
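

The LR statistic can also be computed by hand from the two stored log likelihoods, which makes the construction of the test explicit. The following is a sketch, assuming mus217hrs is in the working directory; the scalar names llu, llr, and lr are ours.

```stata
* LR test computed manually as 2*(lnL_unrestricted - lnL_restricted)
use mus217hrs, clear
global xlist age hstatusg hhincome educyear married hisp
global intlist c.age#c.age c.age#i.hstatusg c.age#c.hhincome ///
    c.age#c.educyear c.age#i.married c.age#i.hisp

quietly logit ins retire $xlist $intlist
scalar llu = e(ll)                 // log likelihood, model with interactions
quietly logit ins retire $xlist
scalar llr = e(ll)                 // log likelihood, model without interactions

* Six restrictions, so the statistic is compared with chi-squared(6)
scalar lr = 2*(llu - llr)
display "LR statistic = " lr "   p-value = " chi2tail(6, lr)
```

The result should agree with the lrtest output reported above.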


In some situations, the main focus is on the predicted probability of the 
model, and the sign and size of the coefficients are not the focus of the 
inquiry. An example is the estimation of propensity scores, in which case a 
recommendation is often made to saturate the model and then to choose the 
best model by using the Bayesian information criterion (BIC). 


17.4.7 Model comparison 


A question often arises: which model is better, logit or probit? As will be
seen in the next section, in many cases, the fitted probability is very similar
over a large part of the range of $\mathbf{x}'\boldsymbol{\beta}$. Larger differences may be evident in
the tails of the distribution, but a large sample is required to reliably
differentiate between models on the basis of tail behavior.


Because logit and probit models are nonnested, a penalized likelihood
criterion such as Akaike's information criterion or BIC (see section 13.8.2) is
appealing for model selection. However, these two models have the same
number of parameters, so this reduces to choosing the model with the higher
log likelihood. The probit model has a log likelihood of -1,993.62 (see the
table in section 17.4.4), which is 1.26 higher than the -1,994.88 for logit.
This favors the probit model, but the difference is not great. For example, an
LR test of a single restriction rejects at the 0.05 level if the LR statistic
exceeds 3.84 or, equivalently, if the change in log likelihood exceeds
3.84/2 = 1.92.
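

A compact way to make this comparison is the estimates stats command, which lists the log likelihood, AIC, and BIC of stored results side by side. The sketch below assumes mus217hrs is in the working directory and reuses the storage names blogit and bprobit from section 17.4.4.

```stata
* Information criteria for the logit and probit models
use mus217hrs, clear
global xlist age hstatusg hhincome educyear married hisp

quietly logit ins retire $xlist
estimates store blogit
quietly probit ins retire $xlist
estimates store bprobit

* Same number of parameters, so AIC and BIC rank models by log likelihood
estimates stats blogit bprobit
```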


17.4.8 Generalized linear model estimation 


The logit and probit models are examples of generalized linear models, and
the glm command can yield identical results to the logit and probit
commands.


For example, to obtain logit ML estimates with heteroskedastic-robust
standard errors, we give the command


. * Logit command results duplicated using the glm command
. glm ins retire $xlist, link(logit) family(binomial) vce(robust)
 (output omitted)


If the family(binomial) option is dropped, we instead obtain NLS
estimates for the logit model. We have


. * NLS estimation of logit model using the glm command
. glm ins retire $xlist, link(logit) vce(robust) nolog

Generalized linear models                         Number of obs   =      3,206
Optimization     : ML                             Residual df     =      3,198
                                                  Scale parameter =   .2181226
Deviance         =  697.5562266                   (1/df) Deviance =   .2181226
Pearson          =  697.5562266                   (1/df) Pearson  =   .2181226

Variance function: V(u) = 1                       [Gaussian]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =   1.317671
Log pseudolikelihood = -2104.227411               BIC             =  -25119.19

------------------------------------------------------------------------------
             |               Robust
         ins | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      retire |   .2079366    .086513     2.40   0.016     .0383743    .3774988
         age |    -.01465   .0107263    -1.37   0.172    -.0356731    .0063732
    hstatusg |   .2473319   .0907609     2.73   0.006     .0694438      .42522
    hhincome |   .0045752   .0020291     2.25   0.024     .0005983    .0085522
    educyear |   .0994968   .0151952     6.55   0.000     .0697147    .1292788
     married |   .4839629   .1001032     4.83   0.000     .2877643    .6801616
        hisp |  -.7191678   .1941781    -3.70   0.000     -1.09975   -.3385857
       _cons |  -1.495278   .7061223    -2.12   0.034    -2.879252    -.111304
------------------------------------------------------------------------------


The estimated coefficients are generally within 20% of the logit estimates 
and have slightly higher standard errors. 


This last estimator can be used with proportions data and yields fitted 
values of the dependent variable that necessarily lie between zero and one; 
see sections 17.3.5 and 17.10. 


17.5 Goodness of fit and prediction 


The Stata output for the logit and probit regressions has a similar format. 
The log likelihood and the LR test of the joint significance of the regressors 
and its p-value are given. However, some measures of overall goodness of fit 
are desirable, including those that are specific to the binary outcome model. 


Three approaches to evaluating the fit of the model are pseudo-$R^2$
measures, comparisons of group-average predicted probabilities with sample
frequencies, and comparisons based on classification ($y$ equals zero or one).
None of these is the most preferred measure a priori. Below, we discuss
comparisons of model fit using predicted probabilities.


17.5.1 Pseudo-R2 measures 


In linear regression, the total sum of squared deviations from the mean can 
be decomposed into explained and residual sums of squares, and R2 
measures the ratio of explained sum of squares to total sum of squares, with 
0 and 1 as the lower and upper limits, respectively. These properties do not 
carry over to nonlinear regression. Yet there are some measures of fit that 
attempt to mimic the R2 measure of linear regression. There are several R2 
measures, one of which is included in the Stata output. 


McFadden's $\tilde{R}^2$ is computed as $1 - L_N(\widehat{\boldsymbol{\beta}})/L_N(\bar{y})$, where $L_N(\widehat{\boldsymbol{\beta}})$
denotes the maximized or fitted log-likelihood value and $L_N(\bar{y})$ denotes the
value of the log likelihood in the intercept-only model. When applied to
models with binary and multinomial outcomes, the lower and upper bounds
of the pseudo-$R^2$ measure are 0 and 1 (see section 13.8.1), though
McFadden's $\tilde{R}^2$ is not a measure of the proportion of variance of the
dependent variable explained by the model. For the fitted logit model,
$\tilde{R}^2 = 0.068$.

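

The reported value can be reproduced from results stored by logit. This sketch assumes mus217hrs is in the working directory; e(ll), e(ll_0), and e(r2_p) are the fitted log likelihood, the intercept-only log likelihood, and the pseudo-R2 that logit leaves behind.

```stata
* McFadden's pseudo-R2 computed from stored log likelihoods
use mus217hrs, clear
global xlist age hstatusg hhincome educyear married hisp
quietly logit ins retire $xlist

* 1 - lnL(fitted)/lnL(intercept only)
display "McFadden pseudo-R2 = " 1 - e(ll)/e(ll_0)

* The same quantity as stored by the logit command
display "Stored e(r2_p)     = " e(r2_p)
```

With the log likelihoods shown in the iteration log of section 17.4.2, this is 1 - 1994.8784/2139.7712, or about 0.068.
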
17.5.2 Comparing predicted frequencies with sample frequencies 


In-sample comparison of the average predicted probabilities, $N^{-1}\sum_i \widehat{p}_i$,
with the sample frequency, $\bar{y}$, is not helpful for evaluating the fit of binary
outcome models. In particular, the two are necessarily equal for logit models
that include an intercept because the logit MLE first-order conditions can be
shown to then impose this condition.
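

The step behind this claim can be sketched as follows, writing $\Lambda(\cdot)$ for the logistic c.d.f. (a symbol introduced here for the derivation) and using $\Lambda'(z) = \Lambda(z)\{1 - \Lambda(z)\}$ when differentiating (17.5):

```latex
\frac{\partial Q(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}
  = \sum_{i=1}^{N} \left\{ y_i - \Lambda(\mathbf{x}_i'\boldsymbol{\beta}) \right\} \mathbf{x}_i
  = \mathbf{0}
\quad \text{at } \boldsymbol{\beta} = \widehat{\boldsymbol{\beta}}.
% If x_i contains a constant, the corresponding first-order condition is
\sum_{i=1}^{N} \left( y_i - \widehat{p}_i \right) = 0
\quad \Longrightarrow \quad
N^{-1} \sum_{i=1}^{N} \widehat{p}_i = \bar{y},
\qquad \widehat{p}_i = \Lambda(\mathbf{x}_i'\widehat{\boldsymbol{\beta}}).
```

The same equality does not hold exactly for probit, which is consistent with the slightly different mean of the probit fitted probabilities reported in section 17.5.4.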


However, this comparison may be a useful thing to do for subgroups of 
observations. The Hosmer—Lemeshow specification test evaluates the 
goodness of fit by comparing the sample frequency of the dependent 
variable with the predicted frequency based on the fitted probability within 
subgroups of observations, with the number of subgroups being specified by 
the investigator. The null hypothesis is that the two are equal. The test is 
similar to the Pearson chi-squared goodness-of-fit test. 


Let $\widehat{p}_g$ and $\bar{y}_g$ denote, respectively, the average predicted probability and
sample average frequency in group $g$. The test statistic is
$\sum_{g=1}^{G} (\widehat{p}_g - \bar{y}_g)^2 / \{\bar{y}_g(1 - \bar{y}_g)\}$, where $g$ is the group subscript. The groups
are based on quantiles of the ordered predicted probabilities. For example, if
$G = 10$, then each group corresponds to a decile of the ordered $\widehat{p}_i$. Hosmer
and Lemeshow established the null distribution by simulation. Under the
null of correct specification, the statistic is distributed as $\chi^2(G - 2)$.
However, two caveats should be noted: First, the test outcome is sensitive to
the number of groups used in the specification. Second, much of what is
known about the properties of the test is based on Monte Carlo evidence of
the test's performance; see Hosmer and Lemeshow (1980) and Hosmer,
Lemeshow, and Sturdivant (2013). Simulation evidence suggests that, for a fixed
sample size, specifying many groups in the test causes a divergence between
the empirical c.d.f. and the c.d.f. of the $\chi^2(G - 2)$ distribution. For the
correct form of the chi-squared goodness-of-fit test, see section 11.9.3.


The goodness-of-fit test is performed by using the estat gof
postestimation command, which has the syntax


estat gof [if] [in] [weight] [, options]


where the group(#) option specifies the number of quantiles to be used to
group the data, with 10 being the default. This test is highly parametric and
can be used only if the initial model estimation uses default standard errors.


After estimating the parameters of the logit model, we perform this test, 
setting the number of groups to four. We obtain 


. * Hosmer-Lemeshow goodness-of-fit test with 4 groups
. qui logit ins retire $xlist

. estat gof, group(4) table     // Hosmer-Lemeshow goodness-of-fit test
note: obs collapsed on 4 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: ins

  Table collapsed on quantiles of estimated probabilities
  +--------------------------------------------------------+
  | Group |   Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
  |-------+--------+-------+-------+-------+-------+-------|
  |     1 | 0.2849 |   146 | 161.2 |   656 | 640.8 |   802 |
  |     2 | 0.3994 |   291 | 274.6 |   510 | 526.4 |   801 |
  |     3 | 0.4779 |   387 | 355.0 |   415 | 447.0 |   802 |
  |     4 | 0.9650 |   417 | 450.1 |   384 | 350.9 |   801 |
  +--------------------------------------------------------+

 Number of observations =  3,206
       Number of groups =      4
Hosmer-Lemeshow chi2(2) =  14.04
            Prob > chi2 = 0.0009


The first row of the table, for example, shows that for the 802 observations 
with predicted probabilities less than the lower quartile predicted probability 
of 0.2849, there were 146 cases of y = 1, while 802 times the average 
predicted probability equals 161.2, an overprediction of 15.2. Across the 
four rows, the model overpredicts cases of y = 1 in the lower and upper 
quartiles of the predicted probabilities and underpredicts in the interquartile 
range. The outcome indicates misspecification because the p-value is 0.001. 
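

To see how the statistic is built up from the table, the following Mata sketch recomputes it from the observed counts, expected counts, and group sizes shown above. It uses the form of the statistic that scales each squared deviation by the group's binomial variance, N_g p_g (1 - p_g), with p_g the group-average predicted probability; this scaling is our reading of how estat gof computes the statistic, and the vector names O, E, n, p, and hl are ours.

```stata
mata:
// Columns copied from the 4-group table: observed y=1, expected y=1, group size
O = (146 \ 291 \ 387 \ 417)
E = (161.2 \ 274.6 \ 355.0 \ 450.1)
n = (802 \ 801 \ 802 \ 801)

// Group-average predicted probability
p = E :/ n

// Sum over groups of (O - E)^2 / {n p (1 - p)}
hl = sum((O - E):^2 :/ (n :* p :* (1 :- p)))
hl

// p-value from the chi-squared distribution with G - 2 = 2 degrees of freedom
chi2tail(2, hl)
end
```

Because the expected counts in the table are rounded to one decimal place, the result is close to, but not exactly, the reported 14.04. The third and fourth groups contribute most of the statistic.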


To check whether the same outcome occurs if we use many groups to 
perform the test, we repeat the test for 10 groups. 


. * Hosmer-Lemeshow goodness-of-fit test with 10 groups
. estat gof, group(10)     // Hosmer-Lemeshow goodness-of-fit test
note: obs collapsed on 10 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: ins

 Number of observations =  3,206
       Number of groups =     10
Hosmer-Lemeshow chi2(8) =  31.48
            Prob > chi2 = 0.0001


Again, the test rejects the maintained specification, this time with an even 
smaller p-value. 


17.5.3 Comparing predicted outcomes with actual outcomes 


The preceding measure is based on the fitted probability of having private
insurance. We may instead want to predict the outcome itself, that is,
whether an individual has private insurance ($y = 1$) or does not have
insurance ($y = 0$). Strictly speaking, this depends upon a loss function. If we
assume a symmetric loss function, then it is natural to set $\widehat{y} = 1$ if
$F(\mathbf{x}'\widehat{\boldsymbol{\beta}}) > 0.5$ and $\widehat{y} = 0$ if $F(\mathbf{x}'\widehat{\boldsymbol{\beta}}) < 0.5$. One measure of goodness of fit is
the percentage of correctly classified observations.


Goodness-of-fit measures based on classification can be obtained by 
using the estat classification postestimation command. 


For the fitted logit model, we obtain 


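. * Comparing fitted probability and dichotomous outcome
. qui logit ins retire $xlist

. estat classification

Logistic model for ins

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |       345           308  |        653
     -     |       896          1657  |       2553
-----------+--------------------------+-----------
   Total   |      1241          1965  |       3206

Classified + if predicted Pr(D) >= .5
True D defined as ins != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   27.80%
Specificity                     Pr( -|~D)   84.33%
Positive predictive value       Pr( D| +)   52.83%
Negative predictive value       Pr(~D| -)   64.90%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   15.67%
False - rate for true D         Pr( -| D)   72.20%
False + rate for classified +   Pr(~D| +)   47.17%
False - rate for classified -   Pr( D| -)   35.10%
--------------------------------------------------
Correctly classified                        62.45%
--------------------------------------------------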


The table presents a "confusion matrix" that compares fitted and actual
values. The percentage of correctly classified values in this case is 62.45. In
this example, 308 observations are misclassified as 1 when the correct
classification is 0, and 896 values are misclassified as 0 when the correct
value is 1. The remaining 345 + 1657 observations are correctly classified.


The estat classification command also produces detailed output on
classification errors, using terminology that is commonly used in
biostatistics and is detailed in [R] logistic postestimation. The ratio
345/1241, called the sensitivity measure, gives the fraction of observations
with $y = 1$ that are correctly classified. The ratio 1657/1965, called the
specificity measure, gives the fraction of observations with $y = 0$ that are
correctly classified. The ratios 308/1965 and 896/1241 are referred to,
respectively, as the false positive and false negative classification error rates.


The classification statistics are less useful when models provide poor fit
and the event of interest occurs with probability close to 0 or 1. For example,
a logit model for whether an individual is unemployed ($y = 1$), a
low-probability event, may yield predicted probabilities less than 0.5 for all
individuals, in which case all observations are classified as zero.
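

One way to explore this sensitivity to the threshold is to change the cutoff used for classification. The sketch below, which assumes mus217hrs is in the working directory, reclassifies using a cutoff of 0.387, the sample frequency of ins reported in section 17.4.1; that choice of cutoff is ours and is only one of many possibilities.

```stata
* Classification at a cutoff other than the default 0.5
use mus217hrs, clear
global xlist age hstatusg hhincome educyear married hisp
quietly logit ins retire $xlist

* Classify as 1 if the predicted probability is at least 0.387
estat classification, cutoff(0.387)

* Plot sensitivity and specificity against all possible cutoffs
lsens
```

Lowering the cutoff raises sensitivity at the cost of specificity, which is the tradeoff summarized by the ROC curve in section 17.5.5.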


Machine-learning methods developed specifically for classification 
generally predict better than logit and probit models; section 28.6.7 provides 
a brief overview. 


17.5.4 The predict command for fitted probabilities 


Fitted probabilities can be computed by using the pr option of the predict
postestimation command defined in section 13.5.1. The difference between
logit and probit models may be small, especially over the middle portion of
the distribution. On the other hand, the fitted probabilities from the LPM fit
by OLS may be substantially different.


We first summarize the fitted probability from the three models that 
include only the hhincome variable as a regressor. 


. * Calculate and summarize fitted probabilities for models with a single 
> regressor 
. qui logit ins hhincome 


. predict plogit, pr 

. qui probit ins hhincome 
. predict pprobit, pr 

. qui regress ins hhincome 
. predict pols, xb 


. summarize ins plogit pprobit pols 


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         ins |      3,206    .3870867    .4871597          0          1
      plogit |      3,206    .3870867    .0787632   .3176578    .999738
     pprobit |      3,206    .3855051     .061285   .3349603   .9997945
        pols |      3,206    .3870867    .0724975   .3360834   1.814582


The mean and standard deviation are essentially the same in the three cases,
but the range of the fitted values from the LPM includes six inadmissible
values outside the [0, 1] interval. This fact should be borne in mind in
evaluating the graph given below, which compares the fitted probability
from the three models. The deviant observations from OLS stand out at the
extremes of the range of the distribution, but the results for logit and probit
cohere well.


For regressions with a single regressor, plotting predicted probabilities 
against that variable can be informative, especially if that variable takes a 
range of values. Such a graph illustrates the differences in the fitted values 
generated by different estimators. The example given below plots the fitted 
values from logit, probit, and LPM against household income (hhincome). For 
graph readability, the jitter() option is used to jitter the observed 0 and 1 
values, leading to a band of outcome values that are around 0 and 1 rather 
than exactly 0 or 1. 


. * Plot of predicted probabilities for models with a single regressor 
. sort hhincome 


. graph twoway (scatter ins hhincome, msize(vsmall) jitter(3))
>   (line plogit hhincome, clstyle(p1))
>   (line pprobit hhincome, clstyle(p2))
>   (line pols hhincome, clstyle(p3)),
>   plotregion(style(none))
>   title("Predicted probabilities across models")
>   xtitle("Household income (hhincome)", size(medlarge)) xscale(titlegap(*5))
>   ytitle("Predicted probability", size(medlarge)) yscale(titlegap(*5))
>   legend(pos(1) ring(0) col(1)) legend(size(small))
>   legend(label(1 "Actual data (jittered)") label(2 "Logit")
>   label(3 "Probit") label(4 "OLS"))


The first panel of figure 17.1 presents the results. The divergence
between the first two and the LPM (OLS) estimates at high values of income
stands out, though this is not necessarily serious because the number of
observations in the upper range of income is quite small. The fitted values
are close for most of the sample.


Figure 17.1. Predicted probabilities versus hhincome and receiver
operator characteristics curve following logit


17.5.5 Receiver operator characteristics curve 


Suppose a binary outcome model yields predicted probabilities $\widehat{p}_i$ and we
classify $\widehat{y}_i = 1$ if $\widehat{p}_i > c$, where the threshold $c$ varies from 0 to 1. If $c = 1$,
we predict $\widehat{y}_i = 0$ for all observations, so we never make the mistake of
classifying $\widehat{y}_i = 1$ when in fact $y_i = 0$. And if $c = 0$, we predict $\widehat{y}_i = 1$ for
all observations, so we will never make the mistake of classifying $\widehat{y}_i = 0$
when in fact $y_i = 1$.


The receiver operator characteristics (ROC) curve provides a graphical way to
plot the tradeoff between the two types of error as $c$ varies. Sensitivity
measures the fraction of observations with $y = 1$ that are correctly classified,
while specificity measures the fraction of observations with $y = 0$ that are
correctly classified. The ROC curve plots sensitivity against (1 - specificity).


The lroc command produces the ROC curve following the logit, probit,
and ivprobit commands. We do so for logit regression of ins on income,
though lroc can follow regression with multiple regressors.


. * ROC curve following logit estimation 
. qui logit ins hhincome 
. lroc, lwidth(thick) title("Receiver operator characteristics curve") 
> msize(tiny) 

Logistic model for ins 

Number of observations =     3206 
Area under ROC curve   =   0.6982 


The second panel of figure 17.1 plots the ROC curve. The area under the 
curve (AUC) is literally the area under the ROC curve; higher values are better. 
A perfectly fitting model has sensitivity of 1 and specificity of 1, and the 
AUC is then 1. A model with no explanatory power has an ROC curve that 
coincides with the diagonal line and has area equal to 0.5. The current model 
has AUC equal to 0.698. 


17.5.6 The margins command for fitted probabilities 


The predict command provides fitted probabilities for each individual, 
evaluating at $\mathbf{x} = \mathbf{x}_i$. At times, it is useful to instead obtain predicted 
probabilities at a representative value, $\mathbf{x} = \mathbf{x}^*$. 


This can be done by using the margins command, presented in 
section 13.6. For a 65-year-old, married, nonretired non-Hispanic with good 
health status, 17 years of education, and an income equal to $50,000 (so the 
income variable equals 50), we obtain 


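. * Fitted probabilities for selected baseline using margins 
. qui logit ins retire $xlist, vce(robust) 

. margins, at(age=65 retire=0 hstatusg=1 hhincome=50 educyear=17 married=1 hisp=0) 
> noatlegend 

Adjusted predictions                                     Number of obs = 3,206 
Model VCE: Robust 

Expression: Pr(ins), predict() 

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |   .5705896   .0250894    22.74   0.000     .5214154    .6197638
------------------------------------------------------------------------------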


The probability of having private insurance is 0.57 with the 95% confidence 
interval [0.52, 0.62]. 


Note that this reasonably tight confidence interval is for the probability 
that y = 1 given $\mathbf{x} = \mathbf{x}^*$. There is much more uncertainty in the outcome 
that y = 1 given $\mathbf{x} = \mathbf{x}^*$. For example, this difficulty in predicting actual 
values leads to the low $R^2$ for the logit model. This distinction is similar to 
that between predicting E(y|x) and predicting y|x discussed in 
sections 4.2.5 and 13.5.2. 


17.5.7 The prvalue command for fitted probabilities 


An alternative is to use the community-contributed prvalue postestimation 
command (Long and Freese 2014), which has syntax 


prvalue [if] [in] [, x(conditions) rest(stat) options] 


where we list two key options. The x(conditions) option specifies the 
conditioning values of the regressors, and the default rest(mean) option 
specifies that the unconditioned variables be set at their sample averages. 
Omitting x(conditions) means that the predictions are evaluated at $\mathbf{x} = \bar{\mathbf{x}}$. 


The prvalue command evaluated at the same regressor values yields 


. * Fitted probabilities for selected baseline using prvalue 
. qui logit ins retire $xlist, vce(robust) 


. prvalue, x(age=65 retire=0 hstatusg=1 hhincome=50 educyear=17 married=1 hisp=0) 

logit: Predictions for ins 

Confidence intervals by delta method 

                                95% Conf. Interval 
  Pr(y=1|x):          0.5706   [ 0.5214,    0.6198] 
  Pr(y=0|x):          0.4294   [ 0.3802,    0.4786] 

      retire       age  hstatusg  hhincome  educyear   married      hisp 
x=         0        65         1        50        17         1         0 


The predicted probability and associated confidence interval are the same as 
those using the margins command. 


17.6 Marginal effects 


Three variants of MEs, presented in section 13.7, are the AME, marginal 
effects at a representative value (MER), and marginal effects at the mean 
(MEM). In a nonlinear model, MEs are more informative than coefficients. 


The analytical formulas for the MEs for the standard binary outcome 
models were given in table 17.1. For example, for the logit model, the ME 
with respect to a change in a continuous regressor, $x_j$, evaluated at $\mathbf{x} = \bar{\mathbf{x}}$, is 
estimated by $\Lambda(\bar{\mathbf{x}}'\hat{\boldsymbol{\beta}})\{1 - \Lambda(\bar{\mathbf{x}}'\hat{\boldsymbol{\beta}})\}\hat{\beta}_j$. An associated confidence interval can 
be calculated by using the delta method. 
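
The logit expression follows from the chain rule. As a brief sketch, writing $\Lambda(z) = e^z/(1 + e^z)$ for the standard logistic cdf (the notation assumed here):

```latex
\frac{d\Lambda(z)}{dz} = \frac{e^{z}}{(1+e^{z})^{2}} = \Lambda(z)\{1-\Lambda(z)\}
\quad\Longrightarrow\quad
\frac{\partial \Pr(y=1|\mathbf{x})}{\partial x_j}
= \Lambda(\mathbf{x}'\boldsymbol{\beta})\{1-\Lambda(\mathbf{x}'\boldsymbol{\beta})\}\beta_j .
```

Because $\Lambda(1-\Lambda) \leq 0.25$, with equality at $\Lambda = 0.5$, the ME of a continuous regressor in the logit model cannot exceed one quarter of its coefficient in absolute value.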


17.6.1 AME 


The AME is the average of MEs for each individual and is obtained as the 
default option of the margins, dydx() postestimation command. The 
associated standard errors and confidence interval for the AME are obtained 
using the delta method. 


For the four binary regressors, we use the i. operator in the logit 
estimation command so that MEs are computed using the finite-difference 
method, rather than the calculus method; see, for example, section 13.7.5. 
For the fitted logit model, we obtain 


. * AME after logit 
. qui logit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp, 
> vce(robust) 

. margins, dydx(*) noatlegend // (AME) 

Average marginal effects                                 Number of obs = 3,206 
Model VCE: Robust 

Expression: Pr(ins), predict() 
dy/dx wrt:  1.retire age 1.hstatusg hhincome educyear 1.married 1.hisp 

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    1.retire |   .0426943   .0183334     2.33   0.020     .0067615    .0786272
         age |  -.0031693   .0023904    -1.33   0.185    -.0078543    .0015158
  1.hstatusg |   .0675283   .0197001     3.43   0.001     .0289167    .1061399
    hhincome |   .0005002    .000248     2.02   0.044     .0000141    .0009863
    educyear |   .0248111   .0030334     8.18   0.000     .0188658    .0307565
   1.married |   .1235562   .0194318     6.36   0.000     .0854706    .1616418
      1.hisp |  -.1608825   .0336124    -4.79   0.000    -.2267616   -.0950034
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level. 


For example, on average across individuals, retirement is associated with a 
0.0427 higher probability of having insurance. 


The AMEs for the regressors are approximately 0.2 times the logit 
coefficient estimates. And these logit model AMEs are similar to the 
coefficients from OLS regression given in section 17.4.4. This is the usual 
case for single-index models such as logit. 


The AME just computed is an unweighted sample average. To obtain a 
population AME, we should additionally use the weight modifier of the 
margins command. 
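
For a continuous regressor, the unweighted AME can be reproduced by hand from the fitted probabilities. The following is a minimal sketch on simulated data rather than the chapter's dataset; the variable names (x1, x2, y, p, me_x1) and the parameter values are illustrative assumptions.

```stata
* Sketch: AME of a continuous regressor computed by hand, using simulated data
* (all names and parameter values below are illustrative assumptions)
clear
set seed 10101
set obs 2000
generate x1 = rnormal()
generate x2 = rnormal()
generate y = runiform() < invlogit(0.3 + 0.8*x1 - 0.5*x2)
quietly logit y x1 x2
predict p, pr                          // fitted probability for each observation
generate me_x1 = p*(1 - p)*_b[x1]      // individual ME: Lambda(1 - Lambda)*beta_1
summarize me_x1                        // the sample mean of me_x1 is the AME
margins, dydx(x1)                      // reported dy/dx should equal that mean
```

The hand calculation gives only the point estimate; margins additionally supplies the delta-method standard error.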


17.6.2 MEM 


An alternative is to compute the ME for the average individual, using the 
atmeans option of the margins, dydx() command. We obtain 


. * MEM after logit 
. qui logit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp, 
> vce(robust) 

. margins, dydx(*) atmeans noatlegend // (MEM) 

Conditional marginal effects                             Number of obs = 3,206 
Model VCE: Robust 

Expression: Pr(ins), predict() 
dy/dx wrt:  1.retire age 1.hstatusg hhincome educyear 1.married 1.hisp 

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    1.retire |   .0457255   .0195761     2.34   0.020     .0073571    .0840939
         age |  -.0034129   .0025767    -1.32   0.185    -.0084631    .0016373
  1.hstatusg |   .0716613   .0206194     3.48   0.001      .031248    .1120745
    hhincome |   .0005386   .0002692     2.00   0.045      .000011    .0010663
    educyear |   .0267179   .0033438     7.99   0.000     .0201642    .0332716
   1.married |   .1295601   .0199356     6.50   0.000      .090487    .1686332
      1.hisp |  -.1677028   .0338153    -4.96   0.000    -.2339796   -.1014261
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level. 


For example, for the average individual, retirement is associated with a 
0.0457 higher probability of having insurance. In this particular case, the 
MEMs are 0% to 10% greater than the AMEs. 


17.6.3 MER 


At times, it is useful to obtain the ME for a representative individual also. 
Note that the preceding MEMs were computed at sample average values, such 
as at married = 0.733, which literally means 73.3% married. The at() 
option of the margins, dydx() command computes the MER. 


We use as a benchmark a 75-year-old, retired, married Hispanic with 
good health status, 12 years of education, and an income equal to 35. Then, 


. * MER after logit 
. qui logit ins i.retire age i.hstatusg hhincome educyear i.married i.hisp, 
> vce(robust) 

. margins, dydx(*) at(retire=1 age=75 hstatusg=1 hhincome=35 educyear=12 
> married=1 hisp=1) noatlegend // (MER) 

Conditional marginal effects                             Number of obs = 3,206 
Model VCE: Robust 

Expression: Pr(ins), predict() 
dy/dx wrt:  1.retire age 1.hstatusg hhincome educyear 1.married 1.hisp 

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    1.retire |   .0354151   .0150517     2.35   0.019     .0059142    .0649159
         age |  -.0027608   .0020208    -1.37   0.172    -.0067216       .0012
  1.hstatusg |   .0544316    .016298     3.34   0.001      .022488    .0863752
    hhincome |   .0004357   .0002199     1.98   0.048     4.70e-06    .0008668
    educyear |   .0216131   .0036858     5.86   0.000      .014389    .0288372
   1.married |   .0935092   .0172199     5.43   0.000     .0597587    .1272597
      1.hisp |  -.1794232   .0378334    -4.74   0.000    -.2535752   -.1052712
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level. 


For example, for this particular individual, retirement is associated with a 
0.0354 higher probability of having insurance. For most regressors, the MERs 
are approximately 20% smaller than the AMEs; the exception is hisp, whose MER 
(−0.179) is larger in absolute value than its AME (−0.161). 


17.6.4 The prchange command for MEs 


The community-contributed prchange command (Long and Freese 2014) 
supplements the ME calculation that can be obtained using the margins 
command by reporting changes in probability induced by several types of 
change in the regressor, and not just a unit change or an infinitesimal change 
in the regressor. 


The syntax is similar to that of prvalue, discussed in section 17.5.7, 

prchange [varname] [if] [in] [, x(conditions) rest(stat) options] 


where varname is the variable that changes. The default for the conditioning 
variables is the sample mean. 


The following gives the ME of a change in income (hhincome) evaluated 
at the mean of the regressors, $\mathbf{x} = \bar{\mathbf{x}}$. 

. * Computing change in probability after logit 
. qui logit ins retire $xlist, vce(robust) 

. prchange hhincome 

logit: Changes in Probabilities for ins 

          min->max      0->1     -+1/2    -+sd/2  MargEfct 
hhincome    0.5679    0.0005    0.0005    0.0346    0.0005 

               0         1 
Pr(y|x)   0.6272    0.3728 

          retire      age  hstatusg  hhincome  educyear   married      hisp 
    x=   .624766  66.9139   .704616   45.2639   11.8986   .733001   .072676 
 sd_x=   .484259  3.67579   .456286   64.3394   3.30461   .442461   .259645 


The output min->max gives the change in probability due to income changing 
from the minimum to the maximum observed value. The output 0->1 gives 
the change due to income changing from 0 to 1. The output -+1/2 gives the 
impact of income changing from a half unit below to a half unit above the 
base value. And the output -+sd/2 gives the impact of income changing 
from one-half a standard deviation below to one-half a standard deviation 
above the base value. The final column gives the MEM and equals the result 
obtained earlier using the margins, dydx(*) atmeans command. Adding 
the help option to this command generates explanatory notes for the 
computer output. 


17.7 Clustered data 


Section 13.9 presented in some detail various models and methods that can 
be applied when observations in the same cluster are correlated while 
observations in different clusters are uncorrelated. That discussion was 
illustrated using the Poisson model. Here we provide a brief summary of 
adaptation to logit and probit models. 


The simplest approach is to continue to use the logit and probit 
commands but use the vce(cluster clustvar) option. This maintains the 
assumption that the probability function for individual $i$ in cluster $g$ is 
$\Pr(y_{ig} = 1|\mathbf{x}_{ig}) = F(\mathbf{x}_{ig}'\boldsymbol{\beta})$. But it provides corrected standard errors that adjust 
for the loss in precision that arises because of observations no longer being 
independent within cluster. As with any cluster-robust estimate of standard 
errors, we assume that there are many clusters. 


Potentially more efficient estimates can be obtained by estimating using 
nonlinear feasible generalized least squares, assuming equicorrelation 
within cluster. For the logit model, this population-averaged approach uses 
the xtgee command with options family(binomial), link(logit), and 
corr(exchangeable). It is good practice to add the vce(robust) option 
because this provides cluster-robust standard errors that guard against 
within-cluster correlation not being exactly one of equicorrelation. An 
equivalent command is xtlogit, pa corr(exchangeable) vce(robust). 
One must first use the xtset command to define the cluster variable. 


Another way to obtain potentially more efficient estimates is to 
introduce a random intercept, so $\Pr(y_{ig} = 1|\mathbf{x}_{ig}, \alpha_g) = F(\alpha_g + \mathbf{x}_{ig}'\boldsymbol{\beta})$, where 
$\alpha_g$ is normally distributed. For the logit model, for example, one uses the 
xtlogit, re command. Equivalently, one can use the melogit command, 
which has the option of additionally allowing random-slope coefficients. 
Estimator consistency in this RE model requires that the random effects be 
independent and identically distributed $N(\alpha, \sigma^2)$. So even though there is a 
vce(robust) option, if it needs to be used, then the parameter estimates are 
most likely inconsistent. 
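
The three approaches described so far can be compared side by side. The following is a minimal sketch on simulated clustered data so that it is self-contained; the variable names (clid, a, x, y), the number of clusters, the cluster size, and the parameter values are illustrative assumptions rather than the chapter's data.

```stata
* Sketch: three estimators for clustered binary data, using simulated data
* (all names and parameter values below are illustrative assumptions)
clear
set seed 10101
set obs 200                        // 200 clusters
generate clid = _n                 // cluster identifier
generate a = 0.7*rnormal()         // cluster-specific random intercept
expand 10                          // 10 observations per cluster
generate x = rnormal()
generate y = runiform() < invlogit(-0.2 + 0.5*x + a)

* (1) Pooled logit with cluster-robust standard errors
logit y x, vce(cluster clid) nolog

* (2) Population-averaged logit assuming equicorrelation within cluster
xtset clid
xtgee y x, family(binomial) link(logit) corr(exchangeable) vce(robust) nolog

* (3) Random-intercept logit
xtlogit y x, re nolog
```

Approaches (1) and (2) estimate population-averaged coefficients, whereas the coefficients from (3) are conditional on the random intercept, so the slope from (3) is expected to be somewhat larger in absolute value than the other two.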


Fixed-effects estimation is possible when there are many observations 
within each cluster. For example, in a regular logit or probit regression, 
simply add i.clid as regressors, where clid denotes the cluster identifier 
variable. 


When there are few observations per cluster, consistent estimation of a 
fixed-effects model is generally no longer possible because of the incidental 
parameters problem. It is possible for the logit model, using the xtlogit, 
fe command, to fit a conditional logit model, though it is not possible for 
the probit model. An alternative is the correlated random-effects model that 
fits a random-effects model with cluster-means of regressors as additional 
regressors; see section 6.6.5 for the linear case and section 22.4.9 for a logit 
panel example. 


Recent research has proposed bias-corrected methods for logit and 
probit estimation with fixed effects introduced as dummy variables for the 
case where there is a moderate number of clusters. These methods are 
implemented using the community-contributed commands probitfe and 
logitfe for probit and logit models; see Cruz-Gonzalez, Fernández-Val, 
and Weidner (2017) and section 22.4.8. 


For proportions data, such as the share of an individual’s budget 
devoted to food, with the additional complication of clustering, the methods 
given in section 17.3.5 for the glm command can be adapted to the xtgee 
command. For the logit model, one uses the xtgee command with options 
link(logit), corr(exchangeable), and vce(robust). 


The same issues arise with panel data. In that case, each individual is a 
cluster, and the observations for a given cluster are data for each time 
period that the individual is observed. A lengthier discussion for panel data 
on binary outcomes is given in section 22.4. 


17.8 Additional models 


We present models for binary outcomes that specify a more flexible 
functional form for the probability than the logit or probit models. 


17.8.1 Heteroskedastic probit model 


The standard probit and logit models can be motivated using the 
latent-variable model (17.3). This model assumes homoskedasticity of the errors, $u$, 
in the latent-variable model (17.3), a restriction that can be relaxed. 

Letting $u_i$ in (17.3) be heteroskedastic with a variance of 

$$\sigma_i^2 = \exp(\mathbf{z}_i'\boldsymbol{\delta}) \tag{17.6}$$

we obtain the heteroskedastic probit model 

$$\Pr(y_i = 1|\mathbf{x}) = \Phi(\mathbf{x}_i'\boldsymbol{\beta}/\sigma_i) \tag{17.7}$$

where the exogenous variables $(z_1, \ldots, z_m)$ do not contain a constant, so that 
the restriction $\boldsymbol{\delta} = \mathbf{0}$ yields $\sigma_i^2 = 1$ as in the standard probit model (including 
a constant in $\mathbf{z}$ would make the model unidentified). 

Note that the novelty is allowing the latent variable $y^*$ to be 
heteroskedastic. The observed binary variable $y_i$ is necessarily 
heteroskedastic even in the simpler probit model. 


ML estimation can be based on (17.6) and (17.7). The parameters of the 
probit model with heteroskedasticity can be estimated with ML by using 
Stata’s hetprobit command. The syntax for hetprobit is 


hetprobit depvar [indepvars] [if] [in] [weight], 
het(varlist [, offset(varname_o)]) [options] 


As an illustration, we extend the probit model used in the preceding 
analysis, with z composed of the two variables age, already included in the x 
variable, and chronic. We obtain 


. * Heteroskedastic probit model 
. hetprobit ins retire $xlist, het(age chronic) nolog vce(robust) 

Heteroskedastic probit model                    Number of obs     =      3,206 
                                                Zero outcomes     =      1,965 
                                                Nonzero outcomes  =      1,241 

                                                Wald chi2(7)      =       0.96 
Log pseudolikelihood = -1992.804                Prob > chi2       =     0.9954 

------------------------------------------------------------------------------
             |               Robust
         ins | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
ins          |
      retire |   .1887091   .2547117     0.74   0.459    -.3105166    .6879348
         age |  -.0190353   .0329883    -0.58   0.564    -.0836911    .0456205
    hstatusg |   .2877448   .3768167     0.76   0.445    -.4508024    1.026292
    hhincome |   .0020191   .0030096     0.67   0.502    -.0038796    .0079178
    educyear |   .1142743   .1485502     0.77   0.442    -.1768787    .4054272
     married |   .5898016   .7571601     0.78   0.436    -.8942049    2.073808
        hisp |  -.7520952   .9298066    -0.81   0.419    -2.574483    1.070292
       _cons |  -1.380543   1.531981    -0.90   0.368     -4.38317    1.622083
-------------+----------------------------------------------------------------
lnsigma      |
         age |   .0084299   .0189439     0.44   0.656    -.0286993    .0455592
     chronic |  -.0406405   .0364957    -1.11   0.265    -.1121707    .0308898
------------------------------------------------------------------------------
Wald test of lnsigma=0: chi2(2) = 1.80                    Prob > chi2 = 0.4071 


The two models can be compared by using a Wald test of $\boldsymbol{\delta} = \mathbf{0}$ that is 
automatically implemented when the command is used. The Wald test 
indicates that at the 0.05 level, there is no statistically significant 
improvement in the model resulting from generalizing the homoskedastic 
model because p = 0.41. 


As a matter of modeling strategy, however, it is better to test first whether 
the z variables are omitted explanatory variables from the conditional mean 
model because such a misspecification is also consistent with variance 
depending on z. That is, the finding that z enters the variance function is also 
consistent with it having been incorrectly omitted from the conditional mean 
function. Accordingly, a variable addition test was also applied by adding 
chronic to the regressors in the probit model, and the p-value of the test was 
found to be 0.23. Thus, the evidence is also against the inclusion of chronic 
in the probit model. 


Parameter interpretation is more complicated in the heteroskedastic probit 
model, especially for a variable that appears in both x and z. The margins 
command provides MEs that control for this complication. 
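
To see where the complication comes from, consider a continuous variable $w$ that enters $\mathbf{x}$ with coefficient $\beta_w$ and $\mathbf{z}$ with coefficient $\delta_w$ (both symbols are introduced here for illustration). The following is a sketch obtained by differentiating (17.7) under (17.6); it is a derivation from those two equations, not a formula quoted from the Stata documentation.

```latex
\frac{\partial \Pr(y_i = 1|\mathbf{x})}{\partial w_i}
= \phi\!\left(\frac{\mathbf{x}_i'\boldsymbol{\beta}}{\sigma_i}\right)
  \frac{\beta_w - (\mathbf{x}_i'\boldsymbol{\beta})\,\delta_w/2}{\sigma_i},
\qquad \sigma_i = \exp(\mathbf{z}_i'\boldsymbol{\delta}/2).
```

The second term in the numerator arises because $w$ also shifts the error variance, so the sign of the ME need not match the sign of $\beta_w$. In the example above, age is such a variable because it appears in both the mean and the variance equations.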


17.8.2 Generalized logit 


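Stukel (1988) considered, as an alternative to the logit model, the generalized 
h-family logit model 

$$\Lambda_\alpha(\mathbf{x}'\boldsymbol{\beta}) = \frac{e^{h_\alpha(\mathbf{x}'\boldsymbol{\beta})}}{1 + e^{h_\alpha(\mathbf{x}'\boldsymbol{\beta})}} \tag{17.8}$$

where $h_\alpha(\mathbf{x}'\boldsymbol{\beta})$ is a strictly increasing nonlinear function of $\mathbf{x}'\boldsymbol{\beta}$ indexed by 
two shape parameters $\alpha_1$ and $\alpha_2$ that govern, respectively, the heaviness of 
the tails and the symmetry of the $\Lambda(\cdot)$ function. 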


Rather than fit this richer model, Stukel proposed testing whether (17.8) 
is a better model by using a Lagrange multiplier, or score, test; see 
section 11.5. This test has the advantage that it requires estimation only of the 
null hypothesis logit model rather than of the more complicated model (17.8). 
Furthermore, the Lagrange multiplier test can be implemented by 
supplementing the logit model regressors with generated regressors that are 
functions of $\mathbf{x}'\hat{\boldsymbol{\beta}}$ and by testing the significance of these augmented 
regressors. 

For example, to test for departure from the logit in the direction of an 
asymmetric h-family, we add the generated regressor $(\mathbf{x}_i'\hat{\boldsymbol{\beta}})^2$ to the list of 
regressors, refit the logit model, and test whether the added variable 
significantly improves the fit of the model. We have 


. * Stukel score or Lagrange multiplier test for asymmetric h-family logit 
. qui logit ins retire $xlist 

. predict xbhat, xb 

. generate xbhatsq = xbhat^2 

. qui logit ins retire $xlist xbhatsq, vce(robust) 

. test xbhatsq 

 ( 1)  [ins]xbhatsq = 0 

           chi2(  1) =   28.84 
         Prob > chi2 =    0.0000 


The null hypothesis of correct model specification is strongly rejected 
because the Wald test of zero coefficient for the added regressor $(\mathbf{x}'\hat{\boldsymbol{\beta}})^2$ 
yields a $\chi^2(1)$ statistic of 28.8 with p = 0.000. 


This test is easy to apply and so are several other score tests suggested by 
Stukel that use the variable-augmentation approach. At the same time, recall 
from section 3.7.6 that tests have power in more than one direction. Thus, 
rejection in the previous example may be for reasons other than the need for 
an asymmetric h-family logit model. For example, perhaps it is enough to use 
a logit model with additional inclusion of polynomials in the continuous 
regressors or inclusion of additional variables as regressors. 


17.8.3 Alternative-specific random parameters logit or mixed logit 


Binary outcome data are often the result of choice between two alternatives. 
Let the two alternatives be denoted 0 and 1, and suppose that some regressors 
vary at both the individual level and the alternative level, while other 
regressors vary only at the individual level. Specifically, let $\mathbf{x}_{i0}$ and $\mathbf{x}_{i1}$ denote 
values of regressors that vary across the two alternatives, and let $\mathbf{z}_i$ denote 
regressors that do not vary across alternatives. A simple form of this model 
can be fit by logit regression of $y_i$ on $\mathbf{x}_{i1} - \mathbf{x}_{i0}$ and $\mathbf{z}_i$, where $y_i = 1$ if 
alternative 1 is chosen. 


The alternative-specific random parameters logit model or mixed model 
specifies the alternative-varying regressors $\mathbf{x}_{i0}$ and $\mathbf{x}_{i1}$ to have associated 
parameters $\boldsymbol{\beta}_i$ that vary at the individual level as independent draws from a 
normal distribution. This model is used much more with multinomial 
outcomes, rather than binary outcomes, and is presented in detail in 
section 18.8. 


Note that the model just described is a random parameters or mixed 
model different from the models in sections 13.9.3 and 17.7, which have 
intercept and possibly slope coefficients varying by cluster. Here the variation 
is by individual for alternative-specific regressors. 


17.8.4 Nonparametric logit estimation 


The focus of this chapter has been obtaining parameter estimates and 
associated MEs for binary outcomes. 


We may instead want to predict the outcome itself, for example, whether 
an individual has private insurance (y = 1) or does not have insurance 
(y = 0). Methods specifically intended for classification that are more 
flexible than logit and probit, such as discriminant analysis and support 
vector machines, are summarized in section 28.6.7. 


And we may want to predict the probability that y = 1. This is an 
essential input for propensity-score matching methods and inverse- 
probability weighted estimators that are extensively used in the treatment- 
effects literature. One way to proceed is to fit a logit model with a flexible 
combination of regressors that includes interactions and powers of key 
variables. An alternative method fits a nonparametric model, albeit one that 
restricts the probability to lie between 0 and 1. 


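We consider the local logit estimator at $\mathbf{x} = \mathbf{x}_0$ that maximizes with 
respect to $\alpha_0$ and $\boldsymbol{\beta}_0$ the weighted logit log density 

$$\sum_{i=1}^{N} w_h(\mathbf{x}_i - \mathbf{x}_0) \times \left( y_i \ln \Lambda\{\alpha_0 + (\mathbf{x}_i - \mathbf{x}_0)'\boldsymbol{\beta}_0\} + (1 - y_i) \ln[1 - \Lambda\{\alpha_0 + (\mathbf{x}_i - \mathbf{x}_0)'\boldsymbol{\beta}_0\}] \right)$$

where $w_h(\mathbf{x}_i - \mathbf{x}_0)$ are kernel weights. This is the same as the local linear 
estimator presented in section 27.2, except that squared residuals are replaced 
by the log density of the logit model. 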
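
With a single regressor, the estimator at one evaluation point amounts to a kernel-weighted logit fit on the regressor centered at that point. The following is a minimal sketch on simulated data; the variable names (x, y, u, w, xc), the evaluation point x0 = 0.5, the bandwidth h = 0.5, the Epanechnikov kernel, and the data-generating process are all illustrative assumptions, and the sketch does not reproduce the internals of any community-contributed command.

```stata
* Sketch: local logit at a single point x0, using simulated data
* (all names and parameter values below are illustrative assumptions)
clear
set seed 10101
set obs 2000
generate x = runiform(-2, 2)
generate y = runiform() < invlogit(sin(2*x))    // nonlinear true index
scalar x0 = 0.5                                 // evaluation point
scalar h  = 0.5                                 // bandwidth
generate u  = (x - x0)/h
generate w  = 0.75*(1 - u^2)*(abs(u) < 1)       // Epanechnikov kernel weights
generate xc = x - x0                            // regressor centered at x0
quietly logit y xc [iweight = w] if w > 0       // kernel-weighted logit log likelihood
display "Local logit estimate of Pr(y=1|x=x0) = " invlogit(_b[_cons])
display "True probability at x0               = " invlogit(sin(2*x0))
```

Because the regressor is centered at x0, the fitted intercept plays the role of $\alpha_0$, and the logistic function of that intercept is the local estimate of the probability at x0. Repeating the fit over a grid of evaluation points traces out the whole curve.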


The community-contributed ivqte package (see section 25.9) includes a 
command locreg that with option logit implements the local logit estimator. 


We apply this command to the example of this chapter. For continuous 
regressors, we use the default epan2 with bandwidth in the range (0.5, 0.8), 
and for binary regressors, we use the Li-Racine kernel with bandwidth in the 
range (0.8, 1). We obtain 


. * Local logit regression 
. locreg ins, dummy(retire hstatusg married hisp) 
> continuous(age hhincome educyear) logit bandwidth(0.5 0.8) 
> lambda(0.8 1.0) generate(ploclog, replace) 

Leave-one-out cross-validation 

Bandwidth    Mean Squared Error 
   .5            .23070108 
   .5            .23062031 
   .8            .21783086 
   .8            .21734733 

Among the grid of values tested, the optimal bandwidth is .8 and the optimal 
> lambda is 1. 

The fitted values obtained with the optimal smoothing parameters have been 
> saved in ploclog. 


We obtain logit ML predictions and compare them with those from the 
local logit model. 


. * Compare local logit and logit ML predicted probabilities 
. qui logit ins retire $xlist 

. qui predict plogitml 

. qui scatter ploclog plogitml, xtitle("Logit ML predicted probability") 
> ytitle("Local logit predicted probability") msize(tiny) 
> scale(1.2) 


Figure 17.2 plots the predicted probabilities from the local logit model 
against those obtained following logit ML estimation. The local logit model 
provides a greater spread of predicted values, including some predictions at 
the boundary values of 0 and 1. In principle, this greater spread is preferred, 
though it will pose a challenge for inverse-probability weighting methods. 


Figure 17.2. Predicted probabilities from local logit compared with 
logit ML 


17.9 Endogenous regressors 


The probit and logit ML estimators are inconsistent if any regressor is 
endogenous. Two distinct broad approaches are used to correct for 
endogeneity. 


The structural approach specifies a complete model that explicitly 
models both nonlinearity and endogeneity. The specific structural model 
used differs according to whether the endogenous regressor is discrete or 
continuous. ML estimation is most efficient, but computationally simpler 
(albeit less efficient) two-step estimators are often used. Wooldridge (2010, 
chap. 15.7) provides a detailed exposition of these methods. 


The alternative partial model or semiparametric approach defines an 
error term for the equation of interest and uses the IV estimator based on the 
orthogonality of instruments and this error term. 

As in the linear case, a key requirement is the existence of one or more 
valid instruments that do not directly explain the binary dependent variable 
but are correlated with the endogenous regressor. Unlike the linear case, 
different approaches to controlling for endogeneity can lead to different 
estimators even in the limit because the parameters of different models are 
being estimated. 


17.9.1 Example 


We again model the binary outcome ins, though we use a different set of 
regressors. The regressors include the continuous variable linc (the log of 
household income) that is potentially endogenous because purchase of 
supplementary health insurance and household income may be subject to 
correlated unobserved shocks, even after controlling for a variety of 
exogenous variables. That is, for the HRS sample under consideration, the 
choice of supplementary insurance (ins), as well as household income 
(linc), may be considered as jointly determined. 


Regular probit regression that does not control for this potential 
endogeneity yields 


. * Endogenous probit using inconsistent probit MLE 
. generate linc = log(hhincome) 
(9 missing values generated) 

. global xlist2 female age age2 educyear married hisp white chronic adl hstatusg 

. probit ins linc $xlist2, vce(robust) nolog 

Probit regression                               Number of obs     =      3,197 
                                                Wald chi2(11)     =     366.94 
                                                Prob > chi2       =     0.0000 
Log pseudolikelihood = -1933.4275               Pseudo R2         =     0.0946 

---------------------------------------------------------------------
             |               Robust
         ins | Coefficient  std. err.   P>|z|     [95% conf. interval]
-------------+-------------------------------------------------------
        linc |   .3466893   .0402173   0.000     .2678648    .4255137
      female |  -.0815374   .0508549   0.109    -.1812112    .0181364
         age |   .1162879   .1151924   0.313     -.109485    .3420608
        age2 |  -.0009395   .0008568   0.273    -.0026187    .0007397
    educyear |   .0464387   .0089917   0.000     .0288153    .0640622
     married |   .1044152   .0636879   0.101    -.0204108    .2292412
        hisp |  -.3977334   .1080935   0.000    -.6095927   -.1858741
       white |  -.0418296   .0644391   0.516     -.168128    .0844687
     chronic |   .0472903   .0186231   0.011     .0107897    .0837909
         adl |  -.0945039   .0353534   0.008    -.1637953   -.0252125
    hstatusg |   .1138708   .0629071   0.070    -.0094248    .2371664
       _cons |  -5.744548   3.871615   0.138    -13.33277    1.843677
---------------------------------------------------------------------


The regressor linc has coefficient 0.35 and is quite precisely estimated with 
a standard error of 0.04. The associated ME at $\mathbf{x} = \bar{\mathbf{x}}$, computed using the 
margins, dydx(linc) atmeans command, is 0.13. This implies that for the 
average individual, a 10% increase in household income (a change of 0.1 in 
linc) is associated with an increase of 0.013 in the probability of having 
supplementary health insurance. 
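
The arithmetic behind this statement is a first-order approximation. A brief sketch, using the rounded ME of 0.13 reported above:

```latex
\Delta \Pr(\mathtt{ins} = 1) \approx 0.13 \times \Delta \mathtt{linc} = 0.13 \times 0.1 = 0.013,
\qquad \text{where } \ln(1.10) \approx 0.095 \approx 0.1 .
```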


17.9.2 Structural model 


We restrict attention to the case of a single continuous endogenous regressor 
in a binary outcome model. For a discrete endogenous regressor, other 
methods should be used. 


We consider the following linear latent-variable model, in which $y_1^*$ is 
the dependent variable in the structural equation and $y_2$ is a continuous 
endogenous regressor in this equation. These two endogenous variables are 
modeled as linear in exogenous variables $\mathbf{x}_1$ and $\mathbf{x}_2$. That is, 

$$y_{1i}^* = \beta y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\gamma} + u_i \tag{17.9}$$

$$y_{2i} = \mathbf{x}_{1i}'\boldsymbol{\pi}_1 + \mathbf{x}_{2i}'\boldsymbol{\pi}_2 + v_i \tag{17.10}$$

where $i = 1, \ldots, N$; $\mathbf{x}_1$ is a $K_1 \times 1$ vector of exogenous regressors; and 
$\mathbf{x}_2$ is a $K_2 \times 1$ vector of additional instrumental variables that affect $y_2$ but 
can be excluded from (17.9) because they do not directly affect $y_1$. 
Identification requires that $K_2 \geq 1$. 


The variable $y_1^*$ is latent and hence is not directly observed. Instead, the 
binary outcome $y_1$ is observed, with $y_1 = 1$ if $y_1^* > 0$ and $y_1 = 0$ if $y_1^* \leq 0$. 

Equation (17.9) is interpreted as being “structural”; see section 7.2.1. 
System (17.9)-(17.10) has a special restricted triangular structure because 
(17.10) restricts $y_{2i}$ to not depend on $y_{1i}^*$, once $\mathbf{x}_{1i}$ and $\mathbf{x}_{2i}$ are included. 


Equation (17.10), called a first-stage equation or reduced-form equation, 
serves only as a source of identifying instruments. It explains the variation in 
the endogenous variable in terms of strictly exogenous variables, including 
the instrumental variables X2 that are excluded from the structural equation. 
These excluded instruments, previously discussed in sections 7.2 and 7.3 
within the context of linear models, are essential for identifying the 
parameters of the structural equation. 


Given the specification of the structural and first-stage equations, 
estimation can be simultaneous (that is, joint) or sequential. 


Before presenting these estimators, we note that an alternative model, a 
simultaneous-equations model, replaces the reduced-form equation (17.10) 
for $y_{2i}$ with a structural equation that additionally includes $y_{1i}^*$ as a regressor. 
The community-contributed cdsimeq command (Keshk 2003) implements a 
two-stage estimation method for that model. 


17.9.3 Structural-model estimators 


The structural-model approach completely specifies the distributions of $y_1^*$ 
and $y_2$ in (17.9) and (17.10). It is assumed that $(u_i, v_i)$ are jointly normally 
distributed, that is, $(u_i, v_i) \sim N(\mathbf{0}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma} = (\sigma_{jk})$. In the binary 
probit model, the coefficients are identified up to a scale factor only; hence, 
by scale normalization, $\sigma_{11} = 1$. The assumptions imply that 
$u_i|v_i = \rho v_i + \varepsilon_i$, where $E(\varepsilon_i|v_i) = 0$. A test of the null hypothesis of 
exogeneity of $y_2$ is equivalent to the test of $H_0\colon \rho = 0$ because then $u_i$ and $v_i$ 
are independent. 

This approach relies greatly on the distributional assumptions. Consistent 
estimation requires both normality and homoskedasticity of the errors $u_i$ and 
$v_i$. 
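
The consequences of endogeneity under these assumptions can be seen in a small simulation. The following is a minimal sketch that generates data from a triangular system of the form (17.9)-(17.10) with jointly normal errors; the variable names (z, x1, v, u, y2, y1), the sample size, and all parameter values are illustrative assumptions.

```stata
* Sketch: probit versus ivprobit on simulated data with an endogenous regressor
* (all names and parameter values below are illustrative assumptions)
clear
set seed 10101
set obs 5000
generate z  = rnormal()                          // excluded instrument
generate x1 = rnormal()                          // exogenous regressor
generate v  = rnormal()                          // first-stage error, variance 1
generate u  = 0.5*v + sqrt(0.75)*rnormal()       // Var(u) = 1 and Corr(u,v) = 0.5
generate y2 = 0.5*x1 + 1.0*z + v                 // first-stage equation
generate y1 = (0.5*y2 + 0.5*x1 + u > 0)          // observed binary outcome

probit y1 y2 x1, nolog                           // ignores the endogeneity of y2
ivprobit y1 x1 (y2 = z), nolog                   // ML estimator with test of exogeneity
```

In this design the true coefficient of y2 is 0.5. Because u and v are positively correlated, the ordinary probit estimate is expected to be biased away from 0.5, while the ivprobit estimate should be close to it and the reported test of exogeneity should reject.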


The ivprobit command 


The syntax of ivprobit is similar to that of ivregress, discussed in 
section 7.4. 


ivprobit depvar [varlist1] (varlist2 = varlist_iv) [if] [in] [weight] 
[, mle_options] 


where varlist2 refers to the endogenous variable $y_2$ and varlist_iv refers to 
the instruments $\mathbf{x}_2$ that are excluded from the equation for $y_1^*$. The default 
version of ivprobit delivers ML estimates, and the twostep option yields 
two-step estimates. The postestimation command margins following ML 
estimations provides marginal effects; see the Remarks and examples section 
of [R] ivprobit postestimation. 


ML estimates 


For this example, we use as instruments two excluded variables, retire and 
sretire. These refer to, respectively, individual retirement status and spouse 
retirement status. These are likely to be correlated with linc because 
retirement will lower household income. The key assumption for instrument 
validity is that retirement status does not directly affect choice of 
supplementary insurance. This assumption is debatable, and this example is 
best viewed as merely illustrative. We apply ivprobit, obtaining ML 
estimates: 


. * Endogenous probit using ivprobit ML estimator 
. global ivlist2 retire sretire 
. ivprobit ins $xlist2 (linc = $ivlist2), vce(robust) nolog 


Probit model with endogenous regressors         Number of obs =  3,197 
                                                Wald chi2(11) = 382.35 
Log pseudolikelihood = -5407.7151               Prob > chi2   = 0.0000 

                             Robust 
              Coefficient   std. err.      z    P>|z|     [95% conf. interval] 

        linc    -.5338252   .3852132    -1.39   0.166    -1.288829    .2211788 
      female    -.1394072   .0494471    -2.82   0.005    -.2363218   -.0424926 
         age     .2862293   .1280821     2.23   0.025     .0351929    .5372656 
        age2    -.0021472   .0009318    -2.30   0.021    -.0039735   -.0003209 
    educyear     .1136881   .0237914     4.78   0.000     .0670579    .1603183 
     married     .7058309   .2377594     2.97   0.003      .239831    1.171831 
        hisp    -.5094514   .1049487    -4.85   0.000     -.715147   -.3037558 
       white     .1563454   .1035674     1.51   0.131    -.0466429    .3593338 
     chronic     .0061939    .027525     0.23   0.822    -.0477542     .060142 
         adl    -.1347664   .0349799    -3.85   0.000    -.2033258   -.0662071 
    hstatusg     .2341789   .0709755     3.30   0.001     .0950694    .3732883 
       _cons    -10.00787   4.065771    -2.46   0.014    -17.97664   -2.039107 

 corr(e.linc, 
       e.ins)    .5879559   .2355329                     -.0309872    .8809669 
  sd(e.linc)     .7177787   .0167816                      .6856296    .7514352 

Wald test of exogeneity (corr = 0): chi2(1) = 3.51    Prob > chi2 = 0.0610 
Instrumented: linc 
 Instruments: female age age2 educyear married hisp white chronic adl 
              hstatusg retire sretire 


The output includes a test of the null hypothesis of exogeneity, that is, 
$H_0$: $\rho = 0$. The p-value is 0.061, so $H_0$ is not rejected at the 0.05 level, 
though it is rejected at the 0.10 level. That the estimated correlation coefficient is 
positive indicates a positive correlation between $u$ and $v$. Those unmeasured 
factors that make it more likely for an individual to have a higher household 
income also make it more likely that the individual will have supplementary 
health insurance, conditional on other regressors included in the equation. 


Given the large estimated value for $\rho$ ($\hat{\rho} = 0.59$), we should expect that 
the coefficients of the estimated probit and ivprobit models differ. This is 
indeed the case, both for the endogenous regressor linc and for the other 
regressors. The coefficient of linc actually changes sign (from 0.35 to 
$-0.53$), so that an increase in household income is estimated to lower the 
probability of having supplementary insurance. One possible explanation is 
that richer people are willing to self-insure for medical services not covered 
by Medicare. At the same time, IV estimation has led to much greater 
imprecision, with the standard error increasing from 0.04 to 0.39, so that the 
negative coefficient is not statistically significantly different from 0 at the 
0.05 level. Taken at face value, however, the result suggests that the probit 
command that neglects endogeneity leads to an overestimate of the effect of 
household income. The remaining coefficients exhibit the same sign pattern 
as in the ordinary probit model, and the differences in the point estimates are 
within the range of estimated standard errors. 


The same ML estimates are obtained using the following eprobit 
command: 


. * Endogenous probit using eprobit ML estimator (an ERM command) 
. eprobit ins $xlist2, endogenous(linc = $xlist2 $ivlist2) vce(robust) nolog 


(output omitted ) 


The suppressed output includes estimates of the first-stage equation for linc. 
To obtain this additional output using the ivprobit command, one adds the 
first option. 


The eprobit command is a member of the class of extended regression 
model commands that provide ML estimates for models with endogenous 
regressors or sample selection, or both, when the full model is a recursive 
model such as (17.9) and (17.10) and error terms are jointly normally 
distributed. Section 23.7 provides an overview of extended regression model 
commands and several examples of the eprobit command. 


Control function estimator 


Rivers and Vuong (1988) proposed a control function estimator analogous to 
that presented in section 7.4.7. The residual $\hat{v}_i$ from first-stage 
OLS regression of (17.10) is included as an additional regressor in a 
second-stage probit regression of (17.9). A test of whether the coefficient of this 
extra term is zero is a test of endogeneity. If there is endogeneity, then 
adjustment is needed because inference needs to account for first-stage 
estimation to compute $\hat{v}_i$ and because parameter estimates need to be 
rescaled by $\sqrt{1 - \rho^2}$, where $\rho = \mathrm{Cor}(u_i, v_i)$. One can bootstrap or adapt the 
two-step method of section 13.3.11. 


This two-step estimator requires the same strong distributional 
assumptions as the MLE. The advantage is computational, especially for 
extension to more than one endogenous regressor. 
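

The two steps of the control function estimator can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the globals $xlist2 and $ivlist2 are defined as in this section, and the variable name vhat is ours. The second-stage standard errors reported by probit do not account for first-stage estimation, so they are appropriate only for the test of exogeneity under the null hypothesis, and the coefficients are not rescaled.

```stata
* Sketch of the Rivers-Vuong control function estimator
* (assumes globals xlist2 and ivlist2 are defined; vhat is a hypothetical name)

* First stage: OLS of the reduced form (17.10)
quietly regress linc $xlist2 $ivlist2
predict double vhat, residuals

* Second stage: probit of (17.9) with the first-stage residual added
probit ins linc $xlist2 vhat, nolog

* Test of exogeneity: coefficient of vhat equals zero
test vhat
```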


Two-step sequential estimates 


Newey (1987) proposed an alternative two-step estimator, a minimum 
chi-squared estimator, that also is computationally simpler than the MLE but is 
more efficient than the preceding two-step estimator of Rivers and 
Vuong (1988). It also requires the same strong distributional assumptions as 
the MLE. 


The estimator is implemented by using ivprobit with the twostep 
option. 


We do so for our data, using the first option, which also provides the 
least-squares (OLS) estimates of the first stage. Note that robust standard 
errors are unavailable. 


. * Endogenous probit using ivprobit two-step estimator 
. ivprobit ins $xlist2 (linc = $ivlist2), twostep first 
Checking reduced-form model... 

first-stage regression 

      Source          SS          df       MS       Number of obs  =   3,197 
                                                    F(12, 3184)    =  188.99 
       Model    1173.12053        12   97.7600445   Prob > F       =  0.0000 
    Residual    1647.03826     3,184   .517285885   R-squared      =  0.4160 
                                                    Adj R-squared  =  0.4138 
       Total    2820.15879     3,196   .882402626   Root MSE       =  .71923 

        linc   Coefficient  Std. err.      t    P>|t|     [95% conf. interval] 

      retire    -.0909581   .0288119    -3.16   0.002    -.1474499   -.0344663 
     sretire    -.0443106   .0317252    -1.40   0.163    -.1065145    .0178932 
      female    -.0936494   .0297304    -3.15   0.002     -.151942   -.0353569 
         age     .2669284   .0627794     4.25   0.000     .1438361    .3900206 
        age2    -.0019065   .0004648    -4.10   0.000    -.0028178   -.0009952 
    educyear      .094801   .0043535    21.78   0.000     .0862651    .1033369 
     married     .7918411   .0367275    21.56   0.000     .7198291    .8638531 
        hisp    -.2372014   .0523874    -4.53   0.000    -.3399179    -.134485 
       white     .2324672   .0347744     6.69   0.000     .1642847    .3006496 
     chronic    -.0388345   .0100852    -3.85   0.000    -.0586086   -.0190604 
         adl    -.0739895   .0173458    -4.27   0.000    -.1079995   -.0399795 
    hstatusg     .1748137   .0338519     5.16   0.000       .10844    .2411875 
       _cons    -7.702456   2.118657    -3.64   0.000    -11.85653   -3.548385 

Two-step probit with endogenous regressors      Number of obs =  3,197 
                                                Wald chi2(11) = 222.51 
                                                Prob > chi2   = 0.0000 

               Coefficient  Std. err.      z    P>|z|     [95% conf. interval] 

        linc    -.6109088   .5723054    -1.07   0.286    -1.732607    .5107893 
      female     -.167917   .0773839    -2.17   0.030    -.3195867   -.0162473 
         age     .3422526   .1915485     1.79   0.074    -.0331756    .7176808 
        age2    -.0025708   .0014021    -1.83   0.067    -.0053188    .0001773 
    educyear       .13596   .0543047     2.50   0.012     .0295249    .2423952 
     married     .8351517    .441743     1.89   0.059    -.0306487    1.700952 
        hisp    -.6184546    .181427    -3.41   0.001    -.9740451   -.2628642 
       white     .1818279   .1528281     1.19   0.234    -.1177098    .4813655 
     chronic     .0095837   .0309618     0.31   0.757    -.0511004    .0702678 
         adl    -.1630884   .0568288    -2.87   0.004    -.2744709   -.0517059 
    hstatusg     .2809463   .1228386     2.29   0.022     .0401871    .5217055 
       _cons    -12.04848   5.928158    -2.03   0.042    -23.66746   -.4295071 

Wald test of exogeneity: chi2(1) = 3.57         Prob > chi2 = 0.0588 
Instrumented: linc 
 Instruments: female age age2 educyear married hisp white chronic adl 
              hstatusg retire sretire 


The results of the two-step estimator are similar to those from the ivprobit 
ML estimation. The coefficient estimates are within 20% of each other. The 
standard errors are increased by approximately 50%, indicating a loss of 
precision in two-step estimation compared with ML estimation. The test 
statistic for exogeneity of linc has a p-value of 0.059 compared with 0.061 
using ML. The results for the first stage indicate that one of the two excluded 
instrumental variables has a strong predictive value for linc. Because this is 
a reduced-form equation, we do not attempt an interpretation of the results. 


Weak instruments 


If instruments are weakly correlated with the endogenous regressor, then the 
usual asymptotic theory may perform poorly. Then alternative methods may 
be used. Inference for weak instruments for the linear model is presented in 
section 7.7. 


The discussion there includes methods based on minimum distance 
estimation due to Magnusson (2010) that can be applied to a wide range of 
linear and nonlinear structural models. The community-contributed rivtest 
(Finlay and Magnusson 2009) and weakiv programs (Finlay, Magnusson, 
and Schaffer 2014) apply these methods following estimation using 
ivprobit. 


17.9.4 Linear two-stage least-squares approach 


A popular approach is to ignore the binary nature of the dependent variable 
$y_1$ (ins) and simply estimate by two-stage least squares (2SLS). 


Advantages of the linear 2SLS estimator are its computational simplicity 
and the ability to use tests of validity of overidentifying instruments and 
diagnostics for weak instruments in linear models that were presented in 
chapter 7. At the same time, formal tests and inference that require normal 
homoskedastic errors will be inappropriate because of the intrinsic 
heteroskedasticity when the dependent variable is binary. 


Our own view is that at best linear 2SLS may provide an initial guide, but 
ultimately one should use other methods that allow for the binary nature of 
the dependent variable. 


We have the standard linear formulation for the observed variables 
$(y_1, y_2)$, 

$$
\begin{aligned}
y_{1i} &= \beta y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\gamma} + u_i \\
y_{2i} &= \mathbf{x}_{1i}'\boldsymbol{\pi}_1 + \mathbf{x}_{2i}'\boldsymbol{\pi}_2 + v_i
\end{aligned}
\tag{17.11}
$$

where $y_2$ is endogenous and the covariates $\mathbf{x}_2$ are the excluded 
exogenous regressors (instruments). This is the model (17.9) and (17.10), 
except that the latent variable $y_1^*$ is replaced by the binary variable $y_1$. An 
important difference is that while $(u, v)$ are zero-mean and jointly 
dependent, they need not be multivariate normal and homoskedastic. 


Estimation is by 2SLS, using the ivregress command. Because $y_1$ is 
binary, the error $u$ is heteroskedastic. The 2SLS estimator is then still 
consistent for $(\beta, \boldsymbol{\gamma})$, but heteroskedasticity-robust standard errors should be 
used for inference. 


The ivregress command with the vce(robust) option yields 


. * Endogenous probit using ivregress to get 2SLS estimator 
. ivregress 2sls ins $xlist2 (linc = $ivlist2), vce(robust) noheader 


                             Robust 
         ins   Coefficient  std. err.      z    P>|z|     [95% conf. interval] 

        linc     -.167901   .1937801    -0.87   0.386     -.547703    .2119011 
      female    -.0545806   .0260643    -2.09   0.036    -.1056657   -.0034955 
         age      .106631   .0624328     1.71   0.088     -.015735     .228997 
        age2    -.0008054   .0004552    -1.77   0.077    -.0016977    .0000868 
    educyear     .0416443   .0182207     2.29   0.022     .0059324    .0773562 
     married     .2511613   .1499264     1.68   0.094     -.042689    .5450116 
        hisp     -.154928   .0546479    -2.84   0.005    -.2620358   -.0478202 
       white     .0513327   .0508817     1.01   0.313    -.0483936     .151059 
     chronic     .0048689   .0103797     0.47   0.639     -.015475    .0252128 
         adl    -.0450901   .0174479    -2.58   0.010    -.0792874   -.0108928 
    hstatusg     .0858946    .041327     2.08   0.038     .0048951    .1668941 
       _cons    -3.303902   1.920872    -1.72   0.085    -7.068743    .4609388 

Instrumented: linc 
 Instruments: female age age2 educyear married hisp white chronic adl 
              hstatusg retire sretire 


. estat overid 
  Test of overidentifying restrictions: 
  Score chi2(1) = .521843 (p = 0.4701) 


This method yields a coefficient estimate of $-0.17$ for linc that is 
statistically insignificant at level 0.05, as for ivprobit. To compare 
ivregress estimates with ivprobit estimates, we need to rescale parameters 
as in section 17.4.4. Then the rescaled 2SLS parameter estimate is 
$-0.17 \times 2.5 = -0.42$, comparable with the estimates of $-0.53$ and $-0.61$ 
from the ivprobit command. 


Here the single overidentifying restriction is not rejected by the Hansen 
J test, which yields a $\chi^2(1)$ value of 0.522. 
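

The linear-model diagnostics mentioned earlier in this section can be requested after ivregress with standard postestimation commands. The following is a minimal sketch, assuming $xlist2 and $ivlist2 are defined as above; estat firststage reports first-stage statistics relevant to instrument strength, and estat endogenous tests exogeneity of linc.

```stata
* Sketch: first-stage and endogeneity diagnostics after linear 2SLS
* (assumes globals xlist2 and ivlist2 are defined as in this section)
quietly ivregress 2sls ins $xlist2 (linc = $ivlist2), vce(robust)
estat firststage
estat endogenous
```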


17.9.5 Nonlinear IV approach 


If an IV estimation approach is taken, it is better to use the following 
nonlinear IV estimator that adapts for the binary nature of the dependent 
variable. 


The linear 2SLS estimator of the model (17.11) is based on the moment 
condition $E(u | \mathbf{x}_1, \mathbf{x}_2) = 0$, where $u = y_1 - (\beta y_2 + \mathbf{x}_1'\boldsymbol{\gamma})$; see section 7.3.2. 
For a binary outcome $y_1$ modeled using the probit model, it is better to 
instead define the error term, the difference between $y_1$ and its conditional 
mean function, as $u = y_1 - \Phi(\beta y_2 + \mathbf{x}_1'\boldsymbol{\gamma})$. 


There is no Stata command to implement the subsequent nonlinear IV 
estimator, but the nonlinear IV example in section 13.3.10 for the Poisson can 
be suitably adapted, replacing $\exp(\mathbf{x}'\boldsymbol{\beta})$ with $\Phi(\mathbf{x}'\boldsymbol{\beta})$ for probit or $\Lambda(\mathbf{x}'\boldsymbol{\beta})$ 
for logit. Estimation is based on a moment condition that is not implied by 
(17.9) and (17.10), so the estimators will differ even in the limit from those 
from the ivprobit command. 


For the current overidentified example, two-step nonlinear IV or 
generalized method of moments estimates are obtained using the command 


. * Endogenous probit using nonlinear IV or generalized method of 
> moments estimation 
. gmm (ins - normal({xb:linc $xlist2 _cons})), instruments($xlist2 $ivlist2) 


(output omitted ) 


From output not given, the endogenous regressor linc has coefficient 
$-0.490$, compared with $-0.534$ and $-0.611$ from ivprobit, with robust 
standard error 0.568. The postestimation command estat overid performs 
an overidentifying restrictions test. 
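

The logit counterpart mentioned above replaces the normal cdf with the logistic cdf in the moment condition. A minimal sketch, assuming $xlist2 and $ivlist2 are defined as in this section, uses Stata's invlogit() function; no output is shown because we have not reproduced the estimates.

```stata
* Sketch: nonlinear IV (GMM) with a logit conditional mean
* (assumes globals xlist2 and ivlist2 are defined as in this section)
gmm (ins - invlogit({xb:linc $xlist2 _cons})), instruments($xlist2 $ivlist2)

* Test of the single overidentifying restriction
estat overid
```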


17.10 Grouped and fractional data 


In some applications, only grouped or aggregate data may be available, yet 
individual behavior is felt to be best modeled by a binary choice model. For 
example, we may have a frequency average taken across a sampled 
population as the dependent variable and averages of explanatory variables 
for the regressors, which we will assume to be exogenous. Such data are also 
called proportions or fractional data. 


There are several ways to proceed, some introduced in section 17.3.5. 


17.10.1 Grouped dataset 


We use the dataset of this chapter with age as the grouping variable. This 
generates 33 groups, one for each age between 52 and 86; there are no 
observations for ages 84 or 85. The numbers of cases in the 33 groups are as 
follows: 


4 5 2 2 T 8 34 62 72 #51 6l 
67 74 524 470 488 477 286 133 100 91 67 
36 29 19 1l 8 11 4 6 5 1 1 


Observations with no within-group variation are dropped; this is likely to 
occur when the group size is small. In the present sample, we drop the six 
groups with 4 or fewer observations, reducing the sample size to 27. Because 
of the small sample size, we include only two variables in the regression 
models. 


The full individual dataset of 3,206 observations can be converted to an 
aggregate dataset by using the following Stata commands, which generate 
group averages and count the number of observations (and those with 
ins==1) in each cell. 


. * Create grouped data 
. qui use mus217hrs, clear 
. bysort age: egen num_in_cell = count(ins) 
. bysort age: gen num_ins_in_cell = num_in_cell*ins 
. sort age 
. collapse n=num_in_cell r=num_ins_in_cell av_ins=ins 
>     av_educyear=educyear av_hstatusg=hstatusg, by(age) 
. drop if n < 5 
(6 observations deleted) 


. summarize 

    Variable         Obs        Mean    Std. dev.        Min         Max 
         age          27          68    8.194745          53          82 
           n          27    118.2222    168.0651           5         524 
           r          27    45.88889    69.16165           1         226 
      av_ins          27    .3446255    .1068683        .125    .6666667 
 av_educyear          27    11.28696      1.0203    8.909091        12.5 
 av_hstatusg          27    .5789595    .2262814           0    .8888889 


Here the collapse command with its default mean statistic is used to form 
averages by age. For example, collapse av_ins=ins, by(age) creates 27 
observations for the av_ins variable equal to the average of the ins variable 
for each of the 27 distinct values taken by the age variable. More generally, 
collapse can compute other statistics, such as the median by specifying the 
median statistic, and if the by() option were not used, then just a single 
observation would be produced. 


The variable n gives the number of individuals in each distinct age 
group, and the variable r gives the number of individuals with insurance in 
each distinct age group, so av_ins = r/n. 
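

As a hedged illustration of these points, the group counts can also be formed directly with the count and sum statistics of collapse, and the identity between the group mean and the ratio of insured to total cases can be checked with assert. This is an alternative sketch rather than the construction used above; the variable name med_educyear is ours.

```stata
* Sketch: alternative construction using collapse statistics
* (med_educyear is a hypothetical variable name)
quietly use mus217hrs, clear
collapse (count) n=ins (sum) r=ins (mean) av_ins=ins ///
    (median) med_educyear=educyear, by(age)

* The group mean of ins should equal insured cases over total cases
assert abs(av_ins - r/n) < 1e-6
```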


17.10.2 Comparison of various grouped estimators 


We can view the data as coming from a binomial model with, for the $g$th 
group, $r_g$ successes in $n_g$ trials; here r successes in n trials. This model can 
be fit using the glm command with options family(binomial) and, for a 
logit model, link(logit). We obtain 


. * Grouped data: Binomial logit model for number insured (= n times y) 
. glm r av_educyear av_hstatusg, family(binomial n) link(logit) noheader 
Iteration 0:   log likelihood = -66.665299 
Iteration 1:   log likelihood = -66.649602 
Iteration 2:   log likelihood = -66.649602 

                                OIM 
           r   Coefficient  std. err.      z    P>|z|     [95% conf. interval] 

 av_educyear     .2187254   .1055729     2.07   0.038     .0118062    .4256445 
 av_hstatusg     .3309241   .4161667     0.80   0.427    -.4847477    1.146596 
       _cons    -3.296212   1.047531    -3.15   0.002    -5.349335    -1.24309 


In the special case that, within each given age group, all individuals had the 
same value of the regressors, the preceding results would be identical to 
fitting a logit model on this grouped dataset expanded to have a separate 
observation for each individual. 
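

This equivalence can be illustrated by expanding the grouped dataset so that each group contributes n records, of which the first r are coded as insured. The sketch below assumes the grouped dataset created above is in memory; the variable name y is ours. The logit coefficients should agree with those from the glm command, because every expanded record in a group carries the same group-average regressors.

```stata
* Sketch: expand grouped data to individual records and fit a logit
* (assumes the grouped dataset with n, r, and the averages is in memory)
preserve
expand n
bysort age: generate byte y = (_n <= r)
logit y av_educyear av_hstatusg, nolog
restore
```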


The grouped data may be treated as proportions data, not necessarily 
arising from averaging individual binary outcomes. Then the proportion 
insured (av_ins) is modeled to have conditional mean $E(y_g | \mathbf{x}_g) = \Lambda(\mathbf{x}_g'\boldsymbol{\beta})$. 
Nonlinear least-squares estimation yields 


. * Grouped data: NLS with actual y as dependent variable 
. nl (av_ins = 1 / (1 + exp(-{xb: av_educyear av_hstatusg}+{b0}))), vce(robust) 
> nolog 

Nonlinear regression                            Number of obs =        27 
                                                R-squared     =    0.9266 
                                                Adj R-squared =    0.9174 
                                                Root MSE      =  .1035396 
                                                Res. dev.     = -49.01872 

                                Robust 
         av_ins   Coefficient  std. err.      t    P>|t|     [95% conf. interval] 

 /xb_av_educy~r     .1711613   .1173766     1.46   0.158    -.0710922    .4134148 
 /xb_av_hstat~g     .0763085   .4834848     0.16   0.876     -.921555    1.074172 
            /b0     2.625174   1.091327     2.41   0.024     .3727866    4.877561 


An alternative proportions estimator, introduced in section 17.3.5, 
additionally models the heteroskedasticity. This uses the binary outcome 
objective function (17.5). The logit and probit commands cannot be 
applied to proportions data for the simple reason that they are coded to work 
only with a dependent variable that takes binary values. Instead, we use the 
fracreg logit command (or fracreg probit for probit). 
Heteroskedasticity-robust standard errors are computed by default; other options include 
cluster-robust standard errors. We obtain 


. * Grouped data: "logit" with actual y as dependent variable 
. fracreg logit av_ins av_educyear av_hstatusg, vce(robust) noheader nolog 

                             Robust 
      av_ins   Coefficient  std. err.      z    P>|z|     [95% conf. interval] 

 av_educyear     .1579758   .1045167     1.51   0.131    -.0468731    .3628247 
 av_hstatusg     .1147869   .4343325     0.26   0.792    -.7364891     .966063 
       _cons     -2.49697   .9749101    -2.56   0.010    -4.407759   -.5861815 


In this example, there is some efficiency gain in modeling the 
heteroskedasticity, and this method is viewed as the best method for 
fractional response data. 
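

Because the fracreg coefficients are on the logit index scale, effects on the proportion itself are more easily read from average marginal effects. A minimal sketch, assuming the grouped dataset above is in memory:

```stata
* Sketch: average marginal effects on the conditional mean after fracreg
quietly fracreg logit av_ins av_educyear av_hstatusg, vce(robust)
margins, dydx(*)
```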


A final approach uses the logistic transformation to redefine the 
dependent variable to be unbounded, rather than restricted to the (0, 1) 
interval, and estimate by OLS the parameters of the model 

$$\ln\{y_g/(1 - y_g)\} = \mathbf{x}_g'\boldsymbol{\beta} + u_g$$

where $u_g$ is an error and robust standard errors are used. We obtain 


. * Grouped data: least squares with transformation of y as dependent variable 
. generate logins = log(av_ins/(1-av_ins)) 
. regress av_ins av_educyear av_hstatusg, vce(robust) noheader 

                             Robust 
      av_ins   Coefficient  std. err.      t    P>|t|     [95% conf. interval] 

 av_educyear     .0344108   .0231204     1.49   0.150    -.0133073    .0821289 
 av_hstatusg     .0287159   .0970939     0.30   0.770    -.1716761     .229108 
       _cons    -.0603931   .2151748    -0.28   0.781     -.504492    .3837058 


A limitation of this method is that we are generally interested in explaining 
$E(y_g | \mathbf{x}_g)$ rather than $E[\ln\{y_g/(1 - y_g)\} | \mathbf{x}_g]$. 
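

Note that the regress command as printed above lists av_ins, in levels, as the dependent variable, so the reported coefficients appear to be on the probability scale. A sketch of the regression with the transformed variable, as the description of this method suggests, would be the following; we do not show output because we have not reproduced it.

```stata
* Sketch: OLS with the logit-transformed proportion as dependent variable
* (logins is missing when av_ins equals 0 or 1)
capture drop logins
generate double logins = log(av_ins/(1 - av_ins))
regress logins av_educyear av_hstatusg, vce(robust) noheader
```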


17.11 Additional resources 


The logit and probit models are the most commonly used nonlinear models, 
and for many purposes, the logit and probit commands are sufficient. For 
these nonlinear models, it is common to report the AME (obtained using the 
margins, dydx() command) in place of, or in addition to, parameter 
estimates. The most common complications for binary outcome models are 
endogenous regressors (see section 17.9) and panel data (see section 22.4). 


17.12 Exercises 


1. Consider the example of section 17.4 with dependent variable ins and 
the single regressor educyear. Estimate the parameters of logit, probit, 
and OLS models using both default and robust standard errors. For the 
regressor educyear, compare its coefficient across the models, 
compare default and robust standard errors of this coefficient, and 
compare the t statistics based on robust standard errors. For each 
model, compute the ME of one more year of education for someone 
with sample mean years of education, as well as the AME. Which 
model fits the data better—logit or probit? 

2. Use the cloglog command to estimate the parameters of the binary 
probability model for ins with the same explanatory variables used in 
the logit model in this chapter. Estimate the AME for the regressors. 
Calculate the odds ratios of ins=1 for the following values of the 
covariates: age=50, retire=0, hstatusg=1, hhincome=45, 
educyear=12, married=1, and hisp=0. 

3. Generate a graph of fitted probabilities against years of education 
(educyear) or age (age) using as a template the commands used for 
generating figure 17.1 in this chapter. 

4. Estimate the parameters of the logit model of section 17.4.2. Now, 
estimate the parameters of the probit model using the probit 
command. Use the reported log likelihoods to compare the models by 
Akaike's information criterion and BIC. 

5. Estimate the probit regression of section 17.4.4. Using the 
conditioning values (age=65, retire=1, hstatusg=1, hhincome=60, 
educyear=17, married=1, hisp=0), estimate and compare the ME of 
age on the Pr(ins=1 |x), using both the margins and prchange 
commands. They should give the same result. 

6. Using the hetprobit command, estimate the parameters of the model 
of section 17.4, using hhincome as the sole variable determining the 
variance. Test the null hypothesis of homoskedastic probit. 

7. Using the example in section 17.10 as a template, estimate a grouped 
logistic regression using educyear as the grouping variable. Comment 
on what you regard as unsatisfactory features of the grouping variable 
and the results. 


Chapter 18 
Multinomial models 


18.1 Introduction 


Categorical data are data on a dependent variable that can fall into one of 
several mutually exclusive categories. Examples include different ways to 
commute to work (by car, by bus, or on foot) and different categories of 
self-assessed health status (excellent, good, fair, or poor). 


The econometrics literature focuses on modeling a single outcome from 
categories that are mutually exclusive, where the dependent variable 
outcome must be multinomial distributed, just as binary data must be 
Bernoulli or binomial distributed. Analysis is not straightforward, however, 
because there are many different models for the probabilities of the 
multinomial distribution. These models vary according to whether the 
categories are ordered or unordered, whether some of the individual- 
specific regressors vary across the alternative categories, and, in some 
settings, whether the model is consistent with utility maximization. 
Furthermore, parameter coefficients for any given model can be difficult to 
directly interpret. The marginal effects (MEs) of interest measure the impact 
on the probability of observing each of several outcomes rather than the 
impact on a single conditional mean. 


We begin with models for unordered outcomes, notably, multinomial 
logit (MNL), conditional logit (CL), nested logit (NL), multinomial probit 
(MNP), and alternative-specific random parameter logit models. We then 
move to models for ordered outcomes, such as health-status measures, and 
models for multivariate multinomial outcomes. 


18.2 Multinomial models overview 


We provide a general discussion of multinomial regression models. 
Subsequent sections detail the most commonly used multinomial regression 
models that correspond to particular functional forms for the probabilities of 
each alternative. 


18.2.1 Probabilities and MEs 


The outcome, $y_i$, for individual $i$ is one of $m$ alternatives. We set $y_i = j$ if 
the outcome is the $j$th alternative, $j = 1, 2, \ldots, m$. The values $1, 2, \ldots, m$ 
are arbitrary, and the same regression results are obtained if, for example, we 
use values 3, 5, 8, .... The ordering of the values also does not matter, unless 
an ordered model (presented in section 18.9) is used. 


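The probability that the outcome for individual $i$ is alternative $j$, 
conditional on the regressors $\mathbf{x}_i$, is 

$$p_{ij} = \Pr(y_i = j) = F_j(\mathbf{x}_i, \boldsymbol{\theta}), \quad j = 1, \ldots, m \tag{18.1}$$

where different functional forms, $F_j(\cdot)$, correspond to different multinomial 
models. Only $m - 1$ of the probabilities can be freely specified because 
probabilities sum to one. For example, $F_m(\mathbf{x}_i, \boldsymbol{\theta}) = 1 - \sum_{j=1}^{m-1} F_j(\mathbf{x}_i, \boldsymbol{\theta})$. 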
Multinomial models therefore require a normalization. Some Stata 
multinomial commands, including cmclogit, permit different individuals to 
face different choice sets so that, for example, an individual might be 
choosing only from among alternatives 1, 3, and 4. 


The parameters of multinomial models are generally not directly 
interpretable. In particular, a positive coefficient need not mean that an 
increase in the regressor leads to an increase in the probability of an outcome 
being selected. Instead, we compute MEs. For individual i, the ME of a change 
in the kth regressor on the probability that alternative j is the outcome is 


$$\mathrm{ME}_{ijk} = \frac{\partial \Pr(y_i = j)}{\partial x_{ik}} = \frac{\partial F_j(\mathbf{x}_i, \boldsymbol{\theta})}{\partial x_{ik}}$$


For each regressor, there will be m MEs corresponding to the m probabilities, 
and these m MEs sum to zero because probabilities sum to one. As for other 
nonlinear models, these MEs vary with the evaluation point x. 
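

The adding-up property of the MEs follows in one line from the fact that the $m$ probabilities sum to one for every value of the regressors. Using only the notation already introduced for the probabilities $F_j(\mathbf{x}_i, \boldsymbol{\theta})$:

```latex
\sum_{j=1}^{m} \frac{\partial F_j(\mathbf{x}_i, \boldsymbol{\theta})}{\partial x_{ik}}
  = \frac{\partial}{\partial x_{ik}} \sum_{j=1}^{m} F_j(\mathbf{x}_i, \boldsymbol{\theta})
  = \frac{\partial}{\partial x_{ik}} (1)
  = 0
```

So if an increase in a regressor raises the probability of one alternative, it must lower the probability of at least one other alternative.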


18.2.2 Maximum likelihood estimation 


Estimation is by maximum likelihood (ML). We use a convenient form for 
the density that generalizes the method used for binary outcome models. The 
density for the $i$th individual is written as 

$$f(y_i) = p_{i1}^{y_{i1}} \times \cdots \times p_{im}^{y_{im}} = \prod_{j=1}^{m} p_{ij}^{y_{ij}}$$

where $y_{i1}, \ldots, y_{im}$ are $m$ indicator variables with $y_{ij} = 1$ if $y_i = j$ and 
$y_{ij} = 0$ otherwise. For each individual, exactly one of $y_{i1}, y_{i2}, \ldots, y_{im}$ will be 
nonzero. For example, if $y_i = 3$, then $y_{i3} = 1$, the other $y_{ij} = 0$, and upon 
simplification, $f(y_i) = p_{i3}$, as expected. 


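The likelihood function for a sample of $N$ independent observations is 
the product of the $N$ densities, so $L = \prod_{i=1}^{N} \prod_{j=1}^{m} p_{ij}^{y_{ij}}$. The maximum 
likelihood estimator (MLE), $\hat{\boldsymbol{\theta}}$, maximizes the log-likelihood function 

$$\ln L(\boldsymbol{\theta}) = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln F_j(\mathbf{x}_i, \boldsymbol{\theta}) \tag{18.2}$$

and as usual, default standard errors are based on 
$\hat{\boldsymbol{\theta}} \overset{a}{\sim} N(\boldsymbol{\theta}, [-E\{\partial^2 \ln L(\boldsymbol{\theta})/\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}'\}]^{-1})$. 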


For multinomial models, the pseudo-$R^2$ has a meaningful interpretation; 
see section 13.8. Nonnested models can be compared by using the Akaike 
information criterion and related measures, assuming correct model 
specification. 


For multinomial data, the key is specification of $F_j(\mathbf{x}_i, \boldsymbol{\theta})$. Various 
models for $F_j(\cdot)$ are presented in this chapter, with the suitability of a 
particular model depending on the application at hand. 


18.2.3 Robust standard errors 


For categorical data, the distribution is necessarily multinomial. Similar to 
the case for binary outcomes, in theory there is no advantage in using the 
robust sandwich form for the variance-covariance matrix of the estimator 
(VCE) of the MLE in the special case that data are independent over $i$ and 
$F_j(\mathbf{x}_i, \boldsymbol{\theta})$ is correctly specified. 


In practice, $F_j(\mathbf{x}_i, \boldsymbol{\theta})$ is likely to be misspecified. Then quasi-ML theory 
applies (see section 13.3.1), and we need to use robust standard errors that 
provide consistent estimates of the variance of $\hat{\boldsymbol{\theta}}$. So we generally obtain 
robust standard errors. For independent observations, we use the 
vce(robust) option, and for observations that are dependent because of 
clustering, we use the vce(cluster clustvar) option. 
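

As a minimal sketch of these two options, consider a multinomial logit of fishing mode on income using the dataset mus218hk.dta described in section 18.3. The choice of model and regressor is only for illustration, and clustid is a hypothetical cluster identifier that is not in that dataset.

```stata
* Sketch: robust standard errors for a multinomial model
quietly use mus218hk, clear
mlogit mode income, vce(robust) nolog

* With clustered observations, and a hypothetical cluster identifier clustid,
* one would instead specify
*   mlogit mode income, vce(cluster clustid) nolog
```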


18.2.4 Case-specific and alternative-specific regressors 


Some regressors, such as gender, do not vary across alternatives and are 
called case-specific or alternative-invariant regressors. Other regressors, 
such as price, may vary across alternatives and are called alternative-specific 
or case-varying regressors. 


The commands used for multinomial model estimation can vary 
according to the form of the regressors. In the simplest case, all regressors 
are case specific, and we use the mlogit command. 


In more complicated applications, some or all the regressors are 
alternative specific. Then one uses cm commands (choice-model commands) 
introduced in Stata 16. For example, the cmclogit command supplants the 
asclogit command. 


These different types of commands require data to be organized in 
different ways; see section 18.5.1. 
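

To make the distinction concrete, the following sketch converts the fishing-mode data of section 18.3 from wide form (one record per individual) to long form (one record per individual and alternative). The identifier id and the alternatives variable fishmode are names chosen here for illustration; the stubs d, p, and q correspond to the variables dbeach, pbeach, qbeach, and so on. Listing the four alternatives explicitly in j() keeps reshape from treating other variables that begin with p, such as price, as part of the p stub.

```stata
* Sketch: reshape the fishing-mode data from wide form to long form
* (id and fishmode are names chosen for illustration)
quietly use mus218hk, clear
generate id = _n
reshape long d p q, i(id) j(fishmode beach pier private charter) string
list id fishmode d p q income in 1/4, clean
```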


18.2.5 Additive random-utility model 


For unordered multinomial outcomes that arise from individual choice, 
econometricians favor models that come from utility maximization. This 
leads to multinomial models that are used much less in other branches of 
applied statistics. 


For individual $i$ and alternative $j$, we suppose that utility $U_{ij}$ is the sum 
of a deterministic component, $V_{ij}$, that depends on regressors and unknown 
parameters and an unobserved random component $\varepsilon_{ij}$: 

$$U_{ij} = V_{ij} + \varepsilon_{ij} \tag{18.3}$$

This is called an additive random-utility model (ARUM). We observe the 
outcome $y_i = j$ if alternative $j$ has the highest utility of the alternatives. It 
follows that 

$$
\begin{aligned}
\Pr(y_i = j) &= \Pr(U_{ij} \geq U_{ik}), \quad \text{for all } k \\
&= \Pr(U_{ik} - U_{ij} \leq 0), \quad \text{for all } k \\
&= \Pr(\varepsilon_{ik} - \varepsilon_{ij} \leq V_{ij} - V_{ik}), \quad \text{for all } k
\end{aligned}
\tag{18.4}
$$

Standard multinomial models specify that $V_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_i'\boldsymbol{\gamma}_j$, where $\mathbf{x}_{ij}$ are 
alternative-specific regressors and $\mathbf{z}_i$ are case-specific regressors. Different 
assumptions about the joint distribution of $\varepsilon_{i1}, \ldots, \varepsilon_{im}$ lead to different 
multinomial models with different specifications for $F_j(\mathbf{x}_i, \boldsymbol{\theta})$ in (18.1). 
Because the outcome probabilities depend on the difference in errors, only 
$m - 1$ of the errors are free to vary, and similarly, only $m - 1$ of the $\boldsymbol{\gamma}_j$ are 
free to vary. 


18.2.6 Stata multinomial model commands 


Table 18.1 summarizes Stata commands for the estimation of multinomial 
models. 


Table 18.1. Commands for multinomial models 

Model                               Command 

Multinomial logit                   mlogit, xtmlogit 
Multinomial probit                  mprobit 
Nested logit                        nlogit 
Alternative-specific logit          cmclogit (formerly asclogit), clogit 
Alternative-specific probit         cmmprobit (formerly asmprobit) 
Alternative-specific mixed logit    cmmixlogit (formerly asmixlogit) 
Ordered logit and probit            ologit, oprobit, hetoprobit, xtologit, 
                                    xtoprobit, ziologit, zioprobit 
Ordered multilevel mixed            meologit, meoprobit 
Rank-ordered logit                  cmrologit (formerly rologit) 
Rank-ordered probit                 cmroprobit (formerly asroprobit) 
Stereotype logit                    slogit 
Bivariate probit                    biprobit 
Panel mixed logit                   cmxtmixlogit 


The nlogit, cmclogit, clogit, cmmixlogit, cmmprobit, cmrologit, 
cmroprobit, and cmxtmixlogit commands require the data to 
be in long form. The remaining commands expect data to be in wide form. 
The independent variable lists for most but not all of these estimation 
commands allow factor variables. The margins postestimation command is 
available after all but the nlogit and cmrologit commands. 


18.3 Multinomial example: Choice of fishing mode 


We analyze data on individual choice of whether to fish using one of four 
possible modes: from the beach, the pier, a private boat, or a charter boat. 
One explanatory variable is case specific (income), and the others (price 
and crate [catch rate]) are alternative specific. 


18.3.1 Data description 


The data from Herriges and Kling (1999) are also analyzed in Cameron and 
Trivedi (2005, chap. 15). mus218hk.dta has the following data: 


. * Read in dataset, and describe dependent variable and regressors
. qui use mus218hk

. describe

Contains data from mus218hk.dta
 Observations:   1,182     A.C.Cameron & P.K.Trivedi (2022):
                           Microeconometrics Using Stata, 2e
    Variables:      16     1 Sep 2020 16:38

Variable    Storage   Display   Value
    name       type    format   label      Variable label
------------------------------------------------------------------------------
mode          float    %9.0g    modetype   Fishing mode
price         float    %9.0g               Price for chosen alternative
crate         float    %9.0g               Catch rate for chosen alternative
dbeach        float    %9.0g               Beach mode chosen
dpier         float    %9.0g               Pier mode chosen
dprivate      float    %9.0g               Private boat mode chosen
dcharter      float    %9.0g               Charter boat mode chosen
pbeach        float    %9.0g               Price for beach mode
ppier         float    %9.0g               Price for pier mode
pprivate      float    %9.0g               Price for private boat mode
pcharter      float    %9.0g               Price for charter boat mode
qbeach        float    %9.0g               Catch rate for beach mode
qpier         float    %9.0g               Catch rate for pier mode
qprivate      float    %9.0g               Catch rate for private boat mode
qcharter      float    %9.0g               Catch rate for charter boat mode
income        float    %9.0g               Monthly income in thousands $
------------------------------------------------------------------------------
Sorted by:


There are 1,182 observations, one per individual. The first three variables are 
for the chosen fishing mode with the variables mode, price, and crate 
being, respectively, the chosen fishing mode and the price and catch rate for 
that mode. The next four variables are mutually exclusive dummy variables 
for the chosen mode, taking on a value of 1 if that alternative is chosen and a 
value of 0 otherwise. The next eight variables are alternative-specific 
variables that contain the price and catch rate for each of the four possible 
fishing modes (the prefix q stands for quality; a higher catch rate implies a 
higher quality of fishing). These variables are constructed from individual 
surveys that ask not only about attributes of the chosen fishing mode but also 
about attributes of alternative fishing modes such as location that allow for 
determination of price and catch rate. The final variable, income, is a case- 
specific variable. The summary statistics follow: 


. * Summarize dependent variable and regressors
. summarize, separator(0)

    Variable        Obs        Mean    Std. dev.        Min        Max
------------------------------------------------------------------------
        mode      1,182    3.005076    .9936162           1          4
       price      1,182    52.08197    53.82997        1.29     666.11
       crate      1,182    .3893684    .5605964       .0002     2.3101
      dbeach      1,182    .1133672    .3171753           0          1
       dpier      1,182    .1505922    .3578023           0          1
    dprivate      1,182    .3536379    .4783008           0          1
    dcharter      1,182    .3824027    .4861799           0          1
      pbeach      1,182     103.422     103.641        1.29    843.186
       ppier      1,182     103.422     103.641        1.29    843.186
    pprivate      1,182    55.25657    62.71344        2.29     666.11
    pcharter      1,182    84.37924    63.54465       27.29     691.11
      qbeach      1,182    .2410113    .1907524       .0678      .5333
       qpier      1,182    .1622237    .1603898       .0014      .4522
    qprivate      1,182    .1712146    .2097885       .0002      .7369
    qcharter      1,182    .6293679    .7061142       .0021     2.3101
      income      1,182    4.099337    2.461964    .4166667       12.5


The variable mode takes on the values ranging from 1 to 4. On average, 
private and charter boat fishing are less expensive than beach and pier 
fishing. Beach and pier fishing, both close to shore with similar costs, have 
identical prices. The catch rate for charter boat fishing is substantially higher 
than for the other modes. 


The tabulate command gives the various values and frequencies of the 
mode variable. We have 


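. * Tabulate the dependent variable
. tabulate mode

    Fishing
       mode        Freq.     Percent        Cum.
--------------------------------------------------
      Beach          134       11.34       11.34
       Pier          178       15.06       26.40
    Private          418       35.36       61.76
    Charter          452       38.24      100.00
--------------------------------------------------
      Total        1,182      100.00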


The shares are roughly one-third fish from the shore (either beach or pier), 
one-third fish from a private boat, and one-third fish from a charter boat. 
These shares are the same as the means of dbeach, ..., dcharter given in the 
summarize table. The mode variable takes on a value from 1 to 4 (see the 
summary statistics), but the output of describe has a label, modetype, that
labels 1 as Beach, ..., 4 as Charter. This labeling can be verified by using 
the label list command. There is no obvious ordering of the fishing 
modes, so unordered multinomial models should be used to explain fishing- 
mode choice. 
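The check of the value label mentioned above can be run as follows. This is a short sketch that assumes mus218hk.dta is in memory; the nolabel tabulation is added here only to show the underlying numeric codes.

```stata
* Show the mapping from numeric codes to labels for the mode variable
label list modetype
* Tabulate the underlying numeric values rather than the labels
tabulate mode, nolabel
```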


18.3.2 Case-specific regressors 


Before formal modeling, it is useful to summarize the relationship between
the dependent variable and the regressors. This is more difficult when the
dependent variable is unordered.


For the case-specific income variable, we could use the bysort 
mode: summarize income command. More compact output is obtained by 
instead using the table command. We obtain 


. * Table of income by fishing mode
. table mode, stat(count income) stat(mean income) stat(sd income)

                Number of nonmissing values       Mean   Standard deviation
------------------------------------------------------------------------------
Fishing mode
  Beach                                 134   4.051617              2.50542
  Pier                                  178   3.387172             2.340324
  Private                               418   4.654107             2.777898
  Charter                               452     3.8809             2.050028
  Total                               1,182   4.099337             2.461964


On average, those fishing from the pier have the lowest income and those 
fishing from a private boat have the highest. 


18.3.3 Alternative-specific regressors 


The relationship between the chosen fishing mode and the alternative- 
specific regressor price is best summarized as follows: 


. * Table of fishing price by fishing mode
. table (result) mode, stat(mean pbeach ppier pprivate pcharter) nformat(%6.0f)

                                          Fishing mode
                                Beach   Pier   Private   Charter
------------------------------------------------------------------
Price for beach mode               36     31       138       121
Price for pier mode                36     31       138       121
Price for private boat mode        98     82        42        45
Price for charter boat mode       125    110        71        75


On average, individuals tend to choose the fishing mode that is the cheapest 
or second-cheapest alternative available for them. For example, for those 
choosing private, on average, the price of private boat fishing is 42, 
compared with 71 for charter boat fishing and 138 for beach or pier fishing. 


Similarly, for the catch rate, we have 


. * Table of fishing catch rate by fishing mode
. table (result) mode, stat(mean qbeach qpier qprivate qcharter) nformat(%6.2f)

                                               Fishing mode
                                     Beach   Pier   Private   Charter   Total
------------------------------------------------------------------------------
Catch rate for beach mode             0.28   0.26      0.21      0.25    0.24
Catch rate for pier mode              0.22   0.20      0.13      0.16    0.16
Catch rate for private boat mode      0.16   0.15      0.18      0.18    0.17
Catch rate for charter boat mode      0.52   0.50      0.65      0.69    0.63


The chosen fishing mode is not on average that with the highest catch rate. 
In particular, the catch rate is always highest on average for charter fishing, 
regardless of the chosen mode. Regression analysis can measure the effect of 
the catch rate after controlling for the price of the fishing mode. 


18.4 Multinomial logit model 


Many multinomial studies are based on datasets that have only case-specific 
variables because explanatory variables are typically observed only for the 
chosen alternative and not for the other alternatives. The simplest model is 
the MNL model because computation is simple and parameter estimates are 
easier to interpret than in some other multinomial models. 


18.4.1 The mlogit command 


The MNL model can be used when all the regressors are case specific. The 
MNL model specifies that 


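$$p_{ij} = \frac{\exp(\mathbf{x}_i'\boldsymbol{\beta}_j)}{\sum_{l=1}^{m} \exp(\mathbf{x}_i'\boldsymbol{\beta}_l)}, \quad j = 1, \ldots, m \tag{18.5}$$

where $\mathbf{x}_i$ are case-specific regressors, here an intercept and income. Clearly,
this model ensures that $0 < p_{ij} < 1$ and $\sum_{j=1}^{m} p_{ij} = 1$. To ensure model
identification, $\boldsymbol{\beta}_j$ is set to zero for one of the categories, and coefficients are
then interpreted with respect to that category, called the base category.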


The mlogit command has the syntax

mlogit depvar [indepvars] [if] [in] [weight] [, options]

where indepvars are the case-specific regressors and the default is to
automatically include an intercept. The baseoutcome(#) option specifies the
value of depvar to be used as the base category, overriding the Stata default
of setting the most frequently chosen category as the base category. Other
options include rrr to report exponentiated coefficients ($e^{\hat{\beta}}$ rather than $\hat{\beta}$).


The mlogit command requires that data be in wide form, with one 
observation per individual. This is the case here. 


18.4.2 Application of the mlogit command 


We regress fishing mode on an intercept and income, the only case-specific 
regressor in our dataset. There is no natural base category. The first category, 
beach fishing, is arbitrarily set to be the base category. We obtain 


. * Multinomial logit with base outcome alternative 1
. mlogit mode income, baseoutcome(1) nolog vce(robust)

Multinomial logistic regression                 Number of obs =  1,182
                                                Wald chi2(3)  =  34.13
                                                Prob > chi2   = 0.0000
Log pseudolikelihood = -1477.1506               Pseudo R2     = 0.0137

                              Robust
        mode   Coefficient   std. err.       z    P>|z|    [95% conf. interval]
------------------------------------------------------------------------------
Beach          (base outcome)
Pier
      income    -.1434029    .0608337    -2.36   0.018    -.2626348    -.024171
       _cons     .8141503    .2506076     3.25   0.001     .3229684    1.305332
Private
      income     .0919064    .0421603     2.18   0.029     .0092737     .174539
       _cons     .7389208    .2017495     3.66   0.000      .343499    1.134343
Charter
      income    -.0316399    .0424705    -0.74   0.456    -.1148805    .0516008
       _cons     1.341291    .1971834     6.80   0.000     .9548192    1.727764


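The model fit is poor with pseudo-$R^2$, defined in section 13.8.1, equal to
0.014. Three sets of regression estimates are given, corresponding here to
$\hat{\boldsymbol{\beta}}_2$, $\hat{\boldsymbol{\beta}}_3$, and $\hat{\boldsymbol{\beta}}_4$ because we used the normalization $\boldsymbol{\beta}_1 = \mathbf{0}$.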


Of the three coefficient estimates of income, two are statistically 
significant at the 0.05 level, but the results of such individual testing will 
vary with the omitted category. Instead, we should perform a joint test. 
Using a Wald test, we obtain 


. * Wald test of the joint significance of income
. test income

 ( 1)  [Beach]o.income = 0
 ( 2)  [Pier]income = 0
 ( 3)  [Private]income = 0
 ( 4)  [Charter]income = 0
       Constraint 1 dropped

           chi2(  3) =   34.13
         Prob > chi2 =   0.0000


Income is clearly highly statistically significant. Note that this test is
identical to the overall test Wald chi2(3) = 34.13, which was included in the
original regression output.
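The contrast between individual and joint tests can be checked directly. The following sketch assumes mus218hk.dta is still in memory in wide form; it refits the same model with charter fishing as an arbitrarily chosen alternative base category, repeats the joint test, and then tests a pairwise restriction that does not involve the base category.

```stata
* Joint Wald test after changing the base category to charter (mode = 4)
quietly mlogit mode income, baseoutcome(4) vce(robust)
test income
* Return to beach as the base category, as in the text
quietly mlogit mode income, baseoutcome(1) vce(robust)
* Test equality of the income coefficients for pier and private
test [Pier]income = [Private]income
```

Because the joint hypothesis is the same under either normalization, the first test should reproduce chi2(3) = 34.13, whereas the individual z statistics in the coefficient table change with the base category.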


18.4.3 Coefficient interpretation 


Coefficients in a multinomial model can be interpreted in the same way as 
binary logit model parameters are interpreted, with comparison being with 
the base category. 


This is a result of the MNL model being equivalent to a series of pairwise 
logit models. For simplicity, we set the base category to be the first category. 
Then the MNL model defined in (18.5) implies that 


$$\Pr(y_i = j \mid y_i = j \text{ or } 1) = \frac{\Pr(y_i = j)}{\Pr(y_i = j) + \Pr(y_i = 1)} = \frac{\exp(\mathbf{x}_i'\boldsymbol{\beta}_j)}{1 + \exp(\mathbf{x}_i'\boldsymbol{\beta}_j)}$$

using $\boldsymbol{\beta}_1 = \mathbf{0}$ and cancellation of $\sum_{l=1}^{m} \exp(\mathbf{x}_i'\boldsymbol{\beta}_l)$ in the numerator and
denominator.

Thus, $\boldsymbol{\beta}_j$ can be viewed as parameters of a binary logit model between
alternative $j$ and alternative 1. So a positive coefficient from mlogit means
that as the regressor increases, we are more likely to choose alternative $j$
than alternative 1. This interpretation will vary with the base category and is
clearly most useful when there is a natural base category.
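Comparisons between two non-base alternatives can be recovered from the existing estimates without refitting. The following sketch, which assumes the wide-form data are in memory, uses lincom to form the difference in the income coefficients for private and pier fishing, and then verifies that this equals the coefficient reported when pier is the base category.

```stata
* Private versus pier from the model with beach as base category
quietly mlogit mode income, baseoutcome(1) vce(robust)
lincom [Private]income - [Pier]income
* Same quantity as a coefficient when pier (mode = 2) is the base category
quietly mlogit mode income, baseoutcome(2) vce(robust)
display _b[Private:income]
* Return to beach as the base category, as in the text
quietly mlogit mode income, baseoutcome(1) vce(robust)
```

From the reported estimates, the difference is 0.0919 - (-0.1434) = 0.2353, so higher income makes private boat fishing more likely relative to pier fishing.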


Some researchers find it helpful to transform to odds ratios or relative- 
risk ratios, as in the binary logit case. The odds ratio or relative-risk ratio of 
choosing alternative j rather than alternative 1 is given by 


$$\frac{\Pr(y_i = j)}{\Pr(y_i = 1)} = \exp(\mathbf{x}_i'\boldsymbol{\beta}_j) \tag{18.6}$$

so $e^{\beta_{jr}}$ gives the proportionate change in the relative risk of choosing
alternative $j$ rather than alternative 1 when $x_{ir}$ changes by one unit.


The rrr option of mlogit provides coefficient estimates transformed to 
relative-risk ratios. We have 


. * Relative-risk option reports exp(b) rather than b
. mlogit mode income, rrr baseoutcome(1) nolog vce(robust)

Multinomial logistic regression                 Number of obs =  1,182
                                                Wald chi2(3)  =  34.13
                                                Prob > chi2   = 0.0000
Log pseudolikelihood = -1477.1506               Pseudo R2     = 0.0137

                              Robust
        mode          RRR    std. err.       z    P>|z|    [95% conf. interval]
------------------------------------------------------------------------------
Beach          (base outcome)
Pier
      income     .8664049    .0527066    -2.36   0.018     .7690227    .9761187
       _cons     2.257257    .5656857     3.25   0.001     1.381222    3.688914
Private
      income     1.096262    .0462187     2.18   0.029     1.009317    1.190697
       _cons     2.093675    .4223978     3.66   0.000     1.409872    3.109129
Charter
      income     .9688554    .0411478    -0.74   0.456     .8914726    1.052955
       _cons     3.823979     .754025     6.80   0.000     2.598201    5.628054
------------------------------------------------------------------------------
Note: _cons estimates baseline relative risk for each outcome.


. estimates store MNL 


Thus, a one-unit increase in income, corresponding to a $1,000 monthly
increase, leads to relative odds of choosing to fish from a pier rather than
the beach that are 0.866 times what the relative odds were before the change;
so the relative odds have declined. The original coefficient of income for the
alternative pier was $-0.1434$, and $e^{-0.1434} = 0.8664$.
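The link between the two sets of output can be verified from the stored coefficients. This is a short sketch assuming the wide-form data are in memory; the rrr option changes only how results are displayed, so _b[] still holds the untransformed coefficients.

```stata
* Exponentiate the stored income coefficient for the pier equation
quietly mlogit mode income, baseoutcome(1) vce(robust)
display exp(_b[Pier:income])
* Same transformation with a confidence interval
lincom [Pier]income, rrr
```

Both commands should return a value of approximately 0.8664, matching the RRR reported for income in the pier equation.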


18.4.4 Predicted probabilities 


After most estimation commands, the predict command creates one 
variable. After mlogit, however, m variables are created, where m is the 
number of alternatives. Predicted probabilities for each alternative are 
obtained by using the pr option of predict. 


Here we obtain four predicted probabilities because there are four 
alternatives. We have 


. * Predict probabilities of choice of each mode, and compare to actual freqs
. predict pmlogit1 pmlogit2 pmlogit3 pmlogit4, pr

. summarize pmlogit* dbeach dpier dprivate dcharter, separator(4)

    Variable        Obs        Mean    Std. dev.        Min        Max
------------------------------------------------------------------------
    pmlogit1      1,182    .1133672    .0036716    .0947395   .1153659
    pmlogit2      1,182    .1505922    .0444575    .0356142   .2342903
    pmlogit3      1,182    .3536379    .0797714    .2396973    .625706
    pmlogit4      1,182    .3824027    .0346281    .2439403   .4158273
------------------------------------------------------------------------
      dbeach      1,182    .1133672    .3171753           0          1
       dpier      1,182    .1505922    .3578023           0          1
    dprivate      1,182    .3536379    .4783008           0          1
    dcharter      1,182    .3824027    .4861799           0          1


Note that the sample average predicted probabilities equal the observed 
sample frequencies. This is always the case for MNL models that include an 
intercept, generalizing the similar result for binary logit models. 


The ideal multinomial model will predict perfectly. For example, p1 
ideally would take on a value of 1 for the 134 observations with y = 1 and 
would take on a value of 0 for the remaining observations. Here p1 ranges 
only from 0.0947 to 0.1154, so the model with income as the only 
explanatory variable predicts beach fishing very poorly. There is 
considerably more variation in predicted probabilities for the other three 
alternatives. 
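To see how these predictions follow from (18.5), the probabilities can be computed by hand from the estimated coefficients. The following is a minimal sketch assuming the wide-form data are in memory; the variable names xb2 to xb4, denom, and phand1 to phand4 are introduced here for illustration and are dropped at the end so that later commands are unaffected.

```stata
* Predicted probabilities from (18.5) with beach as base (beta_1 = 0)
quietly mlogit mode income, baseoutcome(1) vce(robust)
generate double xb2 = _b[Pier:_cons]    + _b[Pier:income]*income
generate double xb3 = _b[Private:_cons] + _b[Private:income]*income
generate double xb4 = _b[Charter:_cons] + _b[Charter:income]*income
* Denominator: exp(0) = 1 for the base category plus the other three terms
generate double denom = 1 + exp(xb2) + exp(xb3) + exp(xb4)
generate double phand1 = 1/denom
generate double phand2 = exp(xb2)/denom
generate double phand3 = exp(xb3)/denom
generate double phand4 = exp(xb4)/denom
summarize phand1 phand2 phand3 phand4
* Remove the illustrative variables
drop xb2 xb3 xb4 denom phand1 phand2 phand3 phand4
```

The summary statistics should agree with those for the variables created by predict, up to the float precision used there.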


The margins command (see section 13.7) can be used to compute the 
average predicted probability of a given outcome, along with an associated 
confidence interval. For example, for the third outcome, we have 


. * Sample average predicted probability of the third outcome
. margins, predict(outcome(3)) noatlegend

Predictive margins                              Number of obs = 1,182
Model VCE: Robust

Expression: Pr(mode==Private), predict(outcome(3))

                         Delta-method
                  Margin   std. err.       z    P>|z|    [95% conf. interval]
------------------------------------------------------------------------------
       _cons    .3536379    .0137189    25.78   0.000     .3267493    .3805265


18.4.5 MEs 


For an unordered multinomial model, there is no single conditional mean of 
the dependent variable, y. Instead there are m alternatives, and we model the 
probabilities of these alternatives. Interest lies in how these probabilities 
change as regressors change. 


For the MNL model, the MEs can be shown to be 


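$$\frac{\partial p_{ij}}{\partial \mathbf{x}_i} = p_{ij}(\boldsymbol{\beta}_j - \bar{\boldsymbol{\beta}}_i)$$

where $\bar{\boldsymbol{\beta}}_i = \sum_{l} p_{il}\boldsymbol{\beta}_l$ is a probability-weighted average of the $\boldsymbol{\beta}_l$. The MEs
vary with the point of evaluation, $\mathbf{x}_i$, because $p_{ij}$ varies with $\mathbf{x}_i$. The signs of
the regression coefficients do not give the signs of the MEs. For a variable $x_{ir}$,
the ME is positive if $\beta_{jr} > \bar{\beta}_{ir}$.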


The margins, dydx() command calculates the average marginal effect 
(AME), the marginal effect at the mean (MEM), and the marginal effect at 
representative values (MER). By default, the MEs are computed for all 
outcomes. In the following examples, we compute MEs for a single outcome. 


For example, to obtain the AME on Pr(y = 3) of a change in income, we
use predict(outcome(3)). We obtain

. * AME of income change for outcome 3
. margins, dydx(*) predict(outcome(3)) noatlegend

Average marginal effects                        Number of obs = 1,182
Model VCE: Robust

Expression: Pr(mode==Private), predict(outcome(3))
dy/dx wrt:  income

                         Delta-method
                   dy/dx   std. err.       z    P>|z|    [95% conf. interval]
------------------------------------------------------------------------------
      income    .0317562    .0052485     6.05   0.000     .0214694    .0420429


Averaged across all individuals, a one-unit change in income, equivalent to a 
$1,000 increase in monthly income, increases by 0.0318 the probability of 
fishing from a private boat rather than from a beach, pier, or charter boat. 
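The AME can also be obtained directly from the ME formula for the MNL model, which makes clear what margins is averaging. The following is a sketch assuming the wide-form data are in memory; the variable names ph1 to ph4, bbar, and me3 are introduced here for illustration only.

```stata
* AME of income on Pr(mode = Private) computed from the ME formula
quietly mlogit mode income, baseoutcome(1) vce(robust)
predict double ph1 ph2 ph3 ph4, pr
* Probability-weighted average of income coefficients (beach coefficient is 0)
generate double bbar = ph2*_b[Pier:income] + ph3*_b[Private:income] ///
    + ph4*_b[Charter:income]
* Individual-level marginal effect for the third outcome
generate double me3 = ph3*(_b[Private:income] - bbar)
summarize me3
drop ph1 ph2 ph3 ph4 bbar me3
```

The mean of me3 should match the AME of 0.0318 reported by margins. This hand calculation does not provide the delta-method standard error.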


To instead obtain the ME evaluated at the sample mean of regressors, we 
add the atmeans option to obtain 


. * MEM of income change for outcome 3
. margins, dydx(*) predict(outcome(3)) atmeans noatlegend

Conditional marginal effects                    Number of obs = 1,182
Model VCE: Robust

Expression: Pr(mode==Private), predict(outcome(3))
dy/dx wrt:  income

                         Delta-method
                   dy/dx   std. err.       z    P>|z|    [95% conf. interval]
------------------------------------------------------------------------------
      income    .0325985      .00567     5.75   0.000     .0214856    .0437115


For the average individual, a one-unit change in income increases by 0.0326 
the probability of fishing from a private boat rather than from the other 
fishing sites. The AME and MEM are quite similar in this example. Usually, 
there is greater difference in these ME measures following mlogit estimation. 
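The MER is obtained by using the at() option of margins. In the following sketch, the income values of 2, 4, and 6 (thousands of dollars per month) are arbitrary illustrative choices within the sample range of income; the wide-form data are assumed to be in memory.

```stata
* MER of income change for outcome 3 at selected income levels
quietly mlogit mode income, baseoutcome(1) vce(robust)
margins, dydx(income) predict(outcome(3)) at(income=(2 4 6))
```

Comparing the three MEs shows how the effect of income on the probability of private boat fishing varies with the point of evaluation.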


18.5 Alternative-specific conditional logit model 


Some multinomial studies use richer datasets that include alternative-specific 
variables, such as prices and quality measures for all alternatives, not just the 
chosen alternative. The CL model is used with these data. 


18.5.1 Creating long-form data from wide-form data 


The parameters of CL models are fit with commands that require the data to
be in long form, with one observation providing the data for just one 
alternative for an individual. 


Some datasets will already be in long form, but that is not the case here. 
Instead, mus218hk.dta is in wide form, with one observation containing data 
for all four alternatives for an individual. For example, 


. * Data are in wide form
. list mode price pbeach ppier pprivate pcharter in 1, clean

          mode    price   pbeach    ppier   pprivate   pcharter
  1.   Charter   182.93   157.93   157.93     157.93     182.93

The first observation has data for the price of all four alternatives. The
chosen mode was charter, so price was set to equal pcharter.


To convert data from wide form to long form, we use the reshape
command, introduced in section 8.10. Here the long form will have four
observations for each individual according to whether the suffix is beach,
pier, private, or charter. These suffixes are strings, rather than the
reshape command's default of numbers, so we use reshape with the string
option. For completeness, we actually provide the four suffixes.


. * Convert data from wide form to long form
. generate id = _n

. reshape long d p q, i(id) j(fishmode beach pier private charter) string

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations            1,182   ->   4,728
Number of variables                  22   ->   14
j variable (4 values)                     ->   fishmode
xij variables:
              dbeach dpier ... dcharter   ->   d
              pbeach ppier ... pcharter   ->   p
              qbeach qpier ... qcharter   ->   q
-----------------------------------------------------------------------------

. save mus218hklong, replace
(file mus218hklong.dta not found)
file mus218hklong.dta saved


There are now four observations for the first individual or case. If we had
not provided the four suffixes, the reshape command would have
erroneously created a fifth alternative, rice, from price, which, like pbeach,
ppier, pprivate, and pcharter, also begins with the letter p.

To view the resulting long-form data for the first individual case, we list
the first four observations.


. * List data for the first case after reshape
. list in 1/4, clean noobs

    id   fishmode      mode    price   crate   d        p       q     income
>   _est_MNL   pmlogit1   pmlogit2   pmlogit3   pmlogit4
     1      beach   Charter   182.93   .5391   0   157.93   .0678   7.083332
>          1   .1125092   .0919656   .4516733   .3438518
     1    charter   Charter   182.93   .5391   1   182.93   .5391   7.083332
>          1   .1125092   .0919656   .4516733   .3438518
     1       pier   Charter   182.93   .5391   0   157.93   .0503   7.083332
>          1   .1125092   .0919656   .4516733   .3438518
     1    private   Charter   182.93   .5391   0   157.93   .2601   7.083332
>          1   .1125092   .0919656   .4516733   .3438518

The order is no longer beach, pier, private boat, and then charter boat.
Instead, it is now beach, charter boat, pier, and then private boat because the
observations are sorted in the alphabetical order of fishmode. For this first
case, the outcome variable, d, equals 1 for charter boat fishing, as
expected. The four separate observations on the alternative-specific
variables, p and q, are the different values for price and quality for the four
alternatives.


All case-specific variables appear as a single variable that takes on the 
same value for the four outcomes. For income, this is no problem. But mode, 
price, and crate are misleading here. The mode variable indicates that for 
case 1, the fishing mode was mode=4 because in original wide form, this 
corresponded to charter boat fishing. But d=1 for the second observation of 
the first case because this corresponds to charter boat fishing in the reordered 
long form. It would be best to simply drop the misleading variables by 
typing drop mode price crate because these variables are not needed. 
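Before estimation, it can be worth confirming that the long-form data have the expected structure. The following sketch assumes the reshaped data are in memory; the variable name nchosen is introduced here for illustration, and the final drop of mode, price, and crate is the optional step suggested above.

```stata
* Each case should have exactly four alternatives and one chosen alternative
bysort id: egen byte nchosen = total(d)
assert nchosen == 1
by id: assert _N == 4
drop nchosen
* Optional: remove the wide-form variables that are misleading in long form
drop mode price crate
```

The assert commands produce no output when the conditions hold and stop with an error otherwise.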


18.5.2 The cmset command 


Most models with alternative-specific regressors are estimated using cm 
commands for choice models. These commands require first using the cmset 
command, which has the syntax 


cmset caseidvar altvar [, force]


where caseidvar identifies the case (individual) and altvar identifies the 
alternatives (choice sets). 


For the current example, we have 


. * cmset before use of cm commands for alternative-specific regressors 
. cmset id fishmode 


Case ID variable: id 
Alternatives variable: fishmode 


The cmsummarize command provides data summary by chosen 
alternative. The following example provides the number of observations and 
the mean for the regressor variables. 


. * Summarize the data by chosen alternative specific
. cmsummarize p q income, choice(d) statistics(N mean)

Statistics by chosen alternatives (d = 1)
income is constant within case

Summary statistics: N, Mean
Group variable: _chosen_alternative (d = 1)

_chosen_alternative           p           q      income
---------------------------------------------------------
beach                       134         134         134
                       35.69949    .2791948    4.051617
charter                     452         452         452
                       75.09694    .6914998      3.8809
pier                        178         178         178
                       30.57133    .2025348    3.387172
private                     418         418         418
                       41.60681    .1775411    4.654107
---------------------------------------------------------
Total                      1182        1182        1182
                       52.08197    .3893684    4.099337


The cmtab command provides tabulations rather than summary statistics. 
The cmchoiceset command provides sample sizes for the choice sets facing 
individuals and is useful when, unlike the current example, the data are 
unbalanced with different individuals facing different choice sets. The 
cmsample command lists the reasons why observations in a choice model 
were dropped from the estimation sample. 


18.5.3 The cmclogit command 


When some or all regressors are alternative specific, the CL model is used.
The CL model specifies that

$$p_{ij} = \frac{\exp(\mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_i'\boldsymbol{\gamma}_j)}{\sum_{l=1}^{m} \exp(\mathbf{x}_{il}'\boldsymbol{\beta} + \mathbf{z}_i'\boldsymbol{\gamma}_l)}, \quad j = 1, \ldots, m \tag{18.7}$$

where $\mathbf{x}_{ij}$ are alternative-specific regressors and $\mathbf{z}_i$ are case-specific
regressors. To ensure model identification, we set one of the $\boldsymbol{\gamma}_j$ to zero, as
with the MNL model. Some authors call the model above a mixed logit
model, with CL used to refer to a more restrictive model that has only
alternative-specific regressors.


The cmclogit command, an acronym for choice-model (or alternative- 
specific) conditional logit, has the syntax 


cmclogit depvar [indepvars] [if] [in] [weight] [, options]


where indepvars are the alternative-specific regressors. The identifier for 
each case or individual and the possible alternatives have already been 
specified in the cmset command. 


The casevars() option specifies case-specific variables. The
basealternative() option specifies the alternative that is to be used as the
base category, which affects only the coefficients of case-specific regressors.
The altwise option deletes only the data for an alternative, rather than all
data for the case, if data are missing.


The noconstant option overrides the Stata default of including case- 
specific intercepts. Attributes of each alternative are then explained solely by 
alternative-specific regressors if noconstant is used. The case-specific 
intercepts provided by the default estimator are interpreted as reflecting the 
desirability of each alternative because of unmeasured attributes of the 
alternative. 


The cmclogit command allows the choice set to vary across individuals 
and allows more than one alternative to be selected. 


18.5.4 Application of the cmclogit command 


We estimate the parameters of the CL model to explain fishing-mode choice 
given alternative-specific regressors on price and quality; the case-specific 
regressor, income; and case-specific intercepts. As for the MNL model, beach 
fishing is set to be the base category. We have 


. * Conditional logit with alternative-specific and case-specific regressors
. cmclogit d p q, casevars(income) basealternative(beach) nolog vce(robust)

Conditional logit choice model                 Number of obs      =  4,728
Case ID variable: id                           Number of cases    =   1182

Alternatives variable: fishmode                Alts per case: min =      4
                                                              avg =    4.0
                                                              max =      4

                                               Wald chi2(5)       = 178.32
Log pseudolikelihood = -1215.1376              Prob > chi2        = 0.0000

                                    (Std. err. adjusted for clustering on id)
                              Robust
           d   Coefficient   std. err.       z    P>|z|    [95% conf. interval]
------------------------------------------------------------------------------
fishmode
           p    -.0251166    .0023261   -10.80   0.000    -.0296757   -.0205575
           q      .357782    .1173829     3.05   0.002     .1277157    .5878482
beach          (base alternative)
charter
      income    -.0332917    .0493558    -0.67   0.500    -.1300273    .0634438
       _cons     1.694366    .2206146     7.68   0.000     1.261969    2.126762
pier
      income    -.1275771    .0547194    -2.33   0.020    -.2348252    -.020329
       _cons     .7779593    .2311108     3.37   0.001     .3249905    1.230928
private
      income     .0894398     .047773     1.87   0.061    -.0041936    .1830732
       _cons     .5272788    .2106221     2.50   0.012     .1144671    .9400905


. estimates store CL 


The first set of estimates are the coefficients $\hat{\boldsymbol{\beta}}$ for the alternative-specific
regressors price and quality. The next three sets of estimates are for the case-
specific intercepts and regressors. The coefficients are, respectively, $\hat{\boldsymbol{\gamma}}_{\text{charter}}$,
$\hat{\boldsymbol{\gamma}}_{\text{pier}}$, and $\hat{\boldsymbol{\gamma}}_{\text{private}}$ because we used the normalization $\boldsymbol{\gamma}_{\text{beach}} = \mathbf{0}$.

The output header does not give the pseudo-$R^2$, but this can be
computed by using the formula given in section 13.8.1. Here
$\ln L_{\text{fit}} = -1215.1$, and estimation of an intercepts-only model yields
$\ln L_0 = -1497.7$, so $R^2 = 1 - (-1215.1)/(-1497.7) = 0.189$, much
higher than the 0.014 for the MNL model in section 18.4.2. The regressors p,
q, and income are highly jointly statistically significant with Wald
chi2(5)=178. The test command can be used for individual Wald tests, or
the lrtest command can be used for likelihood-ratio (LR) tests if estimation
uses default standard errors.

The CL model in this section reduces to the MNL model in section 18.4.2 if
$\beta_p = 0$ and $\beta_q = 0$. Using either a Wald test or an LR test, this hypothesis is
strongly rejected, and the CL model is the preferred model.
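Both tests of the CL model against the MNL model can be sketched as follows, assuming the long-form data are in memory, cmset has been run, and the robust fit has been stored as CL. The names CL_ml and MNL_ml are introduced here for illustration. The LR test requires refitting both models with default standard errors.

```stata
* Wald test of the alternative-specific regressors using the robust fit
estimates restore CL
test p q
* LR test: refit unrestricted and restricted models with default VCE
quietly cmclogit d p q, casevars(income) basealternative(beach)
estimates store CL_ml
quietly cmclogit d, casevars(income) basealternative(beach)
estimates store MNL_ml
lrtest CL_ml MNL_ml
* Make the robust CL fit the active results again
estimates restore CL
```

The restricted model has two fewer parameters, so the LR statistic is compared with the chi-squared distribution with two degrees of freedom.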


18.5.5 Relationship to MNL model 


The MNL and CL models are essentially equivalent. The mlogit command is 
designed for case-specific regressors and data in wide form. The cmclogit 
command is designed for alternative-specific regressors and data in long 
form. 


The parameters of the MNL model can be fit by using cmclogit as the 
special case with no alternative-specific regressors. Thus, 


. * MNL is CL with no alternative-specific regressors 
. cmclogit d, casevars(income) basealternative(beach) vce(robust) nolog 


(output omitted ) 


yields the same estimates as the earlier mlogit command. When all 
regressors are case specific, it is easiest to use mlogit with data in wide 
form. 


Going the other way, one can estimate the parameters of a CL model 
using mlogit. This is more difficult because it requires transforming 
alternative-specific regressors to deviations from the base category and then 
imposing parameter-equality constraints. For CL models, cmclogit is much 
easier to use than mlogit. 


18.5.6 Coefficient interpretation 


Coefficients of alternative-specific regressors are easily interpreted. The
alternative-specific regressor can be denoted by $x_r$ with the coefficient $\beta_r$.
The effect of a change in $x_{rik}$, which is the value of $x_r$ for individual $i$ and
alternative $k$, is

$$\frac{\partial p_{ij}}{\partial x_{rik}} = \begin{cases} p_{ij}(1 - p_{ij})\beta_r & j = k \\ -p_{ij}p_{ik}\beta_r & j \neq k \end{cases}$$

If $\beta_r > 0$, then the own-effect is positive because $p_{ij}(1 - p_{ij})\beta_r > 0$, and
the cross-effect is negative because $-p_{ij}p_{ik}\beta_r < 0$. So a positive coefficient
means that if the regressor increases for one category, then that category is
chosen more and other categories are chosen less, and vice versa for a
negative coefficient. Here the negative price coefficient of -0.025 means
that if the price of one mode of fishing increases, then demand for that mode
decreases, and demand for all other modes increases, as expected. For catch
rate, the positive coefficient of 0.358 means a higher catch rate for one mode
of fishing increases the demand for that mode and decreases the demand for
the other modes.
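The own-effect formula can be evaluated by hand from the fitted model. The following is a sketch assuming the long-form data are in memory and the robust fit has been stored as CL; the variable names phat_cl and own_p are introduced here for illustration. Because the equation for the alternative-specific regressors is listed first in the output, _b[p] refers to the price coefficient.

```stata
* Own-price effect p_ij(1 - p_ij)*beta_p for each case and alternative
estimates restore CL
predict double phat_cl, pr
generate double own_p = phat_cl*(1 - phat_cl)*_b[p]
* Average own-price effect for each alternative
tabstat own_p, by(fishmode) statistics(mean)
drop phat_cl own_p
```

The four means are the sample averages of the own-price effects and can be compared with the own-effect AMEs that margins reports for price.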


Coefficients of case-specific regressors are interpreted as parameters of a
binary logit model against the base category; see section 18.4.3 for the MNL
model. The income coefficients of -0.033, -0.128, and 0.089 mean that,
relative to the probability of beach fishing, an increase in income leads to a
decrease in the probability of charter boat and pier fishing and an increase in
the probability of private boat fishing.


18.5.7 Predicted probabilities 


Predicted probabilities can be obtained using the predict command with the 
pr option. This provides a predicted probability for each observation, where 
an observation is one alternative for one individual because the data are in 
long form. 


To obtain predicted probabilities for each of the four alternatives, we 
need to summarize by fishmode. We use the table command because this 
gives condensed output. Much lengthier output is obtained by instead using 
the bysort fishmode: summarize command. We have 


. * Predicted probabilities of choice of each mode and compare to actual freqs
. predict pasclogit, pr

. encode fishmode, generate(fishmode2)

. table fishmode2, stat(mean d pasclogit) stat(sd pasclogit) nototal
>     nformat(%6.4f)

                         Mean                       Standard deviation
                 d   Pr(fishmode|1 selected)     Pr(fishmode|1 selected)
------------------------------------------------------------------------------
fishmode2
  beach     0.1134                    0.1134                      0.1285
  charter   0.3824                    0.3824                      0.1566
  pier      0.1506                    0.1506                      0.1614
  private   0.3536                    0.3536                      0.1665


As for MNL, the sample average predicted probabilities are equal to the
sample probabilities. The standard deviations of the CL model predicted
probabilities (all in excess of 0.10) are much larger than those for the MNL
model, so the CL model predicts better. A summary is also provided by the
estat alternatives command.


A quite different predicted probability is that of a new alternative. This is 
possible for the CL model if the parameters of that model are estimated using 
only alternative-specific regressors, which requires use of the noconstant 
option so that case-specific intercepts are not included, and the values of 
these regressors are known for the new category. 


For example, we may want to predict the use of a new mode of fishing 
that has a much higher catch rate than the currently available modes but at 
the same time has a considerably higher price. The parameters, $\beta$, in (18.7) 
are estimated with m alternatives, and then predicted probabilities are 
computed by using (18.7) with m + 1 alternatives. 
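

The calculation can be sketched in Mata as follows. This is only an illustration under stated assumptions: the coefficient vector b and the price and catch-rate values in X are hypothetical numbers, not estimates from the fishing data. In practice, b would come from a cmclogit fit that uses the noconstant option and only alternative-specific regressors. 

```stata
mata:
// Hypothetical coefficients on price and catch rate (illustrative only)
b = (-0.02 \ 0.9)
// Rows are the m = 4 existing alternatives; columns are price and catch rate
// (made-up values for one individual)
X = (100, 0.10 \ 90, 0.60 \ 95, 0.15 \ 85, 0.50)
// Append a new mode with a much higher catch rate and a higher price
Xnew = X \ (160, 1.50)
// CL probabilities exp(x'b)/sum(exp(x'b)) with m and with m + 1 alternatives
p_m  = exp(X*b)    :/ sum(exp(X*b))
p_m1 = exp(Xnew*b) :/ sum(exp(Xnew*b))
p_m'
p_m1'
end
```

The last element of p_m1 is the predicted probability of the new mode for this individual, and the remaining probabilities shrink proportionately, as the IIA property of the CL model implies. 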


18.5.8 MEs 


The usual margins command is available after the cmclogit command and 
can be used to obtain the AME, MEM, and MER. Options for this command 
include the option alternatives(), which provides margins and MEs for 
only the specific alternatives listed. 


We compute for all alternatives the AME for just the regressor price. We 
obtain 


. * AME of change in price 
. margins, dydx(p) 

Average marginal effects                            Number of obs = 4,728 
Model VCE: Robust 

Expression: Pr(fishmode|1 selected), predict() 
dy/dx wrt:  p 

                                  Delta-method 
_outcome#fishmode       dy/dx      std. err.    P>|z|     [95% conf. interval] 

beach#beach          -.0021102    .0002213     0.000    -.0025439   -.0016765 
beach#charter         .0006469    .0000619     0.000     .0005257    .0007682 
beach#pier            .0009075    .0001373     0.000     .0006384    .0011765 
beach#private         .0005558    .0000495     0.000     .0004588    .0006527 
charter#beach         .0006469    .0000619     0.000     .0005257    .0007682 
charter#charter      -.0053165    .0004529     0.000    -.0062042   -.0044288 
charter#pier          .0009157    .0000767     0.000     .0007654     .001066 
charter#private       .0037539    .0003704     0.000     .0030279    .0044798 
pier#beach            .0009075    .0001373     0.000     .0006384    .0011765 
pier#charter          .0009157    .0000767     0.000     .0007654     .001066 
pier#pier            -.0025593    .0002376     0.000    -.0030249   -.0020936 
pier#private          .0007361    .0000608     0.000     .0006169    .0008553 
private#beach         .0005558    .0000495     0.000     .0004588    .0006527 
private#charter       .0037539    .0003704     0.000     .0030279    .0044798 
private#pier          .0007361    .0000608     0.000     .0006169    .0008553 
private#private      -.0050457    .0004295     0.000    -.0058875   -.0042039 


There are 16 MEs in all, corresponding to probabilities of four alternatives 
times prices for each of the four alternatives. All own-effects are negative 
and all cross-effects are positive, as explained in section 18.5.6. The first ME 
given in the output takes value -0.00211, which means that on average a $1 
increase in the price of beach fishing decreases the probability of beach 
fishing by 0.00211. The second value of 0.000647 means that a $1 increase 
in the price of charter boat fishing increases beach fishing probability by 
0.000647, and so on. 


18.5.9 The clogit command 


The CL model can also be fit by using the clogit command, yielding the 
same results. The clogit command is designed for grouped data used in 
matched case-control group studies and is similar to the xtlogit command 
used for panel data grouped over time for an individual. 


The clogit command does not have an option for case-specific 
variables. Instead, a case-specific variable is interacted with dummies for 
m - 1 alternatives, and the m - 1 variables are entered as regressors. For 
applications such as the one studied in this chapter, cmclogit is easier to use 
than clogit. 
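

A hedged sketch of the equivalent clogit specification follows. It assumes the long-form variables used in this chapter (d, p, q, income, id, and the string variable fishmode) and takes beach as the base category. The variable names d_charter, y_charter, and so on are introduced here only for illustration. 

```stata
* Sketch: CL model via clogit, with beach as the base category
* Alternative dummies and their interactions with the case-specific income
foreach alt in charter pier private {
    generate byte d_`alt' = (fishmode == "`alt'")
    generate y_`alt' = d_`alt' * income
}
* group(id) identifies the case; the m - 1 dummies give the intercepts
* and the m - 1 interactions give the income coefficients
clogit d p q d_charter d_pier d_private y_charter y_pier y_private, ///
    group(id) vce(cluster id)
```

The coefficients of the interaction terms play the role of the case-specific income coefficients reported by cmclogit. 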


18.5.10 Rank-ordered logit 


The CL model is used when one out of m mutually exclusive alternatives is 
chosen. An alternative form of data is one where the m alternatives, or a 
subset of the m alternatives, are rank ordered. 


Then one can fit an m-alternative CL model for the highest-ranked 
alternative, an (m - 1)-alternative CL model for the second-ranked 
alternative, and so on. The cmrologit command implements this procedure. 
The additional ranking data may lead to more precise estimates, though it is 
being assumed that the same underlying parametric model applies for the 
preferred alternative, the second preferred alternative, and so on. 


18.6 Nested logit model 


The MNL and CL models are the most commonly used multinomial models, 
especially in other branches of applied statistics. However, in 
microeconometrics applications that involve individual choice, the models 
are viewed as placing restrictions on individual decision making that are 
unrealistic, as explained below. 


The simplest generalization is an NL model. Two variants of the NL model 
are used. The preferred variant is one based on the ARUM. This is the model 
we present and is the default model for Stata 17. A second variant was used 
by most packages in the past, including Stata before version 10. Both 
variants have MNL and CL as special cases, and both ensure that multinomial 
probabilities lie between 0 and 1 and sum to 1. But the variant based on 
ARUM is preferred because it is consistent with utility maximization. 


18.6.1 Relaxing the independence of irrelevant alternatives assumption 


The MNL and CL models impose the restriction that the choice between any 
pair of alternatives is simply a binary logit model; see (18.6). This 
assumption, called the independence of irrelevant alternatives (IIA) 
assumption, can be too restrictive, as illustrated by the “red bus/blue bus” 
problem. Suppose commute-mode alternatives are car, blue bus, or red bus. 
The IIA assumption is that the probability of commuting by car, given 
commuting by either car or red bus, is independent of whether commuting 
by blue bus is an option. But the introduction of a blue bus, same as a red 
bus in every aspect except color, should have little impact on car use and 
should halve use of red bus, leading to an increase in the conditional 
probability of car use given commuting by car or red bus. 
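

The restriction can be seen in a small numerical sketch. The systematic utilities below are hypothetical and set equal across modes purely for illustration. 

```stata
mata:
// Hypothetical systematic utilities: all modes equally attractive
v2 = (0, 0)                       // car, red bus
v3 = (0, 0, 0)                    // car, red bus, blue bus
// Logit choice probabilities without and with the blue bus
p2 = exp(v2) :/ sum(exp(v2))
p3 = exp(v3) :/ sum(exp(v3))
p2
p3
// Pr(car | car or red bus) is unchanged by adding the blue bus
p2[1] / (p2[1] + p2[2])
p3[1] / (p3[1] + p3[2])
end
```

With these values, the logit model moves car use from one-half to one-third when the blue bus is added and leaves the conditional probability at one-half. The intuitive outcome, with car use staying at one-half and each bus taking one-quarter, would instead raise that conditional probability to two-thirds. 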


This limitation has led to alternative richer models for unordered choice 
based on the ARUM introduced in section 18.2.5. The MNL and CL models can 
be shown to arise from the ARUM if the errors, $\varepsilon_{ij}$, in (18.3) are independent 
and identically distributed as type I extreme value. Instead, in the red 
bus/blue bus example, we expect the blue bus error, $\varepsilon_{i2}$, to be highly 
correlated with the red bus error, $\varepsilon_{i3}$, because if we overpredict the red bus 
utility given the regressors, then we will also overpredict the blue bus utility. 


More general multinomial models, presented in this and subsequent 
sections, allow for correlated errors. The NL is the most tractable of these 
models. 


18.6.2 NL model 


The NL model requires that a nesting structure be specified that splits the 
alternatives into groups, where errors in the ARUM are correlated within 
group but are uncorrelated across groups. We specify a two-level NL model, 
though additional levels of nesting can be accommodated, and assume a 
fundamental distinction between shore and boat fishing. The tree is 


                   Mode 
                 /      \ 
            Shore        Boat 
           /     \      /     \ 
       Beach    Pier  Charter  Private 


The shore/boat contrast is called level 1 (or a limb), and the next level is 
called level 2 (or a branch). The tree can be viewed as a decision tree—first 
decide whether to fish from shore or boat, and then decide between beach 
and pier (if shore) or between charter and private (if boat). But this 
interpretation of the tree is not necessary. The key is that the NL model 
permits correlation of errors within each of the level-2 groupings. Here 
$(\varepsilon_{i,\text{beach}}, \varepsilon_{i,\text{pier}})$ are a bivariate correlated pair, 
$(\varepsilon_{i,\text{private}}, \varepsilon_{i,\text{charter}})$ are a 
bivariate correlated pair, and the two pairs are independent. The CL model is 
the special case where all errors are independent. 


More generally, denote alternatives by subscripts (j, k), where j denotes 
the limb (level 1) and k denotes the branch (level 2) within the limb, and 
different limbs can have different numbers of branches, including just one 
branch. For example, (2,3) denotes the third alternative in the second limb. 
The two-level random utility is defined to be 


$$V_{jk} + \varepsilon_{jk} = z_j'\alpha + x_{jk}'\beta_j + \varepsilon_{jk}, \qquad j = 1, \ldots, J, \quad k = 1, \ldots, K_j$$ 


where $z_j$ varies over limbs only and $x_{jk}$ varies over both limbs and 
branches. For ease of exposition, we have suppressed the individual 
subscript i, and we consider only alternative-specific regressors. (If all 
regressors are instead case specific, then we have $z'\alpha_j + x'\beta_{jk} + \varepsilon_{jk}$ with 
one of the $\beta_{jk} = 0$.) The NL model assumes that $(\varepsilon_{j1}, \ldots, \varepsilon_{jK_j})$ are 
distributed as Gumbel's multivariate extreme-value distribution. Then the 
probability that alternative (j, k) is chosen equals 


$$p_{jk} = p_j \times p_{k|j} = \frac{\exp(z_j'\alpha + \tau_j I_j)}{\sum_{m=1}^{J} \exp(z_m'\alpha + \tau_m I_m)} \times \frac{\exp(x_{jk}'\beta_j/\tau_j)}{\sum_{l=1}^{K_j} \exp(x_{jl}'\beta_j/\tau_j)}$$ 


where $I_j = \ln\{\sum_{l=1}^{K_j} \exp(x_{jl}'\beta_j/\tau_j)\}$ is called the inclusive value or the 
log sum. The NL probabilities are the product of probabilities $p_j$ and $p_{k|j}$, 
which are essentially of CL form. The model produces positive probabilities 
that sum to one for any value of $\tau_j$, called dissimilarity parameters. But the 
ARUM restricts $0 \le \tau_j \le 1$, and values outside this range mean the model, 
while mathematically correct, is inconsistent with random-utility theory. 
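

A minimal Mata sketch of the NL probability calculation for the shore/boat tree follows. All numbers are hypothetical: tau holds assumed dissimilarity parameters, v_shore and v_boat hold assumed values of the branch-level index, and the limb-level index is set to zero because the level-1 equation has no regressors. 

```stata
mata:
// Hypothetical dissimilarity parameters for the shore and boat limbs
tau = (0.6, 0.8)
// Hypothetical branch-level indexes x'b within each limb
v_shore = (0.2, -0.1)             // beach, pier
v_boat  = (0.5,  0.4)             // charter, private
// Inclusive values I_j = ln{sum_l exp(x_jl'b / tau_j)}
I = (ln(sum(exp(v_shore :/ tau[1]))), ln(sum(exp(v_boat :/ tau[2]))))
// Limb probabilities p_j (limb-level index set to 0)
pj = exp(tau :* I) :/ sum(exp(tau :* I))
// Conditional branch probabilities p_k|j, each of CL form
pk_shore = exp(v_shore :/ tau[1]) :/ sum(exp(v_shore :/ tau[1]))
pk_boat  = exp(v_boat  :/ tau[2]) :/ sum(exp(v_boat  :/ tau[2]))
// NL probabilities p_jk = p_j * p_k|j: beach, pier, charter, private
p = (pj[1] :* pk_shore, pj[2] :* pk_boat)
p
sum(p)
end
```

The four probabilities sum to one by construction. Setting both elements of tau to 1 collapses the calculation to the CL probabilities. 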


18.6.3 The nlogit command 


The Stata commands for NL have complicated syntax that we briefly 
summarize. It is simplest to look at the specific application in this section 
and see [CM] nlogit for further details. 


The first step is to specify the tree structure. The nlogitgen command 
has the syntax 


nlogitgen newaltvar = altvar (branchlist) [, [no]log] 


The altvar variable is the original variable defining the possible alternatives, 
and newaltvar is a created variable necessary for nlogit to know what 
nesting structure should be used. Here branchlist is 


branch, branch [, branch ...] 


and branch is 


[label:] alternative [| alternative [| alternative ...]] 


There must be at least two branches, and each branch has one or more 
alternatives. 


The nesting structure can be displayed by using the nlogittree 
command with the syntax 


nlogittree altvarlist [if] [in] [weight] [, options] 


A useful option is choice(depvar), which lists sample frequencies for each 
alternative. 


Estimation of model parameters uses the nlogit command with the 
syntax 


nlogit depvar [indepvars] [if] [in] [weight] [|| lev1_equation [|| lev2_equation ...]] 
|| altvar: [byaltvarlist], case(varname) [options] 


where indepvars are the alternative-specific regressors and case-specific 
regressors are introduced in lev#_equation. The syntax of lev#_equation is 


altvar: [byaltvarlist] [, base(#|lbl) estconst] 


The case(varname) option provides the identifier for each case (individual). 


The NL commands use data in long form, as did cmclogit. 


18.6.4 Model estimates 


We first define the nesting structure by using the nlogitgen command. Here 
we define a variable, type, that is called shore for the pier and beach 
alternatives and is called boat for the private and charter alternatives. 


. * Define the tree for nested logit 
. nlogitgen type = fishmode(shore: pier | beach, boat: private | charter) 
New variable type is generated with 2 groups 
. label list lb_type 
lb_type: 
           1 shore 
           2 boat 


The tree can be checked by using the nlogittree command. We have 


. * Check the tree 
. nlogittree fishmode type, choice(d) 

Tree structure specified for the nested logit model 

 type      N        fishmode     N     k 

 shore  2364  ---  beach      1182   134 
              +--  pier       1182   178 
 boat   2364  ---  charter    1182   452 
              +--  private    1182   418 

                      Total   4728  1182 

k = number of times alternative is chosen 
N = number of observations at each level 


The tree is as desired, so we are now ready to estimate with nlogit. 
First, list the dependent variable and the alternative-specific regressors. 
Then, define the level-1 equation for type, which here includes no 
regressors. Finally, define the level-2 equations that here have the regressors 
income and an intercept. We use the notree option, which suppresses the 
tree, because it was already output with the nlogittree command. We have 


. * Nested logit model estimate 
. nlogit d p q || type:, base(shore) || fishmode: income, case(id) notree 
> vce(robust) nolog 
note: the LR test for IIA will not be computed. 

RUM-consistent nested logit regression        Number of obs      =   4,728 
Case variable: id                             Number of cases    =    1182 

Alternative variable: fishmode                Alts per case: min =       4 
                                                             avg =     4.0 
                                                             max =       4 

                                              Wald chi2(5)       =  119.60 
Log pseudolikelihood = -1192.4116             Prob > chi2        =  0.0000 

                                 (Std. err. adjusted for clustering on id) 

                             Robust 
         d   Coefficient   std. err.       z    P>|z|     [95% conf. interval] 

fishmode 
         p    -.0267478    .0026455   -10.11   0.000    -.0319328   -.0215628 
         q     1.346875    .2661627     5.06   0.000     .8252062    1.868545 

fishmode equations 

beach 
    income            0  (base) 
     _cons            0  (base) 
charter 
    income    -5.483238    12.83935    -0.43   0.669     -30.6479    19.68143 
     _cons     48.44557    101.3858     0.48   0.633     -150.267    247.1581 
pier 
    income    -6.441663    13.31746    -0.48   0.629    -32.54341    19.66009 
     _cons     39.98192    84.51765     0.47   0.636    -125.6696    205.6335 
private 
    income     -1.29216    2.208674    -0.59   0.559    -5.621081    3.036761 
     _cons     28.30816    60.99204     0.46   0.643    -91.23404    147.8504 

dissimilarity parameters 

/type 
 shore_tau     56.08891    124.0073                     -186.9609    299.1387 
  boat_tau     32.69379    88.47634                     -140.7166    206.1042 

. estimates store NL 


The coefficient of variable p is little changed compared with the CL model, 
but the other coefficients changed considerably. 


The NL model reduces to the CL model if the two dissimilarity parameters 
are both equal to 1. The dissimilarity parameters are much greater than 1, 
albeit not statistically different from 1. This is not an unusual finding for NL 
models; it means that while the model is mathematically correct, with 
probabilities between 0 and 1 that add up to 1, the fitted model is not 
consistent with the ARUM. A joint Wald test that the two dissimilarity 
parameters both equal 1 has p = 0.646, so the CL model is not rejected. At 
the same time, if default standard errors are used, the output includes an LR 
test with p = 0.000 that strongly rejects the CL model. 
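

A joint Wald test of this kind might be coded along the following lines. This is a sketch, not the exact commands behind the reported p-value. The parameter names [/type]shore_tau and [/type]boat_tau are an assumption based on the labels in the output above, so the coefficient legend is listed first to confirm how the parameters are actually stored. 

```stata
* Sketch: joint Wald test that both dissimilarity parameters equal 1
estimates restore NL
* List the stored coefficient names (the names used below are assumptions)
nlogit, coeflegend
* Joint test of H0: shore_tau = 1 and boat_tau = 1
test ([/type]shore_tau = 1) ([/type]boat_tau = 1)
```

If the legend shows different names, substitute them in the test command. 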


18.6.5 Predicted probabilities 


The predict command with the pr option provides predicted probabilities 
for level 1, level 2, and so on. Here there are two levels. The first-level 
probabilities are for shore or boat. The second-level probabilities are for 
each of the four alternatives. We have 


. * Predict level 1 and level 2 probabilities from NL model 
. predict plevel1 plevel2, pr 

. tabulate fishmode, summarize(plevel2) 

                  Summary of Pr(fishmode alternatives) 
   fishmode         Mean    Std. dev.       Freq. 

      beach    .11320342    .13343115       1,182 
    charter    .38065864    .15748834       1,182 
       pier    .15075492    .16988406       1,182 
    private    .35538302     .1642526       1,182 

      Total          .25    .19693631       4,728 


The average predicted probabilities for NL no longer equal the sample 
probabilities, but they are quite close. The variation in the predicted 
probabilities, as measured by the standard deviation, is essentially the same 
as that for the CL model predictions, given in section 18.5.7. 


18.6.6 MEs 


The nlogit command is one of very few commands for which the margins 
postestimation command, or the older estat mfx command, is not available. 
Instead, we compute the AMEs of a price change manually. We obtain 


. * AME of beach price change computed manually 
. preserve 

. qui summarize p 

. generate delta = r(sd)/1000 

. qui replace p = p + delta if fishmode == "beach" 

. predict pnew1 pnew2, pr 

. generate dpdbeach = (pnew2 - plevel2)/delta 

. tabulate fishmode, summarize(dpdbeach) 

                     Summary of dpdbeach 
   fishmode          Mean    Std. dev.       Freq. 

      beach    -.00054265    .00048508       1,182 
    charter     .00063417    .00054757       1,182 
       pier    -.00064882    .00057087       1,182 
    private     .00055729    .00051243       1,182 

      Total    -1.047e-09    .00079865       4,728 

. restore 


Compared with the CL model, there is little change in the ME of beach price 
change on the probability of charter and private boat fishing. But now, the 
probability of pier fishing falls in addition to the probability of beach 
fishing. 


18.6.7 Comparison of various MNL models 


The following table summarizes key output from fitting the preceding MNL, 
CL, and NL models. We have 


. * Summary statistics for the logit models 
. estimates table MNL CL NL, keep(p q) stats(N ll aic bic) equation(1) 
> b(%7.3f) stfmt(%7.0f) 


    Variable       MNL        CL        NL 

           p               -0.025    -0.027 
           q                0.358     1.347 

           N      1182      4728      4728 
          ll     -1477     -1215     -1192 
         aic      2966      2446      2405 
         bic      2997      2498      2469 

The information criteria, the Akaike information criterion and Bayesian 
information criterion, are presented in section 13.8.2; lower values are 
preferred. MNL is least preferred, and NL is most preferred. 


In this example, the three multinomial models are actually nested, so we 
can choose between them by using LR tests. From the discussion of the CL 
and NL models, NL is again preferred to CL, which in turn is preferred to MNL. 


All three models use the same amount of data. The CL and NL model 
entries have an N that is four times that for MNL because they use data in 
long form, leading to four “observations” per individual. 


18.7 Multinomial probit model 


The MNP model, like the NL model, allows relaxation of the IIA assumption. It has the 
advantage of allowing a much more flexible pattern of error correlation and does 
not require the specification of a nesting structure. 


18.7.1 MNP 


The MNP is obtained from the ARUM of section 18.2.5 by assuming normally 
distributed errors. 


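For the ARUM, the utility of alternative j is 


$$U_{ij} = x_{ij}'\beta + z_i'\gamma_j + \varepsilon_{ij}$$ 


where the errors are assumed to be normally distributed, with $\varepsilon \sim N(0, \Sigma)$, where 
$\varepsilon = (\varepsilon_{i1}, \ldots, \varepsilon_{im})$. 


Then, from (18.4), the probability that alternative j is chosen equals 


$$p_{ij} = \Pr(y_i = j) = \Pr\{\varepsilon_{ik} - \varepsilon_{ij} \le (x_{ij} - x_{ik})'\beta + z_i'(\gamma_j - \gamma_k), \text{ for all } k\} \qquad (18.8)$$ 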


This is an (m - 1)-dimensional integral for which there is no closed-form solution 
and computation is difficult. This problem did not arise for the preceding logit 
models because for those models, the distribution of $\varepsilon$ is such that (18.8) has a 
closed-form solution. 


When there are few alternatives, say, three or four, or when $\Sigma = \sigma^2 I$, 
quadrature methods can be used to numerically compute the integral. Otherwise, 
maximum simulated likelihood (MSL), discussed below, is used. 


Regardless of the method used, not all (m + 1)m/2 distinct entries in the error 
variance matrix, $\Sigma$, are identified. From (18.8), the model is defined for m - 1 error 
differences $(\varepsilon_{ik} - \varepsilon_{ij})$ with an (m - 1) × (m - 1) variance matrix that has 
m(m - 1)/2 unique terms. Because a variance term also needs to be normalized, 
there are only {m(m - 1)/2} - 1 unique terms in $\Sigma$. In practice, further 
restrictions are often placed on $\Sigma$ because otherwise $\Sigma$ is imprecisely estimated, 
which can lead to imprecise estimation of $\beta$ and $\gamma$. 


18.7.2 The mprobit command 


The mprobit command is the analogue of mlogit. It applies to models with only 
case-specific regressors and assumes that the alternative errors are independent 
standard normal so that $\Sigma = I$. Here the (m - 1)-dimensional integral in (18.8) can 
be shown to reduce to a one-dimensional integral that can be approximated by using 
quadrature methods. 


There is little reason to use the mprobit command because the model is 
qualitatively similar to MNL; mprobit assumes that alternative-specific errors in the 
ARUM are uncorrelated, but it is much more computationally burdensome. The 
syntax for mprobit is similar to that for mlogit. For a regression with the 
alternative-invariant regressor income, the command is 


. * MNP with independent errors and alternative-invariant regressors 
. mprobit mode income, baseoutcome(1) 


(output omitted ) 


The output is qualitatively similar to that from mlogit, though parameter estimates 
are scaled differently, as in the binary model case. The fitted log likelihood is 
-1,477.8, very close to the -1,477.2 for MNL (see section 18.4.2). 


18.7.3 Maximum simulated likelihood 


The multinomial log likelihood is given in (18.2), where $p_{ij} = F_j(x_i, z_i, \theta)$ and the 
parameters $\theta$ are $\beta$, $\gamma_1, \ldots, \gamma_m$ (with one $\gamma$ normalized to zero) and any 
unspecified entries in $\Sigma$. 


Because there is no closed-form solution for $F_j(x_i, z_i, \theta)$ in (18.8), the log 
likelihood is approximated by a simulator, $\widehat{F}_j(x_i, z_i, \theta)$, that is based on S draws. A 
simple example is a frequency simulator that, given the current estimate $\widehat{\theta}$, takes S 
draws of $\varepsilon_i \sim N(0, \widehat{\Sigma})$ and lets $\widehat{F}_j(x_i, z_i, \widehat{\theta})$ be the proportion of the S draws for 
which $\varepsilon_{ik} - \varepsilon_{ij} \le (x_{ij} - x_{ik})'\widehat{\beta} + z_i'(\widehat{\gamma}_j - \widehat{\gamma}_k)$ for all k. This simulator is 
inadequate, however, because it is very noisy for low-probability events, and for the 
MNP model, the frequency simulator is nonsmooth in $\beta$ and $\gamma_1, \ldots, \gamma_m$ so that very 
small changes in these parameters may lead to no change in $\widehat{F}_j(x_i, z_i, \theta)$. Instead, 
the Geweke-Hajivassiliou-Keane simulator, described, for example, in 
Train (2009), is used. 
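

The frequency simulator can be sketched in a few lines of Mata. The values of Sigma and of the systematic utilities V are hypothetical, chosen only to illustrate the mechanics for one individual with m = 3 alternatives. This is the crude simulator described above, not the Geweke-Hajivassiliou-Keane simulator used in practice. 

```stata
mata:
rseed(10101)
S     = 1000                                  // number of draws
// Hypothetical error variance matrix and systematic utilities (m = 3)
Sigma = (1, 0, 0 \ 0, 1, 0.3 \ 0, 0.3, 2)
V     = (0.2, 0.5, 0.4)
// Each row of E is one draw from N(0, Sigma), using Sigma = L*L'
E = rnormal(S, 3, 0, 1) * cholesky(Sigma)'
// Utilities for each draw and indicator of the utility-maximizing alternative
U = V :+ E
chosen = (U :== rowmax(U))
// Frequency simulator: share of draws in which each alternative is chosen
Fhat = mean(chosen)
Fhat
end
```

Because each draw contributes either 0 or 1, a tiny change in V typically leaves Fhat unchanged, which is the nonsmoothness noted in the text. 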


The MSL estimator maximizes 


$$\ln \widehat{L}(\theta) = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln \widehat{F}_j(x_i, z_i, \theta) \qquad (18.9)$$ 


The usual ML asymptotic theory applies, provided that both $S \to \infty$ and $N \to \infty$, 
and $\sqrt{N}/S \to 0$ so that the number of simulations increases at a rate faster than 
$\sqrt{N}$. Even though default standard errors are fine for a correctly specified 
multinomial model, robust standard errors are numerically better when MSL is used. 


The MSL estimator can, in principle, be applied to any estimation problem that 
entails an unknown integral. Some general results are the following: Smooth 
simulators should be used. Even then, some simulators are much better than others, 
but this is model specific. When random draws are used, they should be based on 
the same underlying uniform seed at each iteration because otherwise the gradient 
method may fail to converge simply because of different random draws (called 
chatter). The number of simulations may be greatly reduced for a given level of 
accuracy by using antithetic draws, rather than independent draws, and by using 
quasi-random-number sequences such as Halton sequences rather than uniform 
pseudorandom draws to generate uniform numbers. The benefits of using Halton 
and Hammersley sequences rather than uniform pseudorandom draws are explained in 
Drukker and Gates (2006). And to reduce the computational burden of gradient methods, it is 
best to at least use analytical first derivatives. For more explanation, see, for 
example, Train (2009) or Cameron and Trivedi (2005, chap. 15). The cmmprobit 
command incorporates all of these considerations to obtain the MSL estimator for the 
MNP model. 
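

The following Mata sketch illustrates the three kinds of draws on a deliberately simple problem, simulating E{exp(z)} for standard normal z, whose true value is exp(0.5). The target function, the seed, and the number of draws are assumptions made for illustration and are unrelated to the internals of cmmprobit. 

```stata
mata:
rseed(10101)                           // fix the seed so draws are reproducible
S  = 200
zp = invnormal(runiform(S, 1))         // pseudorandom N(0,1) draws
za = zp \ -zp                          // antithetic pairs built from the same draws
zh = invnormal(halton(S, 1))           // N(0,1) draws from a Halton sequence
// Simulation estimates of E{exp(z)} followed by the true value exp(0.5)
(mean(exp(zp)), mean(exp(za)), mean(exp(zh)), exp(0.5))
end
```

Rerunning with different seeds changes the pseudorandom and antithetic estimates, while the Halton-based estimate is deterministic for a given S. 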


18.7.4 The cmmprobit command 


The cmmprobit command requires data to be in long form, like the cmclogit 
command, and it has similar syntax: 


cmmprobit depvar [indepvars] [if] [in] [weight] [, options] 


Estimation takes a long time because estimation is by MSL. 


Several of the command's options are used to specify the error variance matrix 
$\Sigma$. As already noted, at most {m(m - 1)/2} - 1 unique terms in $\Sigma$ are identified. 
The default identification method is to drop the row and column of $\Sigma$ corresponding 
to the first alternative (except that $\Sigma_{11}$ is normalized to 1) and to set $\Sigma_{22} = 1$. 
These defaults can be changed by using the basealternative() and 
scalealternative() options. The correlation() and stddev() options are used 
to place further structure on the remaining off-diagonal and diagonal entries of $\Sigma$. 
The correlation(unstructured) option places no structure, the 
correlation(exchangeable) option imposes equicorrelation, the 
correlation(independent) option sets $\Sigma_{jk} = 0$ for all $j \ne k$, and the 
correlation(pattern) and correlation(fixed) options allow manual 
specification of the structure. The stddev(homoskedastic) option imposes $\Sigma_{jj} = 1$, 
the stddev(heteroskedastic) option allows $\Sigma_{jj} \ne 1$, and the stddev(pattern) 
and stddev(fixed) options allow manual specification of any structure. 


Other options allow variations in MSL computations. intmethod() specifies 
whether the uniform numbers are from a Hammersley sequence 
(intmethod(hammersley)), the default, from a Halton sequence 
(intmethod(halton)), or from uniform pseudorandom draws 
(intmethod(random)). The intpoints(S) option sets the number of integration 
points to S. By default, $S = 500 + 2.5\sqrt{N\{\ln(k + 5) + v\}}$, where k is the number 
of coefficients and v is the number of covariance parameters. S is set to twice this 
value in the case of intmethod(random). In this latter case, one should additionally 
use the intseed() option, which sets the random-number generator seed. The 
antithetics option specifies antithetic draws to be used, which improves 
accuracy. The favor(speed|space) option leads to favoring speed (the default) or 
space when generating the integration points. Other integration options are 
intburn(), intseed(), nopivot, and initbhhh(). 


Consistent MSL estimation requires that the number of evaluation points go to 
infinity. Final published work should set intpoints(#) to a value much higher than 
the default value. 


18.7.5 Application of the cmmprobit command 


For simplicity, we restrict attention to a choice between three alternatives: fishing 
from a beach, pier, or private boat. The most general model with unstructured 
correlation and heteroskedastic errors is used. We use the structural option 
because then the variance parameter estimates are reported for the m × m error 
variance matrix $\Sigma$ rather than the (m - 1) × (m - 1) variance matrix of the 
difference in errors. We have 


. * Multinomial probit with unstructured errors when charter is dropped 
. qui use mus218hklong, clear 

. cmset id fishmode 
Case ID variable: id 
Alternatives variable: fishmode 

. drop if fishmode == "charter" | mode == 
(2,538 observations deleted) 

. cmmprobit d p q, casevars(income) correlation(unstructured) structural 
> vce(robust) nolog 
note: variable p has 106 cases that are not alternative-specific; there is no 
      within-case variability. 

Multinomial probit choice model               Number of obs      = 
Case ID variable: id                          Number of cases    = 

Alternatives variable: fishmode               Alts per case: min = 
                                                             avg = 
                                                             max = 
Integration sequence:      Hammersley 
Integration points:               642         Wald chi2(4)       = 
Log simulated-pseudolikelihood = -482.29432   Prob > chi2        = 

                                 (Std. err. adjusted for clustering on id) 

                             Robust 
         d   Coefficient   std. err.       z    P>|z|     [95% conf. interval] 

fishmode 
         p    -.0233365     .011336    -2.06   0.040    -.0455547   -.0011184 
         q     1.398374    .5362936     2.61   0.009     .3472577     2.44949 

beach          (base alternative) 

pier 
    income    -.0979907    .0413102    -2.37   0.018    -.1789572   -.0170242 
     _cons     .7547783     .201365     3.75   0.000     .3601101    1.149447 

private 
    income     .0412734    .0735513     0.56   0.575    -.1028846    .1854314 
     _cons     .6605541    .2757544     2.40   0.017     .1200854    1.201023 

 /lnsigma3      .403742    .4964333     0.81   0.416    -.5692495    1.376733 
/atanhr3_2     .1757771    .2335785     0.75   0.452    -.2820283    .6335826 

    sigma1            1  (base alternative) 
    sigma2            1  (scale alternative) 
    sigma3     1.497418     .743368                        .56595    3.961939 
    rho3_2     .1739889    .2265076                     -.2747813    .5605142 

(fishmode=beach is the alternative normalizing location) 


(fishmode=pier is the alternative normalizing scale) 


As expected, utility is decreasing in price and increasing in quality (catch rate). 
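

The number of integration points reported in the output can be compared with the default formula of section 18.7.4. The sketch below assumes N = 730 cases (the number remaining after charter is dropped), k = 6 coefficients, and v = 2 covariance parameters; these inputs are our reading of the output rather than values documented for the formula. The refit with intpoints(2000) is an illustrative choice of a larger value, not a recommendation from the text. 

```stata
* Default number of integration points implied by the formula,
* assuming N = 730, k = 6, and v = 2
display 500 + 2.5*sqrt(730*(ln(6 + 5) + 2))

* Sketch: refit with many more integration points to check that
* the estimates are not sensitive to simulation error
cmmprobit d p q, casevars(income) correlation(unstructured) structural ///
    vce(robust) nolog intpoints(2000)
```

Under these assumptions, the formula evaluates to roughly 641.7, in line with the 642 integration points reported. Estimates that change noticeably when intpoints() is increased suggest that the default number of points is too low. 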


The base mode was automatically set to the first alternative, beach, so that the 
first row and column of $\Sigma$ are set to 0, except $\Sigma_{11} = 1$. One additional variance 
restriction is needed, and here that is on the error variance of the second alternative, 
pier, with $\Sigma_{22} = 1$ (the alternative normalizing scale). With m = 3, there are 
(3 × 2)/2 - 1 = 2 free entries in $\Sigma$: one error variance parameter, $\Sigma_{33}$, and one 
correlation, $\rho_{32} = \mathrm{Cor}(\varepsilon_{i3}, \varepsilon_{i2})$. The sigma3 output is $\sqrt{\Sigma_{33}}$, and the rho3_2 output 
is $\rho_{32}$. 


The estat covariance and estat correlation commands list the complete 
estimated variance matrix, $\widehat{\Sigma}$, and the associated correlation matrix. We have 


. * Show correlations and covariance 
. estat correlation 

                beach      pier   private 
     beach     1.0000 
      pier     0.0000    1.0000 
   private     0.0000    0.1740    1.0000 

. estat covariance 

                beach      pier   private 
     beach          1 
      pier          0         1 
   private          0   .260534  2.242259 
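

The link between the reported ancillary parameters and these matrices can be checked by hand. The sketch below simply applies the exponential and hyperbolic tangent transformations implied by the parameter names lnsigma3 and atanhr3_2 to the estimates in the cmmprobit output. 

```stata
* sigma3 and rho3_2 are transformations of the estimated ancillary parameters
display exp(.403742)             // sigma3 = exp(lnsigma3)
display tanh(.1757771)           // rho3_2 = tanh(atanhr3_2)

* Implied entries of the structural covariance matrix
display 1.497418^2               // variance of the private error
display .1739889*1.497418        // covariance of private and pier errors (sd of pier = 1)
```

Up to rounding, these reproduce the reported sigma3 and rho3_2 and the private-row entries of the covariance matrix. 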


If the parameters of the model are instead estimated without the structural 
option, the same parameter estimates are obtained, aside from estimation error, but 
the covariances and correlation are given for the variance matrix of the bivariate 
distribution of $\varepsilon_{i2} - \varepsilon_{i1}$ and $\varepsilon_{i3} - \varepsilon_{i1}$. 


18.7.6 Predicted probabilities and MEs 


The predict postestimation command with the default pr option predicts $p_{ij}$, and 
predictive margins and MEs are obtained using the margins command. The 
commands are similar to those after cmclogit; see sections 18.5.7 and 18.5.8. 


18.7.7 Rank-ordered probit 


The rank-ordered probit model fits the MNP model when alternatives are rank 
ordered, rather than the more common case where only the most preferred 
alternative is known. 


Then one can fit an m-alternative MNP model for the highest-ranked alternative, 
an (m - 1)-alternative MNP model for the second-ranked alternative, and so on. The 
cmroprobit command implements this procedure. The additional ranking data may 
lead to more precise estimates, though it is being assumed that the same underlying 
parametric model applies for the preferred alternative, the second preferred 
alternative, and so on. 


18.8 Alternative-specific random-parameters logit 


The alternative-specific random-parameters logit (RPL) model, or mixed logit 
model, relaxes the IIA assumption by allowing the coefficients of 
alternative-specific regressors in the CL model to be normally distributed or 
lognormally distributed. Here we estimate the parameters of the models by using 
individual-level data. 


Quite different estimation procedures are used if the data are grouped, 
such as market share data; see section 18.8.4. 


18.8.1 Alternative-specific RPL 


The alternative-specific RPL model, or mixed logit model, is obtained from 
the ARUM of section 18.2.5 by assuming that the errors $\varepsilon_{ij}$ are type I 
extreme-value distributed, as for the CL model, and the parameters $\beta$ are 
normally distributed. Then the utility of alternative j is 


$$U_{ij} = x_{ij}'\beta_i + z_i'\gamma_j + \varepsilon_{ij} = x_{ij}'\beta + z_i'\gamma_j + x_{ij}'v_i + \varepsilon_{ij}$$ 


where $\beta_i = \beta + v_i$ and $v_i \sim N(0, \Sigma_\beta)$. The combined error 
$(x_{ij}'v_i + \varepsilon_{ij})$ is now correlated across alternatives, whereas the errors $\varepsilon_{ij}$ 
alone were not. 


Then, conditional on the unobservables $v_i$, we have a CL model with 


$$p_{ij} \mid v_i = \frac{\exp(x_{ij}'\beta + z_i'\gamma_j + x_{ij}'v_i)}{\sum_{l=1}^{m} \exp(x_{il}'\beta + z_i'\gamma_l + x_{il}'v_i)}$$ 


The MLE is based on $p_{ij}$, which also requires integrating out $v_i$, a 
high-dimensional integral. 


The MSL estimator instead maximizes (18.9), where $\widehat{F}_j(x_i, z_i, \theta)$ is a 
simulator for $p_{ij}$. Here the frequency simulator that makes many draws of $v_i$ 
from the normal given current estimates of $\Sigma_\beta$ is a smooth simulator. 
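

A minimal Mata sketch of this simulator for one individual follows. The mean and standard deviation of the random price coefficient and the prices are hypothetical values chosen for illustration, and the model is reduced to a single alternative-specific regressor with no case-specific terms. 

```stata
mata:
rseed(10101)
S     = 500                          // number of draws of v_i
b     = -0.1                         // hypothetical mean of the price coefficient
sd    = 0.06                         // hypothetical sd of the price coefficient
price = (80 \ 60 \ 100)              // hypothetical prices of m = 3 alternatives
// Draw v_i and form the random coefficient b + v_i for each draw (1 x S)
v  = rnormal(1, S, 0, sd)
bi = b :+ v
// m x S matrix of indexes x_ij*(b + v_i) and CL probabilities for each draw
XB = price * bi
P  = exp(XB) :/ colsum(exp(XB))
// Simulated probabilities: average the CL probabilities over the S draws
phat = mean(P')
phat
end
```

Each draw yields a full set of CL probabilities that vary continuously with the parameters, which is why averaging them gives a smooth simulator. 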


18.8.2 The cmmixlogit command 


The cmmixlogit command computes the MSL estimator. This supplants the 
community-contributed mixlogit command (Hole 2007) presented in the 
first edition of this book. The data first need to be cmset to provide 
identifiers for each case (individual) and each alternative. The syntax of the 
cmmixlogit command is 


cmmixlogit depvar [indepvars] [if] [in] [weight] [, options] 


which is similar to that for cmmprobit. Alternative-specific regressors with 
nonrandom coefficients are listed as indepvars. Key options are casevars() 
for case-specific variables and random() to identify alternative-specific 
variables with random coefficients. The default is for the random 
coefficients to be independent normally distributed. Other options allow for 
random coefficients that are correlated normal, lognormal, truncated normal, 
uniform, or triangular distributed. 


The options for the MSL computations are similar to those for the 
cmmprobit command detailed in section 18.7.4. Consistent MSL estimation 
requires that the number of evaluation points go to infinity. Final published 
work should set intpoints() to a value much higher than the default value. 


18.8.3 Application of the cmmixlogit command 


We fit the same three-choice model as that used in section 18.7.5 for the MNP 
model, with charter fishing dropped. 


The parameters for p are specified to be random, using the random()
option. All other parameters are specified to be fixed; the alternative-specific
variable q appears in the initial variable list, while the case-specific variable
income appears in the casevars() option. We have


. * Alternative-specific mixed logit or random parameters logit estimation
. qui use mus218hklong, clear

. drop if fishmode == "charter" | mode == 4
(2,538 observations deleted)

. cmset id fishmode               // caseidvar casevars
Case ID variable: id
Alternatives variable: fishmode

. cmmixlogit d q, casevars(income) random(p) basealternative(pier)
>     vce(robust) nolog


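Mixed logit choice model                       Number of obs      =      2,190
Case ID variable: id                           Number of cases    =        730

Alternatives variable: fishmode                Alts per case: min =          3
                                                              avg =        3.0
                                                              max =          3
Integration sequence:      Hammersley
Integration points:               613             Wald chi2(4)    =      28.40
Log simulated-pseudolikelihood = -433.92078       Prob > chi2     =     0.0000

                                     (Std. err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |               Robust
           d | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
fishmode     |
           q |   .8633073   .8872554     0.97   0.331    -.8756813    2.602296
           p |   -.107416   .0287078    -3.74   0.000    -.1636823   -.0511497
-------------+----------------------------------------------------------------
/Normal      |
       sd(p) |   .0595192   .0187898                      .0320582    .1105035
-------------+----------------------------------------------------------------
beach        |
      income |   .1203331   .0519823     2.31   0.021     .0184497    .2222165
       _cons |  -.7802862   .2304865    -3.39   0.001    -1.232031    -.328541
-------------+----------------------------------------------------------------
pier         |  (base alternative)
-------------+----------------------------------------------------------------
private      |
      income |   .1733836   .0773131     2.24   0.025     .0218526    .3249146
       _cons |  -.2199922    .318053    -0.69   0.489    -.8433647    .4033802
------------------------------------------------------------------------------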


There is considerable variation across individuals in the effect of price. The 
random coefficients have a mean of -0.1074 and a standard deviation of
0.0595, both statistically significant at the 0.05 level. 


The preceding command used robust standard errors. If instead default
standard errors are used, the output includes an LR test against a CL model
that restricts the variance of $\beta_{pi}$ to equal zero and strongly rejects this
simpler model. From output not included, the simpler CL model had a log
likelihood of -466.8 compared with -433.9 for the RPL model. This leads
to an LR test statistic of 2 x [-433.9 - (-466.8)] = 65.8.
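
As a quick check of this arithmetic, the following sketch recomputes the LR statistic from the two rounded log likelihoods quoted in the text and evaluates an upper-tail chi-squared probability. The scalar name lrstat is introduced here for illustration, and treating the test as having one degree of freedom (the single restriction that the standard deviation of the price coefficient is zero) is an assumption of the sketch.

```stata
* LR statistic from the two reported log likelihoods (RPL: -433.9, CL: -466.8)
scalar lrstat = 2*(-433.9 - (-466.8))
display "LR statistic    = " lrstat
* Upper-tail chi-squared(1) probability for one restriction, sd(p) = 0
display "chi2(1) p-value = " chi2tail(1, lrstat)
```

Because a zero variance lies on the boundary of the parameter space, the chi-squared(1) p-value is, if anything, conservative.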


MEs can be obtained using the margins command. For example, with 
respect to price changes, we find 


. * AMEs with respect to price
. margins, dydx(p)

Average marginal effects                                 Number of obs = 2,190
Model VCE: Robust

Expression: Pr(fishmode), predict()
dy/dx wrt:  p

---------------------------------------------------------------------------------
                |            Delta-method
                |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
----------------+----------------------------------------------------------------
p               |
_outcome#       |
fishmode        |
    beach#beach |  -.0122611   .0032902    -3.73   0.000    -.0187099   -.0058123
     beach#pier |   .0097067   .0028752     3.38   0.001     .0040714    .0153421
  beach#private |   .0025544   .0004443     5.75   0.000     .0016835    .0034252
     pier#beach |   .0097067   .0028752     3.38   0.001     .0040714    .0153421
      pier#pier |  -.0131526   .0033724    -3.90   0.000    -.0197624   -.0065428
   pier#private |   .0034458   .0005343     6.45   0.000     .0023986    .0044931
  private#beach |   .0025544   .0004443     5.75   0.000     .0016835    .0034252
   private#pier |   .0034458   .0005343     6.45   0.000     .0023986    .0044931
private#private |  -.0060002   .0009187    -6.53   0.000    -.0078007   -.0041996
---------------------------------------------------------------------------------


18.8.4 Berry-Levinsohn-Pakes market demand model


An extension of the random parameters logit model to market share data is 
the BLP model, named after its creators (Berry, Levinsohn, and Pakes 1995). 


This model not only allows random parameters but also allows prices to
be endogenous. The method is computationally burdensome, requiring
Monte Carlo integration and a contraction mapping to yield a generalized
method of moments objective function. Furthermore, the method can yield
multiple optima, and it can be difficult to find the optimum that minimizes
the objective function; see Knittel and Metaxoglou (2014).


The community-contributed blp command provides estimates of the BLP
model; see Vincent (2015) for further details.


18.9 Ordered outcome models 


In some cases, categorical data are naturally ordered. An example is health 
status that is self-assessed as poor, fair, good, or excellent. The two standard 
models for such data are the ordered logit and ordered probit models. 


18.9.1 Data summary 


We use data from the Rand Health Insurance Experiment, described in detail 
in section 22.3. We use one year of this panel, so the data are cross-sectional 
data. 


The ordered outcome we consider is health status that is poor or fair 
(y = 1), good (y = 2), or excellent (y = 3). This variable needs to be 
constructed from several binary outcomes for each of the health statuses. 
The categories poor and fair are combined because only 1.5% of the sample 
reports poor health. The data are constructed as follows: 


. * Create multinomial ordered outcome variables that take values y = 1, 2, 3
. qui use mus218rhie, clear
. qui keep if year == 2
. generate hlthpf = hlthp + hlthf
. generate hlthe = (1 - hlthpf - hlthg)
. qui generate hlthstat = 1 if hlthpf == 1
. qui replace hlthstat = 2 if hlthg == 1
. qui replace hlthstat = 3 if hlthe == 1
. label variable hlthstat "Health status"
. label define hsvalue 1 "Poor or fair" 2 "Good" 3 "Excellent"
. label values hlthstat hsvalue
. tabulate hlthstat

Health status |      Freq.     Percent        Cum.
--------------+-----------------------------------
 Poor or fair |        523        9.38        9.38
         Good |      2,034       36.49       45.87
    Excellent |      3,017       54.13      100.00
--------------+-----------------------------------
        Total |      5,574      100.00


Health status is poor or fair for roughly 10% of the sample, good for 35%, 
and excellent for 55%. 


The regressors considered are age in years (age), log annual family
income (linc), and number of chronic diseases (ndisease). Summary
statistics are


. * Summarize dependent and explanatory variables
. summarize hlthstat age linc ndisease

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    hlthstat |      5,574    2.447435     .659524          1          3
         age |      5,574    25.57613    16.73011   .0253251   63.27515
        linc |      5,574    8.696929    1.220592          0   10.28324
    ndisease |      5,574    11.20526    6.788959          0       58.6


The sample is of children and adults but not the elderly.


18.9.2 Ordered outcomes


The ordered outcomes are modeled to arise sequentially as a latent variable,
$y^*$, crosses progressively higher thresholds. In the current example, $y^*$ is an
unobserved measure of healthiness. For individual $i$, we specify

$$
y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + u_i \qquad (18.10)
$$

where a normalization is that the regressors $\mathbf{x}$ do not include an intercept.
For very low $y^*$, health status is poor or fair; for $y^* > \alpha_1$, health status
improves to good; for $y^* > \alpha_2$, it improves further to excellent; and so on, if
there were additional categories.


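For an $m$-alternative ordered model, we define

$$
y_i = j \quad \text{if } \alpha_{j-1} < y_i^* \le \alpha_j, \qquad j = 1, \dots, m
$$

where $\alpha_0 = -\infty$ and $\alpha_m = \infty$. Then,

$$
\begin{aligned}
\Pr(y_i = j) &= \Pr(\alpha_{j-1} < y_i^* \le \alpha_j) \\
             &= \Pr(\alpha_{j-1} < \mathbf{x}_i'\boldsymbol{\beta} + u_i \le \alpha_j) \\
             &= \Pr(\alpha_{j-1} - \mathbf{x}_i'\boldsymbol{\beta} < u_i \le \alpha_j - \mathbf{x}_i'\boldsymbol{\beta}) \\
             &= F(\alpha_j - \mathbf{x}_i'\boldsymbol{\beta}) - F(\alpha_{j-1} - \mathbf{x}_i'\boldsymbol{\beta})
\end{aligned}
$$

where $F$ is the cumulative distribution function of $u_i$. The regression
parameters, $\boldsymbol{\beta}$, and the $m - 1$ threshold parameters, $\alpha_1, \dots, \alpha_{m-1}$, are
obtained by maximizing the log likelihood with $p_{ij} = \Pr(y_i = j)$ as defined
above. Stata excludes an intercept from the regressors. If instead an intercept
is estimated, then only $m - 2$ threshold parameters are identified.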


For the ordered logit model, $u$ is logistically distributed with
$F(z) = e^z/(1 + e^z)$. For the ordered probit model, $u$ is standard normally
distributed with $F(\cdot) = \Phi(\cdot)$, the standard normal cumulative distribution
function.


The sign of the regression parameters, $\boldsymbol{\beta}$, can be immediately interpreted
as determining whether the latent variable, $y^*$, increases with the regressor.
If $\beta_j$ is positive, then an increase in $x_{ij}$ necessarily decreases the probability
of being in the lowest category ($y_i = 1$) and increases the probability of
being in the highest category ($y_i = m$).


18.9.3 Application of the ologit command 


The parameters of the ordered logit model are estimated by using the ologit 
command, which has syntax essentially the same as mlogit: 


ologit depvar [indepvars] [if] [in] [weight] [, options]


Application of this command yields 


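. * Ordered logit estimates
. ologit hlthstat age linc ndisease, nolog vce(robust)

Ordered logistic regression                             Number of obs =  5,574
                                                        Wald chi2(3)  = 665.16
                                                        Prob > chi2   = 0.0000
Log pseudolikelihood = -4769.8525                       Pseudo R2     = 0.0720

------------------------------------------------------------------------------
             |               Robust
    hlthstat | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0292944   .0016646   -17.60   0.000    -.0325568   -.0260319
        linc |   .2836537   .0280031    10.13   0.000     .2287687    .3385387
    ndisease |  -.0549905   .0041023   -13.40   0.000    -.0630308   -.0469503
-------------+----------------------------------------------------------------
       /cut1 |   -1.39598   .2505042                     -1.886959   -.9050004
       /cut2 |   .9513097   .2500833                      .4611555    1.441464
------------------------------------------------------------------------------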


The latent health-status variable is increasing in income and decreasing with 
age and number of chronic diseases, as expected. The regressors are highly 
statistically significant. The threshold parameters appear to be statistically 
significantly different from each other, so the three categories should not be 
collapsed into two categories. 


18.9.4 Predicted probabilities 


Predicted probabilities for each of the three outcomes can be obtained by
using the pr option of the predict command. For comparison, we also compute
the sample frequencies of each outcome.


. * Calculate predicted probability that y=1, 2, or 3 for each person
. predict p1ologit p2ologit p3ologit, pr

. summarize hlthpf hlthg hlthe p1ologit p2ologit p3ologit, separator(0)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      hlthpf |      5,574    .0938285    .2916161          0          1
       hlthg |      5,574    .3649085    .4814477          0          1
       hlthe |      5,574     .541263    .4983392          0          1
    p1ologit |      5,574    .0946903    .0843148   .0233629    .859022
    p2ologit |      5,574    .3651672    .0946158   .1255265   .5276064
    p3ologit |      5,574    .5401425    .1640575   .0154515   .7999009


The average predicted probabilities are within 0.01 of the sample 
frequencies for each outcome. 
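
The same probabilities can be rebuilt by hand from the formula of section 18.9.2, which makes the role of the two cutpoints explicit. The sketch below assumes that the ologit results above are still the active estimation results; the variable names xbhat, p1check, p2check, and p3check are introduced here for illustration only.

```stata
* Assumes the ologit estimates above are still the active estimation results
predict double xbhat, xb                     // linear index x'b (no intercept)
* Pr(y=1) = F(cut1 - x'b), with F the logistic cdf
generate double p1check = invlogit(_b[/cut1] - xbhat)
* Pr(y=2) = F(cut2 - x'b) - F(cut1 - x'b)
generate double p2check = invlogit(_b[/cut2] - xbhat) - invlogit(_b[/cut1] - xbhat)
* Pr(y=3) = 1 - F(cut2 - x'b)
generate double p3check = 1 - invlogit(_b[/cut2] - xbhat)
summarize p1check p2check p3check, separator(0)
```

The three constructed variables should sum to one for each observation and should agree with the output of predict with the pr option up to storage precision.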


18.9.5 MEs 


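The ME on the probability of choosing alternative $j$ when regressor $x_r$
changes is given by

$$
\frac{\partial \Pr(y_i = j)}{\partial x_{ri}} = \{F'(\alpha_{j-1} - \mathbf{x}_i'\boldsymbol{\beta}) - F'(\alpha_j - \mathbf{x}_i'\boldsymbol{\beta})\}\beta_r
$$

If one coefficient is twice as big as another, then so too is the size of the ME.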


We use the margins command with the atmeans option to obtain the MEM 
for the third outcome (health status excellent). We obtain 


. * MEM for third outcome (health status excellent)
. margins, dydx(*) predict(outcome(3)) atmeans noatlegend

Conditional marginal effects                             Number of obs = 5,574
Model VCE: Robust

Expression: Pr(hlthstat==3), predict(outcome(3))
dy/dx wrt:  age linc ndisease

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0072824   .0004139   -17.59   0.000    -.0080937   -.0064712
        linc |    .070515   .0069821    10.10   0.000     .0568304    .0841997
    ndisease |  -.0136704   .0010203   -13.40   0.000    -.0156701   -.0116707
------------------------------------------------------------------------------


For the average individual, the probability of excellent health decreases with 
age or increase in diseases and increases as income increases. 
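
The proportionality between MEs and coefficients can be checked against the reported numbers. The display below is a sketch for the highest category, $j = m = 3$, using the logistic density $F'(z) = F(z)\{1 - F(z)\}$; the bar on $\bar{\mathbf{x}}$ denoting the sample means is notation introduced here, and the ratios are computed from the rounded estimates printed above.

```latex
\frac{\partial \Pr(y = 3)}{\partial x_r}\bigg|_{\bar{\mathbf{x}}}
  = F'(\alpha_2 - \bar{\mathbf{x}}'\boldsymbol{\beta})\,\beta_r
  \quad\text{because } F'(\alpha_3 - \bar{\mathbf{x}}'\boldsymbol{\beta}) = 0 \text{ when } \alpha_3 = \infty,
\qquad
\frac{-0.0072824}{-0.0292944} \approx \frac{0.070515}{0.2836537} \approx \frac{-0.0136704}{-0.0549905} \approx 0.2486
```

The common ratio of about 0.2486 is the estimated logistic density evaluated at the means, which is why the three MEMs carry the same signs and relative magnitudes as the three coefficients.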


The AME can be computed by dropping the atmeans option. 


18.9.6 Other ordered models 


The parameters of the ordered probit model are estimated by using the
oprobit command. The command syntax and output are essentially the same
as for ordered logit, except that coefficient estimates are scaled differently.
Application to the data here yields t statistics and log likelihoods quite close
to those from ordered logit.


The hetoprobit command fits an ordered probit model where the error
term $u_i$ in (18.10) can be heteroskedastic. For example,


. * Heteroskedastic ordered probit
. hetoprobit hlthstat age linc ndisease, het(age linc ndisease)
(output omitted)


Compared with ordered probit, there was a substantial increase in the log
likelihood from -4,771.04 to -4,740.65, leading to strong support for the
heteroskedastic ordered probit model.
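
The strength of this support can be gauged with an LR test built from the two log likelihoods quoted in the text. The following sketch assumes three restrictions, one for each variable listed in het(); the scalar name lrhet is introduced here for illustration.

```stata
* LR statistic: heteroskedastic ordered probit versus ordered probit
scalar lrhet = 2*(-4740.65 - (-4771.04))
display "LR statistic    = " lrhet
* Three restrictions: the coefficients of age, linc, and ndisease in het()
display "chi2(3) p-value = " chi2tail(3, lrhet)
```

The statistic equals 60.78, far above conventional chi-squared(3) critical values.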


The zioprobit command fits a zero-inflated version of the ordered
probit model. It is used when a disproportionate fraction of the ordered
outcomes fall into the lowest category. The lowest category outcomes are
modeled as arising from both a probit model and an ordered probit model,
with regressors that may differ in the two models. The ziologit command
estimates a similar model for the ordered logit model. The community-
contributed gologit2 command (Williams 2006) fits a generalization of the
ordered logit model that allows the threshold parameters $\alpha_1, \dots, \alpha_{m-1}$ to
depend on regressors.


An alternative model is the MNL model. Although the MNL model has 
more parameters, the ordered logit model is not nested within the MNL. 
Estimator efficiency is another way of comparing the two approaches. An 
ordered estimator makes more assumptions than an MNL estimator. If these 
additional assumptions are true, the ordered estimator is more efficient than 
the MNL estimator. 


18.10 Clustered data 


Section 13.9 presented in some detail various models and methods that can 
be applied when observations in the same cluster are correlated while 
observations in different clusters are uncorrelated. That discussion was 
illustrated using the Poisson model. Here we provide a brief summary of 
adaptation to multinomial models. 


Unless a multinomial model explicitly accounts for within-cluster
correlation, the estimates of a multinomial model will be inconsistent given
clustering because the parametric model for $F_j(\mathbf{x}_i, \boldsymbol{\theta})$ will be misspecified.
For the simplest multinomial model with clustered data, one might
nonetheless view the model as a rough starting point for multinomial data,
analogous to using ordinary least squares for continuous data, and obtain
cluster-robust standard errors. For an unordered $m$-alternative random-
effects (RE) model, we can introduce $(m - 1)$ normally distributed cluster-
specific random intercepts $\alpha_{g2}, \dots, \alpha_{gm}$, where we normalize $\alpha_{g1} = 0$.
These are then integrated out using numerical methods. This RE estimator is
implemented by the xtmlogit and cmxtmixlogit commands.


For an unordered multinomial model that has cluster-specific fixed 
effects, the CL method for binary choice can be extended to the MNL model. 
The community-contributed femlogit command implements this estimator; 
see Pforr (2014). 


For ordered models based on a single latent variable crossing a
threshold, an RE model needs simply to introduce a single normally
distributed random intercept $\alpha_g$. This model can be fit using the xtologit
or xtoprobit command. The multilevel mixed-effects meologit and
meoprobit commands also fit this RE model and, additionally, can allow for
cluster-specific random-slope coefficients.


Consistency of resulting parameter estimates for all of these methods
requires the assumption that individual observations be independent within
cluster once cluster-specific effects are included. So even though there is a
vce(robust) option, if it needs to be used, then the parameter estimates are
most likely inconsistent.


18.11 Multivariate outcomes 


We consider the multinomial analog of the seemingly unrelated regression 
(SUR) model (see section 6.8), where two or more categorical outcomes are 
being modeled. 


In the simplest case, outcomes do not directly depend on each other, so
there is no simultaneity, but the errors for the outcomes may be correlated.
When the errors are correlated, a more efficient estimator that models the 
joint distribution of the errors is available. 


In more complicated cases, the outcomes depend directly on each other, 
so there is simultaneity. We do not cover this case, but analysis is much 
simpler if the simultaneity is in continuous latent variables rather than 
discrete outcome variables. 


18.11.1 Bivariate probit 


The bivariate probit model considers two binary outcomes. The outcomes 
are potentially related after conditioning on regressors. The relatedness 
occurs via correlation of the errors that appear in the index-function model 
formulation of the binary outcome model. 


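Specifically, the two outcomes are determined by two unobserved latent
variables,

$$
\begin{aligned}
y_1^* &= \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1 \\
y_2^* &= \mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_2
\end{aligned}
$$

where the errors $\varepsilon_1$ and $\varepsilon_2$ are jointly normally distributed with means of
0, variances of 1, and correlation $\rho$, and we observe the two binary
outcomes

$$
y_1 = \begin{cases} 1 & \text{if } y_1^* > 0 \\ 0 & \text{if } y_1^* \le 0 \end{cases}
\qquad\text{and}\qquad
y_2 = \begin{cases} 1 & \text{if } y_2^* > 0 \\ 0 & \text{if } y_2^* \le 0 \end{cases}
$$

The model collapses to two separate probit models for $y_1$ and $y_2$ if $\rho = 0$.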


There are four mutually exclusive outcomes that we can denote by $y_{10}$
(when $y_1 = 1$ and $y_2 = 0$), $y_{01}$, $y_{11}$, and $y_{00}$. The log-likelihood function is
derived using the expressions for these probabilities, and the parameters are
estimated by ML. There are two complications. First, there is no analytical
expression for the probabilities, because they depend on a one-dimensional
integral with no closed-form solution, but this is easily solved with
numerical quadrature methods for integration. Second, the resulting
expressions for $\Pr(y_1 = 1|\mathbf{x})$ and $\Pr(y_2 = 1|\mathbf{x})$ differ from those for binary
logit and probit.
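
For reference, the four cell probabilities can be written in terms of the standard bivariate normal cumulative distribution function. The symbol $\Phi_2(a, b; \rho)$ for that function, and the shorthand $a = \mathbf{x}_1'\boldsymbol{\beta}_1$ and $b = \mathbf{x}_2'\boldsymbol{\beta}_2$, are notation introduced here for this sketch.

```latex
\begin{aligned}
\Pr(y_1 = 1, y_2 = 1) &= \Phi_2(a, b; \rho) \\
\Pr(y_1 = 1, y_2 = 0) &= \Phi(a) - \Phi_2(a, b; \rho) \\
\Pr(y_1 = 0, y_2 = 1) &= \Phi(b) - \Phi_2(a, b; \rho) \\
\Pr(y_1 = 0, y_2 = 0) &= 1 - \Phi(a) - \Phi(b) + \Phi_2(a, b; \rho)
\end{aligned}
```

The four probabilities sum to one, and when $\rho = 0$ the joint probability factors as $\Phi_2(a, b; 0) = \Phi(a)\Phi(b)$.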


The simplest form of the biprobit command has the syntax

biprobit depvar1 depvar2 [indepvars] [if] [in] [weight] [, options]


This version assumes that the same regressors are used for both outcomes. A 
more general version allows the list of regressors to differ for the two 
outcomes. 


We consider two binary outcomes using the same dataset as that for 
ordered outcome models analyzed in section 18.9. The first outcome is the 
hlthe variable, which takes a value of 1 if self-assessed health is excellent 
and 0 otherwise. The second outcome is the dmdu variable, which equals 1 if 
the individual has visited the doctor in the past year and 0 otherwise. A data 
summary is 


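. * Two binary dependent variables: hlthe and dmdu
. tabulate hlthe dmdu

           |  Any MD visit; 1 if
           |        mdu>0
     hlthe |         0          1 |     Total
-----------+----------------------+----------
         0 |       826      1,731 |     2,557
         1 |     1,006      2,011 |     3,017
-----------+----------------------+----------
     Total |     1,832      3,742 |     5,574

. correlate hlthe dmdu
(obs=5,574)

             |    hlthe     dmdu
-------------+------------------
       hlthe |   1.0000
        dmdu |  -0.0110   1.0000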


The outcomes are very weakly negatively correlated, so in this case, there 
may be little need to model the two jointly. 


Bivariate probit model estimation yields the following estimates: 


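. * Bivariate probit estimates
. biprobit hlthe dmdu age linc ndisease, nolog vce(robust)

Bivariate probit regression                             Number of obs =  5,574
                                                        Wald chi2(6)  = 767.89
Log pseudolikelihood = -6958.0751                       Prob > chi2   = 0.0000

------------------------------------------------------------------------------
             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
hlthe        |
         age |  -.0178246   .0010632   -16.77   0.000    -.0199084   -.0157407
        linc |    .132468   .0164203     8.07   0.000     .1002847    .1646513
    ndisease |  -.0326656   .0027701   -11.79   0.000     -.038095   -.0272362
       _cons |  -.2297079   .1462127    -1.57   0.116    -.5162794    .0568637
-------------+----------------------------------------------------------------
dmdu         |
         age |   .0020038   .0010748     1.86   0.062    -.0001028    .0041103
        linc |   .1212519   .0155242     7.81   0.000      .090825    .1516788
    ndisease |   .0347111   .0029126    11.92   0.000     .0290025    .0404198
       _cons |  -1.032527   .1401069    -7.37   0.000    -1.307132   -.7579229
-------------+----------------------------------------------------------------
     /athrho |   .0282258    .022842     1.24   0.217    -.0165437    .0729953
-------------+----------------------------------------------------------------
         rho |   .0282183   .0228238                     -.0165422    .0728659
------------------------------------------------------------------------------
Wald test of rho=0: chi2(1) = 1.52695                     Prob > chi2 = 0.2166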


The hypothesis that $\rho = 0$ is not rejected, so in this case, bivariate probit was
not necessary. As might be expected, separate probit estimation for each
outcome (output not given) yields coefficients very similar to those given
above.


Predicted probabilities can be obtained. For example, the marginal
probability that $y_1 = 1$ can be obtained with the pmarg1 option, whereas the
joint probability that $(y_1, y_2) = (1, 1)$ is obtained with the p11 option. We
obtain


. * Predicted probabilities
. predict biprob1, pmarg1
. predict biprob2, pmarg2
. predict biprob11, p11
. predict biprob10, p10
. predict biprob01, p01
. predict biprob00, p00

. summarize hlthe dmdu biprob1 biprob2 biprob11 biprob10 biprob01 biprob00

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       hlthe |      5,574     .541263    .4983392          0          1
        dmdu |      5,574    .6713312    .4697715          0          1
     biprob1 |      5,574    .5414237    .1577588   .0156161   .7853771
     biprob2 |      5,574    .6716857    .0976294   .1589158   .9834746
    biprob11 |      5,574    .3610553    .0989285   .0090629   .5492701
-------------+---------------------------------------------------------
    biprob10 |      5,574    .1803685    .0765047   .0006476   .3680022
    biprob01 |      5,574    .3106305    .1434517   .1090853   .9385432
    biprob00 |      5,574    .1479458     .064902   .0158778   .6909308

The marginal probabilities that $y_1 = 1$ and $y_2 = 1$ are, respectively, 0.541
and 0.672, very close to the sample frequencies.
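
The joint and marginal probabilities can also be rebuilt from the two fitted indexes and the estimated correlation, which shows how the predictions relate to the model. This sketch assumes that the biprobit results above are still the active estimation results; the variable names xb1hat, xb2hat, p11check, and pm1check are introduced here for illustration.

```stata
* Assumes the biprobit estimates above are still the active estimation results
predict double xb1hat, xb1                  // fitted index for the hlthe equation
predict double xb2hat, xb2                  // fitted index for the dmdu equation
* Joint Pr(hlthe=1, dmdu=1) from the bivariate normal cdf with correlation rho
generate double p11check = binormal(xb1hat, xb2hat, e(rho))
* Marginal Pr(hlthe=1) from the univariate normal cdf
generate double pm1check = normal(xb1hat)
summarize p11check pm1check
```

The means of p11check and pm1check should match those reported for biprob11 and biprob1 up to storage precision.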


18.11.2 Nonlinear SUR 


An alternative model is to use the nlsur command for nonlinear SUR, where
the conditional mean of $y_1$ is $\Phi(\mathbf{x}_1'\boldsymbol{\beta}_1)$ and of $y_2$ is $\Phi(\mathbf{x}_2'\boldsymbol{\beta}_2)$. This estimator
does not control for the intrinsic heteroskedasticity of binary outcome data,
so we use the vce(robust) option to obtain standard errors that control for
both heteroskedasticity and correlation. We have


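. * Nonlinear seemingly unrelated regressions estimator
. nlsur (hlthe = normal({a1}*age+{a2}*linc+{a3}*ndisease+{a4}))
>       (dmdu = normal({b1}*age+{b2}*linc+{b3}*ndisease+{b4})),
>       vce(robust) nolog
(obs = 5,574)

Calculating NLS estimates...
Calculating FGNLS estimates...

FGNLS regression
---------------------------------------------------------------------
       Equation |       Obs   Parms       RMSE      R-sq     Constant
----------------+----------------------------------------------------
 1        hlthe |     5,574       4   .4727309    0.5871*      (none)
 2         dmdu |     5,574       4   .4595438    0.6854       (none)
---------------------------------------------------------------------
* Uncentered R-sq

------------------------------------------------------------------------------
             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         /a1 |  -.0173125   .0010624   -16.30   0.000    -.0193948   -.0152302
         /a2 |   .1486604   .0184521     8.06   0.000     .1124949    .1848259
         /a3 |  -.0333346   .0028682   -11.62   0.000    -.0389562    -.027713
         /a4 |  -.3790899   .1638203    -2.31   0.021    -.7001719   -.0580079
         /b1 |   .0018343   .0010776     1.70   0.089    -.0002778    .0039464
         /b2 |   .1270039   .0165602     7.67   0.000     .0945465    .1594614
         /b3 |   .0345088   .0030258    11.40   0.000     .0285783    .0404393
         /b4 |  -1.081392   .1496894    -7.22   0.000    -1.374778    -.788006
------------------------------------------------------------------------------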


For this example, the regression coefficients and standard errors are quite 
similar to those from biprobit. 


18.12 Additional resources 


The key models for initial understanding are the MNL and CL models. In 
practice, these models are often too restrictive. Stata commands cover most 
multinomial models. Train (2009) is an excellent source, especially for 
models that need to be fit by MSL or Bayesian methods. 


18.13 Exercises 


1. Consider the health-status multinomial example of section 18.9. Refit 
this as a MNL model using the mlogit command. Comment on the 
statistical significance of regressors. Obtain the MEs of changes in the 
regressors on the probability of excellent health for the MNL model, and 
compare these to those given in section 18.9.5 for the ordered logit 
model. Using the Bayesian information criterion, which model do you
prefer for these data: MNL or ordered logit?

2. Consider the CL example of section 18.5. Use mus218hk.dta, if 
necessary to create this file as in section 18.5.1. Drop the charter boat 
option as in section 18.7.5, using drop if fishmode=="charter" |
mode==4, so we have a three-choice model. Estimate the parameters of
a CL model with regressors p and q and income, using the cmclogit 
command. What are the MEs on the probability of private boat fishing 
of a $10 increase in the price of private boat fishing, a one-unit change 
in the catch rate from private boat fishing, and a $1,000 increase in 
monthly income? Which model fits these data better: the CL model of
this question or the MNP model of section 18.7?

3. Continue the previous question, a three-choice model for fishing mode. 
Estimate the parameters of the model by NL, with errors for the utility 
of pier and beach fishing correlated with each other and uncorrelated 
with the error for the utility of private boat fishing. Obtain the ME of a 
change in the price of private boat fishing, adapting the example of 
section 18.6.6. 

4. Continue with the fishing data example of section 18.5, with beach as 
the base alternative and income as the case variable. Instead of fitting 
an alternative-specific conditional logit model, fit the random- 
parameter logit model of section 18.8 using the cmmixlogit command. 
Is this model a significant improvement on the fixed parameter model? 
Use the margins command to obtain the ME of income on the choice of
fishing mode. Compare the role of income in this model with that in 
the model of section 18.5. 

5. Consider the health-status multinomial example of section 18.9.
Estimate the parameters of this model as an ordered probit model using
the oprobit command. Comment on the statistical significance of
regressors. Obtain the MEs for the predicted probability of excellent
health for the ordered probit model, and compare these to those given
in section 18.9.5 for the ordered logit model. Which model do you
prefer for these data: ordered probit or ordered logit?


Chapter 19 
Tobit and selection models 


19.1 Introduction 


The tobit model and its generalizations are relevant when the dependent 
variable of a linear regression is incompletely observed. 


The leading example is censoring, where for some observations the 
dependent variable is observed only over some interval of its support. A 
simple example is top-coding of outcomes such as annual income. A more 
subtle example is modeling desired annual expenditures on a new 
automobile, in which case a cross-sectional survey will reveal a significant 
proportion of households with zero expenditure and the rest with a positive 
level of expenditure. 


A related example is truncation, where some observations are not 
observed if they fall outside some interval of support. For example, we may 
observe only those individuals with income less than the top code or 
observe only those individuals with positive annual expenditures on a new 
car. 


Even if the underlying model is linear, ordinary least-squares (OLS) 
regression will not yield consistent parameter estimates, because a censored 
sample or truncated sample is not representative of the population. Instead, 
alternative estimation methods are needed. 


We begin with the basic tobit model, which assumes that the underlying 
regression model is a latent variable model with dependent variable that 
follows a linear regression model with independent homoskedastic 
normally distributed errors. This latent variable is only partially observed. 


We then consider extensions, including models for a dependent variable 
in logarithms, a two-part model in both of which the mechanism generating 
zero outcome has a special role, a particular selection model (the “Heckman 
model”) that allows errors in the two parts to be correlated, and models with 
endogenous regressors. Note, however, that consistency of estimators relies 
very much on normality of errors or minor relaxation of this assumption. To 
extrapolate from a censored sample to an underlying population requires 
strong assumptions. 


Finally, the methods developed in this chapter are used to compare 
balanced and unbalanced samples resulting from panel attrition and to 
correct for attrition bias in linear panel models. 


19.2 Tobit model 


Suppose that our data consist of $(y_i, \mathbf{x}_i)$, $i = 1, \dots, N$. Assume that $\mathbf{x}_i$ is fully
observed but $y_i$ is incompletely observed.


19.2.1 Regression with censored data 


Suppose the household has a latent (unobserved) demand for goods, denoted 
by y*, that is not expressed as a purchase until some known constant threshold, 
denoted by L, is passed. We observe y = y* only when y* > L; otherwise, we 
observe y = 0. This is an example of left-censoring or censoring from below. 


If instead we observe y = y* only when y* < U, and otherwise observe 
y = U, then we have right-censoring or censoring from above. 


This section focuses on left-censoring or censoring from below because 
this is the most common form of censoring with individual level economics 
data. 


19.2.2 Tobit model setup 


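The regression of interest is specified as an unobserved latent variable, $y^*$,

$$
y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i, \qquad i = 1, \dots, N \qquad (19.1)
$$

where $\varepsilon_i \sim N(0, \sigma^2)$ and $\mathbf{x}_i$ denotes the $(K \times 1)$ vector of exogenous and
fully observed regressors. If $y^*$ were observed, we would estimate $(\boldsymbol{\beta}, \sigma^2)$ by
OLS in the usual way.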


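The observed variable $y_i$ is related to the latent variable $y_i^*$ through the
observation rule

$$
y_i = \begin{cases} y_i^* & \text{if } y_i^* > L \\ L & \text{if } y_i^* \le L \end{cases} \qquad (19.2)
$$

The probability of an observation being censored is
$\Pr(y_i^* \le L) = \Pr(\mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i \le L) = \Phi\{(L - \mathbf{x}_i'\boldsymbol{\beta})/\sigma\}$, where $\Phi(\cdot)$ is the
standard normal cumulative distribution function.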


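The truncated mean, or expected value, of $y$ for the noncensored
observations can be shown to be

$$
E(y_i \mid \mathbf{x}_i, y_i > L) = \mathbf{x}_i'\boldsymbol{\beta} + \sigma\,\frac{\phi\{(\mathbf{x}_i'\boldsymbol{\beta} - L)/\sigma\}}{\Phi\{(\mathbf{x}_i'\boldsymbol{\beta} - L)/\sigma\}} \qquad (19.3)
$$

where $\phi(\cdot)$ is the standard normal density and it is assumed that $\varepsilon_i \sim N(0, \sigma^2)$.
The conditional mean $E(y_i \mid \mathbf{x}_i, y_i > L)$ in (19.3) differs from $\mathbf{x}_i'\boldsymbol{\beta}$, as does
the censored mean $E(y_i \mid \mathbf{x}_i)$ which, for example, equals
$\Phi(\mathbf{x}_i'\boldsymbol{\beta}/\sigma)\mathbf{x}_i'\boldsymbol{\beta} + \sigma\phi(\mathbf{x}_i'\boldsymbol{\beta}/\sigma)$ when $L = 0$. Thus, OLS of $y$ on $\mathbf{x}$ leads to an
inconsistent estimate of $\boldsymbol{\beta}$ in the original model (19.1).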
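
A small numerical illustration shows how far both means can lie from the latent-variable mean. The values used below (an index of 1, a standard deviation of 2, and censoring at $L = 0$) are assumptions chosen for illustration, not estimates; the scalar names xb and sigma are likewise introduced here.

```stata
* Illustrative values (assumptions, not estimates): x'b = 1, sigma = 2, L = 0
scalar xb    = 1
scalar sigma = 2
* Truncated mean E(y | x, y > 0): index plus sigma times the ratio phi/Phi
display "truncated mean = " xb + sigma*normalden(xb/sigma)/normal(xb/sigma)
* Censored mean E(y | x) when L = 0
display "censored mean  = " normal(xb/sigma)*xb + sigma*normalden(xb/sigma)
```

Both displayed means exceed the latent mean of 1, which is the source of the inconsistency of OLS in either the truncated or the censored sample.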


A sample may instead include right-censored observations. Then we
observe that

$$
y_i = \begin{cases} y_i^* & \text{if } y_i^* < U \\ U & \text{if } y_i^* \ge U \end{cases}
$$


The leading case of censoring is that in which the data are left-censored 
only and L = 0. A variant of the tobit model, the two-limit tobit, allows for 
both left- and right-censoring. Another variant, considered in the application of 
this chapter, is that in which the data are only left-censored but L is unknown. 


19.2.3 Tobit estimation


The tobit model can be fit by maximum likelihood (ML) or by two-step
regression. We first consider ML estimation under the assumptions that the
regression error is homoskedastic and normally distributed.


For the case of left-censored data with censoring point $L$, the density
function has two components that correspond to uncensored and censored
observations. Let $d = 1$ denote the censoring indicator for the outcome that the
observation is not censored, and let $d = 0$ indicate a censored observation. The
density can be written as

$$
f(y_i) = \left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(y_i - \mathbf{x}_i'\boldsymbol{\beta})^2\right\}\right]^{d_i}
\left[\Phi\{(L - \mathbf{x}_i'\boldsymbol{\beta})/\sigma\}\right]^{1-d_i} \qquad (19.4)
$$

The second term in (19.4) reflects the contribution to the likelihood of the
censored observation. ML estimates of $(\boldsymbol{\beta}, \sigma^2)$ solve the first-order conditions
from maximization of the log likelihood based on (19.4). These equations are
nonlinear in parameters, so an iterative algorithm is required.
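
Taking logs of (19.4) and summing over observations gives the objective that is maximized. The display below is a sketch of that log likelihood under the stated assumptions of independent, homoskedastic, normal errors; it simply restates (19.4) in log form.

```latex
\ln L(\boldsymbol{\beta}, \sigma^2) = \sum_{i=1}^{N}\left[
  d_i\left\{-\tfrac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i - \mathbf{x}_i'\boldsymbol{\beta})^2\right\}
  + (1 - d_i)\ln\Phi\{(L - \mathbf{x}_i'\boldsymbol{\beta})/\sigma\}
\right]
```

The first part is the usual normal regression log density for uncensored observations; the second part is a probit-type term for the censored observations.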


An alternative two-step estimator requires slightly weaker assumptions
than those for the tobit maximum-likelihood estimator (MLE). Suppose we
consider regression using only the uncensored observations. Then the
conditional mean in (19.3) has an additional term that depends on $\phi_i/\Phi_i$. This
missing variable can be generated by a probit model that models the probability
of the outcome that $y^* > 0$. Let $d_i = 1$ denote the outcome that $y^* > 0$, and let
$d_i = 0$ otherwise. Probit estimation of $d_i$ on $\mathbf{x}_i$ using all observations can yield a
consistent estimate of $\boldsymbol{\beta}/\sigma$ and hence a consistent estimate of $\lambda_i = \phi_i/\Phi_i$. A
linear regression of $y_i$ on $\mathbf{x}_i$ and $\hat{\phi}_i/\hat{\Phi}_i$ using just the uncensored observations
will provide a consistent estimate of $\boldsymbol{\beta}$.


19.2.4 Robust standard errors


The tobit MLE is consistent under the stated assumptions, including
$\varepsilon_i \sim N(0, \sigma^2)$. It is inconsistent, however, if the errors are not normally
distributed or if they are heteroskedastic. These strong assumptions are likely
to be violated in applications, and this makes the tobit MLE a nonrobust
estimator. It is desirable to test the assumptions of normality and
homoskedasticity.


Some misspecification is likely, so inference should be based on quasi-ML
theory; see section 13.3.1. We use robust standard errors that provide
consistent estimates of the variance of $\hat{\boldsymbol{\beta}}$, should the model be misspecified.
For independent observations, we use the vce(robust) option, and for
observations that are dependent because of clustering, we use the vce(cluster
clustvar) option.


19.2.5 The tobit command 


The Stata estimation command for tobit regression with censored data has the 
following basic syntax: 


tobit depvar [indepvars] [if] [in] [weight] [, ll[(varname | #)] ul[(varname | #)] options]


The specifications ll[(#)] and ul[(#)] refer to the lower limit (left-censoring
point) and the upper limit (right-censoring point), respectively. If the data are
subject to left-censoring at zero, for example, then only ll(0) is required.
Similarly, only ul(10000) is required for right-censored data at the censoring
point 10,000. Both are required if the data are both right-censored and left- 
censored and if one wants to estimate the parameters of the two-limit tobit 
model. The postestimation tools for tobit will be discussed later in this 
chapter. 


19.2.6 Unknown censoring point 


As Carson and Sun (2007), and others, have pointed out, the censoring point 
may be unknown. 


Suppose that the data are left-censored, with a constant but unknown
threshold, $\gamma$, in (19.2). The commonly used assumption that the unknown $\gamma$
can be set to zero as a “normalization” is not innocuous. Instead,
$\Pr(y^* < \gamma) = \Phi\{(\gamma - \mathbf{x}'\boldsymbol{\beta})/\sigma\}$, where $(\gamma - \mathbf{x}'\boldsymbol{\beta})/\sigma$ is interpreted as a
“threshold”. In this case, we can set $\widehat{\gamma} = \min(\text{uncensored } y)$ and proceed as if
$\gamma$ is known. Estimates of the tobit model based on this procedure have been
shown to be consistent; see Carson and Sun (2007).


In Stata, this requires only that the value of $\widehat{\gamma}$ should be used in defining
the lower limit, so we can again use the tobit command with the ll(#)
option. It is simplest to set # equal to $\widehat{\gamma}$. However, this will treat the
observation or observations with $y = \widehat{\gamma}$ as censored, and a better alternative is
to set # equal to $\widehat{\gamma} - \Delta$ for some small value, $\Delta$, such as $10^{-6}$.
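
The choice of lower limit can be illustrated with a few lines of Python. This is only a sketch of the rule just described; the helper name `lower_limit` and the toy data are hypothetical. It relies on the fact that tobit treats observations at or below the lower limit as left-censored.

```python
def lower_limit(y, uncensored, delta=1e-6):
    """Return gamma_hat - delta, where gamma_hat is the smallest uncensored y."""
    gamma_hat = min(v for v, d in zip(y, uncensored) if d)
    return gamma_hat - delta

# Toy data (hypothetical): censored observations are recorded as 0
y = [0, 0, 1, 2, 2, 4]
uncensored = [False, False, True, True, True, True]

limit = lower_limit(y, uncensored)

# Observations at or below the limit count as censored; the observation
# with y equal to gamma_hat = 1 stays uncensored because 1 > limit.
n_censored = sum(v <= limit for v in y)
print(limit, n_censored)
```

Setting the limit exactly at the smallest uncensored value would instead count that observation as censored, which is the problem the small offset avoids.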


19.3 Tobit model example 


The illustration we consider, ambulatory expenditures, has the very common
complication that the data are highly right skewed. This is best treated by
taking the natural logarithm, which in turn complicates the analysis.


In the current section, we instead present the simpler tobit model in 
levels. The subsequent sections 19.4—19.8 present the tobit model in natural 
logarithms and consequent complications. Model diagnostics are presented 
in section 19.4.5. 


19.3.1 Data summary 


The data on the dependent variable for ambulatory expenditure (ambexp) and 
the regressors (age, female, educ, blhisp, totchr, and ins) are taken from 
the 2001 Medical Expenditure Panel Survey and are for people aged 21—64 
years. In this sample of 3,328 observations, there are 526 (15.8%) zero values
of ambexp, so censoring may be an issue.


Descriptive statistics for all the variables follow: 


. * Raw data summary
. qui use mus219mepsambexp, clear

. summarize ambexp age female educ blhisp totchr ins

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      ambexp |      3,328    1386.519    2530.406          0      49960
         age |      3,328    4.056881    1.121212        2.1        6.4
      female |      3,328    .5084135    .5000043          0          1
        educ |      3,328    13.40565    2.574199          0         17
      blhisp |      3,328    .3085938    .4619824          0          1
      totchr |      3,328    .4831731    .7720426          0          5
         ins |      3,328    .3650841    .4815261          0          1


A detailed summary of ambexp provides insight into the potential 
problems in estimating the parameters of the tobit model with a linear 
conditional mean function. 


. * Detailed summary to show skewness and kurtosis
. summarize ambexp, detail

               Annual ambulatory expenditures
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs               3,328
25%          113              0       Sum of wgt.       3,328

50%        534.5                      Mean           1386.519
                        Largest       Std. dev.      2530.406
75%         1618          28269
90%         3585          30920       Variance        6402953
95%         5451          34964       Skewness       6.059491
99%        11985          49960       Kurtosis       72.06738


The ambexp variable is very heavily skewed and has considerable nonnormal 
kurtosis. This feature of the dependent variable should alert us to the 
possibility that the tobit MLE may be a flawed estimator for the model. 


To see whether these characteristics persist if the zero observations are 
ignored, we examine the sample distribution of only positive values. 


. * Summary for positives only
. summarize ambexp if ambexp > 0, detail

               Annual ambulatory expenditures
-------------------------------------------------------------
      Percentiles      Smallest
 1%           22              1
 5%           67              2
10%          107              2       Obs               2,802
25%          275              4       Sum of wgt.       2,802

50%          779                      Mean             1646.8
                        Largest       Std. dev.      2678.914
75%         1913          28269
90%         3967          30920       Variance        7176579
95%         6027          34964       Skewness       5.799312
99%        12467          49960       Kurtosis       65.81969


The skewness and nonnormal kurtosis are reduced only a little if the zeros
are ignored.


In principle, the skewness and nonnormal kurtosis of ambexp could be
due to regressors that are skewed. But, from output not listed, an OLS
regression of ambexp on age, female, educ, blhisp, totchr, and ins
explains little of the variation ($R^2 = 0.16$), and the OLS residuals have a
skewness statistic of 6.6 and a kurtosis statistic of 92.2. Even after
conditioning on regressors, the dependent variable is very nonnormal, and a
lognormal model may be more appropriate.


19.3.2 Tobit analysis 


As an initial exploratory step, we will run the linear tobit model without any 
transformation of the dependent variable, even though it appears that the 
data distribution may be nonnormal. The lowest positive values of ambexp 
(of 1, 2, 2, and 4) are close to zero compared with the median of positive 
values of 779, so it is natural to set the left-censoring point $L$ at 0 in this
example. 


. * Tobit analysis for ambexp using all expenditures
. global xlist age female educ blhisp totchr ins  // Define regressor list $xlist

. tobit ambexp $xlist, ll(0) vce(robust)

Refining starting values:
Grid node 0:   log likelihood = -26432.72

Fitting full model:
Iteration 0:   log pseudolikelihood = -26432.72
Iteration 1:   log pseudolikelihood = -26361.187
Iteration 2:   log pseudolikelihood = -26359.43
Iteration 3:   log pseudolikelihood = -26359.424
Iteration 4:   log pseudolikelihood = -26359.424

Tobit regression                                Number of obs     =  3,328
                                                       Uncensored =  2,802
Limits: Lower = 0                                   Left-censored =    526
        Upper = +inf                               Right-censored =      0

                                                F(6, 3322)        =  59.52
                                                Prob > F          = 0.0000
Log pseudolikelihood = -26359.424               Pseudo R2         = 0.0130

-------------------------------------------------------------------------------
              |               Robust
       ambexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
          age |   314.1479   41.19122     7.63   0.000     233.3852    394.9107
       female |   684.9918   100.1353     6.84   0.000     488.6585     881.325
         educ |    70.8656   17.25925     4.11   0.000     37.02577    104.7054
       blhisp |   -530.311   102.8097    -5.16   0.000    -731.8877   -328.7342
       totchr |   1244.578   98.91188    12.58   0.000     1050.644    1438.513
          ins |  -167.4714   84.42021    -1.98   0.047    -332.9923    -1.95054
        _cons |  -1882.591   317.2026    -5.93   0.000    -2504.524   -1260.659
--------------+----------------------------------------------------------------
var(e.ambexp) |    6635296    1088362                       4810499     9152305
-------------------------------------------------------------------------------


All regressors are statistically significant at the 0.05 level. The interpretation 
of the coefficients is as a partial derivative of the latent variable, y*, with 
respect to x. Marginal effects (MEs) for the observed variable, y, are 


presented in section 19.3.4. 


19.3.3 Prediction after tobit 


The predict command is summarized in section 4.2. The command can be 
used after tobit to predict a range of quantities. We begin with the default 
linear prediction, xb, that produces fitted values of the latent variable, y*, for 
all observations in the sample. 


. * Tobit prediction and summary
. predict yhatlin
(option xb assumed; fitted values)

. summarize yhatlin

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     yhatlin |      3,328    1066.683    1257.455  -1564.703   8027.957


A more detailed comparison of the sample statistics for yhatlin with those
for ambexp shows that the tobit model fits especially poorly in the upper tail 
of the distribution. 


This fact notwithstanding, we will use the model to illustrate other 
prediction options for the observed variable, y, that can be used in 
combination with the computation of MEs. 


19.3.4 MEs 


In a censored regression, there are a variety of MEs that are of potential 
interest; see [R] tobit postestimation. The ME is the effect on the conditional 
mean of the dependent variable of changes in the regressors. This effect 
varies according to whether interest lies in the latent variable mean, the 
truncated mean or the censored mean. 


Omitting derivations given in Cameron and Trivedi (2005, chap. 16), for 
censoring from below at zero and with censoring observations taking value 
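zero, we see the MEs are

$$
\begin{aligned}
\text{Latent variable:} \quad & \partial E(y^*|\mathbf{x})/\partial\mathbf{x} = \boldsymbol{\beta} \\
\text{Left-truncated (at 0):} \quad & \partial E(y|\mathbf{x}, y > 0)/\partial\mathbf{x} = \{1 - w\lambda(w) - \lambda(w)^2\}\boldsymbol{\beta} \\
\text{Left-censored (at 0):} \quad & \partial E(y|\mathbf{x})/\partial\mathbf{x} = \Phi(w)\boldsymbol{\beta}
\end{aligned}
$$

where $w = \mathbf{x}'\boldsymbol{\beta}/\sigma$ and $\lambda(w) = \phi(w)/\Phi(w)$. The first of these has
already been discussed above.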


Left-truncated and left-censored examples 


The predict() option of margins is used to obtain MEs with respect to a
desired quantity such as the left-truncated mean. Here we obtain the ME at 
the mean value of the regressors (MEM), using the atmeans option, so the MEs 
are for the average individual in the sample. Factor-variable notation is used 
so that the MEMs for indicator variables are computed using the finite- 
difference method. 


We begin with the MEMs for the left-truncated mean, $E(y|\mathbf{x}, y > 0)$.


. * (1) MEMs for E(y|x,y>0)
. qui tobit ambexp age i.female educ i.blhisp totchr i.ins, ll(0) vce(robust)

. margins, dydx(*) predict(e(0,.)) atmeans noatlegend

Conditional marginal effects                            Number of obs = 3,328
Model VCE: Robust

Expression: E(ambexp|ambexp>0), predict(e(0,.))
dy/dx wrt:  age 1.female educ 1.blhisp totchr 1.ins

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |    145.524   18.79351     7.74   0.000     108.6893    182.3586
    1.female |   317.1037   44.11704     7.19   0.000     230.6359    403.5715
        educ |   32.82734   7.790956     4.21   0.000     17.55735    48.09733
    1.blhisp |  -240.2953    46.5901    -5.16   0.000    -331.6102   -148.9804
      totchr |   576.5307   44.95038    12.83   0.000     488.4296    664.6318
       1.ins |  -77.19554   38.28773    -2.02   0.044    -152.2381   -2.152964
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.


For example, for the average individual 10 more years of aging (a 1-unit
change in variable age) is associated with a $146 increase in expenditures,
conditional on his or her expenditures being positive. For these data, the
MEMs are roughly one-half of the coefficient estimates, $\widehat{\boldsymbol{\beta}}$, given in
section 19.3.2.


The MEMs for the censored mean, $E(y|\mathbf{x})$, are computed next.


. * (2) MEMs for E(y|x)
. margins, dydx(*) predict(ystar(0,.)) atmeans noatlegend

Conditional marginal effects                            Number of obs = 3,328
Model VCE: Robust

Expression: E(ambexp*|ambexp>0), predict(ystar(0,.))
dy/dx wrt:  age 1.female educ 1.blhisp totchr 1.ins

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |    207.526   26.80233     7.74   0.000     154.9944    260.0576
    1.female |   451.6399   62.75074     7.20   0.000     328.6507    574.6291
        educ |   46.81378   11.11639     4.21   0.000     25.02605    68.60151
    1.blhisp |  -342.4803    66.2929    -5.17   0.000     -472.412   -212.5486
      totchr |   822.1678   64.07788    12.83   0.000     696.5774    947.7581
       1.ins |  -110.0883    54.6086    -2.02   0.044    -217.1192    -3.05739
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.


For example, it is predicted that regardless of whether medical expenditures
are positive, for the average individual 10 more years of aging (a 1-unit
change in variable age) is associated with a $208 increase in expenditures.
These MEMs for the average individual are larger in absolute value than those
for the left-truncated mean and are roughly 70% of the original coefficient
estimates.
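
The two scale factors can be checked against each other using only the numbers reported above. The following Python sketch is illustrative and not part of the Stata workflow; the function name `me_factors` is hypothetical. It backs out $w$ at the regressor means from the ratio of the censored-mean MEM to the coefficient for age, then computes the left-truncated factor implied by that $w$.

```python
from statistics import NormalDist

N01 = NormalDist()  # standard normal

def me_factors(w):
    """Scale factors multiplying beta for the truncated and censored means."""
    lam = N01.pdf(w) / N01.cdf(w)          # lambda(w) = phi(w)/Phi(w)
    truncated = 1.0 - w * lam - lam ** 2   # factor for E(y|x, y>0)
    censored = N01.cdf(w)                  # factor for E(y|x)
    return truncated, censored

# Reported values for age: tobit coefficient and the two MEMs
b_age = 314.1479
mem_censored_age = 207.526
mem_truncated_age = 145.524

# Censored-mean factor is Phi(w), so w follows from the reported ratio
w = N01.inv_cdf(mem_censored_age / b_age)
truncated_factor, censored_factor = me_factors(w)

print(censored_factor, truncated_factor, mem_truncated_age / b_age)
```

The implied left-truncated factor agrees closely with the reported ratio for age, which is the sense in which the truncated-mean MEMs are roughly one-half of the coefficients and the censored-mean MEMs roughly two-thirds.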


Left-censored case computed directly 


Next, we illustrate direct computation of the MEs for the left-censored mean,
which we see above is $\Phi(\mathbf{x}'\boldsymbol{\beta}/\sigma)\beta_j$ for the $j$th regressor.


This example also illustrates how to retrieve tobit model coefficients.
There are six regressors, but three are binary indicator variables that each
take up two components in e(b). So the first 9 (= 3 + 2 × 3) components of
e(b) are regressors, the 10th is the constant term, and the 11th is the estimate
of $\sigma^2$. We have


. * Direct computation of MEMs for E(y|x)
. predict xb1, xb                        // xb1 is estimate of x'b

. matrix btobit = e(b)

. scalar sigmasq = btobit[1,11]          // sigmasq is estimate of sigma^2

. matrix bcoeff = btobit[1,1..9]         // bcoeff is betas excl. constant

. qui summarize xb1

. scalar meanxb = r(mean)                // Mean of x'b equals (mean of x)'b

. scalar PHI = normal(meanxb/sqrt(sigmasq))

. matrix deriv = PHI*bcoeff

. matrix list deriv

deriv[1,9]
        ambexp:     ambexp:     ambexp:     ambexp:     ambexp:     ambexp:
                       0b.         1.                     0b.         1.
           age     female      female        educ      blhisp      blhisp
y1   207.52598          0   452.50523   46.813781           0  -350.32317

        ambexp:     ambexp:     ambexp:
                       0b.         1.
        totchr        ins         ins
y1   822.16778          0  -110.63154

. ereturn post deriv                     // Print nicer looking results

. ereturn display

-------------------------
             | Coefficient
-------------+-----------
ambexp       |
         age |    207.526
    1.female |   452.5052
        educ |   46.81378
    1.blhisp |  -350.3232
      totchr |   822.1678
       1.ins |  -110.6315
-------------------------


As expected, the MEs for continuous regressors are identical to those 
obtained above with margins. For binary regressors, there is some difference 
because margins uses the finite-difference method rather than calculus 
methods; see section 13.7.6. 
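
The gap between the two methods for a binary regressor can be reproduced from reported quantities. The Python sketch below is illustrative only. It uses the standard expression for the censored mean of a tobit left-censored at zero, $E(y|\mathbf{x}) = \Phi(w)\mathbf{x}'\boldsymbol{\beta} + \sigma\phi(w)$ with $w = \mathbf{x}'\boldsymbol{\beta}/\sigma$, which is not stated in the text above and is supplied here as an assumption of the example; the function name `censored_mean` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

N01 = NormalDist()  # standard normal

def censored_mean(xb, sigma):
    """E(y|x) for left-censoring at 0: Phi(w)*xb + sigma*phi(w), w = xb/sigma."""
    w = xb / sigma
    return N01.cdf(w) * xb + sigma * N01.pdf(w)

# Quantities reported earlier in this section
b_female = 684.9918        # tobit coefficient of female
sigma = sqrt(6635296)      # square root of var(e.ambexp)
mean_xb = 1066.683         # mean of yhatlin, equal to (mean of x)'b
mean_female = 0.5084135    # sample mean of female

# Calculus method: Phi(w) times the coefficient, at the mean index
calculus_me = N01.cdf(mean_xb / sigma) * b_female

# Finite-difference method: female switched from 0 to 1,
# with all other regressors held at their means
xb_female0 = mean_xb - mean_female * b_female
xb_female1 = xb_female0 + b_female
finite_diff_me = censored_mean(xb_female1, sigma) - censored_mean(xb_female0, sigma)

print(calculus_me, finite_diff_me)
```

The calculus value is close to the 452.5 obtained by direct computation, and the finite-difference value is close to the 451.6 reported by margins, so the small discrepancy is due to the method and not to the model.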


Marginal impact on probabilities 


Interest may also lie in the impact of a change in a regressor on the 
probability that y is in a specified interval. For illustration, consider the 
MEMs for Pr(5000 < ambexp < 10000).


. * Compute MEMs for Pr(5000<ambexp<10000)
. qui tobit ambexp age i.female educ i.blhisp totchr i.ins, ll(0) vce(robust)

. margins, dydx(*) predict(pr(5000,10000)) atmeans noatlegend

Conditional marginal effects                            Number of obs = 3,328
Model VCE: Robust

Expression: Pr(5000<ambexp<10000), predict(pr(5000,10000))
dy/dx wrt:  age 1.female educ 1.blhisp totchr 1.ins

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .0150449    .002801     5.37   0.000     .0095551    .0205347
    1.female |   .0328152   .0068034     4.82   0.000     .0194808    .0461496
        educ |   .0033938   .0009929     3.42   0.001     .0014478    .0053399
    1.blhisp |  -.0239676   .0049076    -4.88   0.000    -.0335864   -.0143488
      totchr |   .0596042   .0090396     6.59   0.000     .0418869    .0773214
       1.ins |  -.0079162   .0043282    -1.83   0.067    -.0163994    .0005669
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.


The effects appear very small, though note that from separate computation, 
only 5% of the sample falls into this range. 


19.3.5 Truncated regression 


In some applications, the data are truncated, so that only data that are not 
censored are observed. In that case, the truncated tobit MLE can be obtained 
using the truncreg command. 


For expositional purposes, we can generate truncated data by restricting
attention to only those observations with ambexp > 0. The truncated tobit
estimates with left-truncation at zero are then


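. * Truncated tobit with truncation at zero
. truncreg ambexp age female educ blhisp totchr ins, ll(0) nolog vce(robust)
(526 obs truncated)

Truncated regression
Limit: Lower =    0                                     Number of obs =  2,802
       Upper = +inf                                     Wald chi2(6)  =   7.58
Log pseudolikelihood = -23265.704                       Prob > chi2   = 0.2706

------------------------------------------------------------------------------
             |               Robust
      ambexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   28466.17   13473.91     2.11   0.035     2057.793    54874.55
      female |   34695.92   22217.13     1.56   0.118    -8848.853    78240.69
        educ |   1226.796   1909.361     0.64   0.521    -2515.483    4969.076
      blhisp |  -33156.71   15953.57    -2.08   0.038    -64425.13   -1888.297
      totchr |   42830.02   18250.76     2.35   0.019      7059.18    78600.85
         ins |  -26356.62   16758.18    -1.57   0.116    -59202.05     6488.81
       _cons |  -371878.7   181883.8    -2.04   0.041    -728364.5   -15392.94
-------------+----------------------------------------------------------------
      /sigma |   17530.02   4362.635     4.02   0.000     8979.409    26080.62
------------------------------------------------------------------------------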


The coefficients have the same sign as those from censored tobit or OLS
regression but are many times larger and appear to be unrealistic. For
example, 10 years of aging is estimated to lead to $28,466 higher ambulatory
expenditures.


However, parameter interpretation can become difficult in nonlinear
models, and it is MEs that matter. For example, the MEMs for the
left-truncated mean, $E(y|\mathbf{x}, y > 0)$, are


. * MEMs for E(y|x,y>0) following truncated tobit
. margins, dydx(*) predict(e(0,.)) atmeans noatlegend

Conditional marginal effects                            Number of obs = 2,802
Model VCE: Robust

Expression: E(ambexp|ambexp>0), predict(e(0,.))
dy/dx wrt:  age female educ blhisp totchr ins

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |    184.017    27.3567     6.73   0.000     130.3988    237.6351
      female |   224.2886   72.98036     3.07   0.002     81.24976    367.3275
        educ |   7.930513   11.56309     0.69   0.493    -14.73274    30.59376
      blhisp |  -214.3386   95.91943    -2.23   0.025    -402.3372   -26.33997
      totchr |   276.8708    23.0198    12.03   0.000     231.7528    321.9888
         ins |    -170.38    63.4146    -2.69   0.007    -294.6703   -46.08967
------------------------------------------------------------------------------


These are much more comparable with the corresponding MEMs following
censored tobit estimation.


19.3.6 Additional commands for censored and truncated regression 


The censored least-absolute-deviations estimator of Powell (1984) provides
consistent estimates for left-censored or right-censored data under the
weaker assumption that the error, $\varepsilon$, in (19.1) is independent and identically
distributed and symmetrically distributed. This is implemented with the
community-contributed clad command of Jolliffe, Krushelnytskyy, and
Semykina (2000). For these data, the method is best implemented for the
data in logs.


The intreg command is a generalization of tobit for data observed in 
intervals. For example, expenditures might be observed in the ranges $y \le 0$,
$0 < y \le 1000$, $1000 < y \le 10000$, and $y > 10000$.


A quite different type of right-censored data is duration data on length of 
unemployment spell or survival data on time until death. The standard 
approach for such data is to model the conditional hazard of the spell ending 
rather than the conditional mean. Modeling the conditional hazard has the 
advantage of permitting the use of the Cox proportional hazards model, 
which allows semiparametric estimation without strong distributional 
assumptions such as an exponential or Weibull distribution for durations. For 
details, see section 21.5. 


For ML estimates of censored Poisson and truncated Poisson and negative 
binomial for count data, see section 20.3.7. 


The gsem command presented in section 23.6 enables estimation of 
censored regression for a normally distributed dependent variable and 
truncated regression for normally distributed, Poisson, and several 
parametric survival models. 


The following command gives results identical to those obtained earlier 
using the tobit command. 


. * gsem alternative for estimating tobit regression 
. gsem ambexp <- $xlist, family(gaussian, lcensored(0)) vce(robust) 


(output omitted ) 


For any given latent variable density $f^*(y^*|\mathbf{x}, \boldsymbol{\theta})$, it is easy to code up
the log density for censored or truncated $y$ and obtain ML estimates using the
mlexp or ml command. For example, for $y_i$ left-censored at zero, the log
density is

$$
\ln f(y_i|\mathbf{x}_i, \boldsymbol{\theta}) = d_i \ln f^*(y_i|\mathbf{x}_i, \boldsymbol{\theta}) + (1 - d_i) \ln F^*(0|\mathbf{x}_i, \boldsymbol{\theta})
$$

where $d_i = 1$ if $y_i > 0$, $d_i = 0$ if $y_i = 0$, and $F^*(y^*|\mathbf{x}, \boldsymbol{\theta})$ is the cumulative
distribution function of $y^*$.


19.3.7 Clustered data 


The tobit model is highly dependent on distributional assumptions and not 
robust to within-cluster correlation. Instead, one needs to generalize the tobit 
model to a richer parametric model that explicitly incorporates
within-cluster correlation.


The metobit command allows a normally distributed intercept and slope 
parameters to vary at the cluster level, as does the meintreg command for 
interval data; see section 23.4. For panel data, the xttobit and xtintreg 
commands allow the intercept to vary at the cluster level. 


19.4 Tobit for lognormal data 


The tobit model relies crucially on normality, but expenditure data are often 
better modeled as lognormal. A tobit regression model for lognormal data 
introduces two complications: a nonzero threshold and lognormal y. 


Now introduce lognormality by specifying

$$
y^* = \exp(\mathbf{x}'\boldsymbol{\beta} + \varepsilon), \quad \varepsilon \sim N(0, \sigma^2)
$$

where we observe that

$$
y = \begin{cases} y^* & \text{if } \ln y^* > \gamma \\ 0 & \text{if } \ln y^* \le \gamma \end{cases}
$$

Here it is known that $y = 0$ when data are censored, and in general $\gamma \ne 0$.
The parameters of this model can be fit using the tobit command with the
ll(#) option, where the dependent variable is $\ln y$ rather than $y$ and the
threshold # equals the minimum uncensored value of $\ln y$. The censored
values of $\ln y$ must be set to a value equal to or less than the minimum
uncensored value of $\ln y$.


In this model, interest lies in the prediction of expenditures in levels 
rather than logs. The issues are similar to those considered in section 3.4.8 
for the lognormal model. Some algebra yields the censored mean 


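$$
E(y|\mathbf{x}) = \exp\left(\mathbf{x}'\boldsymbol{\beta} + \frac{\sigma^2}{2}\right)\left\{1 - \Phi\left(\frac{\gamma - \mathbf{x}'\boldsymbol{\beta} - \sigma^2}{\sigma}\right)\right\} \tag{19.5}
$$


The truncated mean $E(y|\mathbf{x}, y > 0)$ equals $E(y|\mathbf{x})/[1 - \Phi\{(\gamma - \mathbf{x}'\boldsymbol{\beta})/\sigma\}]$.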
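
The censored and truncated means in levels are simple to evaluate once the index, the error standard deviation, and the threshold are known. The Python sketch below is a hedged illustration, not the authors' code: the function names and the input values are hypothetical, and the model assumed is the lognormal tobit in which $y = \exp(\mathbf{x}'\boldsymbol{\beta} + \varepsilon)$ is observed when its log exceeds $\gamma$ and $y = 0$ otherwise.

```python
from math import exp
from statistics import NormalDist

N01 = NormalDist()  # standard normal

def censored_mean(xb, sigma, gamma):
    """E(y|x) as in (19.5) for the lognormal tobit with threshold gamma on ln y."""
    upper_tail = 1.0 - N01.cdf((gamma - xb - sigma ** 2) / sigma)
    return exp(xb + sigma ** 2 / 2.0) * upper_tail

def truncated_mean(xb, sigma, gamma):
    """E(y|x, y>0): the censored mean divided by Pr(ln y* > gamma)."""
    prob_positive = 1.0 - N01.cdf((gamma - xb) / sigma)
    return censored_mean(xb, sigma, gamma) / prob_positive

# Hypothetical inputs: index on the log scale, error s.d., and threshold
xb, sigma, gamma = 6.0, 1.4, 0.0
print(censored_mean(xb, sigma, gamma), truncated_mean(xb, sigma, gamma))
```

With a threshold far below the index, the correction term approaches 1 and the censored mean approaches the usual lognormal mean $\exp(\mathbf{x}'\boldsymbol{\beta} + \sigma^2/2)$.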


19.4.1 Data example 


The illustrative application of the tobit model considered here uses the same 
data as in section 19.3. We remind the reader that in this sample of 3,328 
observations, there are 526 (15.8%) zero values of ambexp. A detailed 
summary of ln(ambexp), denoted by lambexp, follows.


. * Summary of log(expenditures) for positives only
. summarize lambexp, detail

                 Natural logarithm of ambexp
-------------------------------------------------------------
      Percentiles      Smallest
 1%     3.091043              0
 5%     4.204693       .6931472
10%     4.672829       .6931472       Obs               2,802
25%     5.616771       1.386294       Sum of wgt.       2,802

50%      6.65801                      Mean           6.555066
                        Largest       Std. dev.       1.41073
75%     7.556428       10.24952
90%     8.285766       10.33916       Variance       1.990161
95%     8.704004       10.46207       Skewness      -.3421614
99%      9.43084       10.81898       Kurtosis       3.127747


The summary shows that, for positive values of ambexp, ln(ambexp) is
almost symmetrically distributed and has negligible nonnormal kurtosis. 
This is in stark contrast to ambexp even after conditioning on regressors; see 
section 19.3.1. We anticipate that the tobit model is better suited to modeling 
lambexp than ambexp. 


19.4.2 Setting the censoring point for data in logs 


It may be preferred at times to apply a transformation to the dependent 
variable to make it more suitable for a tobit application. In the present 
instance, we work with ln(ambexp) as the dependent variable. This variable
is originally set to missing if ambexp = 0, but to use the tobit command,
we need to set it to a nonmissing value, the lower limit.


Here the smallest positive value of ambexp is 1, in which case ln(ambexp)
equals 0. Then Stata’s default ll option or ll(0) option mistakenly treats
this observation as censored rather than as an uncensored zero, leading to
shrinkage in the sample size for noncensored observations. In our sample,
one observation would be thus “lost”. To avoid this loss, we “tricked” Stata
by setting all censored observations of ln y to an amount slightly smaller
than the minimum noncensored value of ln y, as follows:


. * "Tricking" Stata to handle log transformation and labeling variables 
. generate y = ambexp 


. generate dy = ambexp > 0 


. generate lny = ln(y) // Zero values will become missing 
(526 missing values generated) 


. qui summarize lny

. scalar gamma = r(min)     // This could be negative

. display "gamma = " gamma
gamma = 0

. replace lny = gamma - 0.0000001 if lny == .
(526 real changes made)


. tabulate y if y < 0.02 // 0.02 is arbitrary small value 

y Freq. Percent Cum. 
0 526 100.00 100.00 

Total 526 100.00 

. tabulate lny if lny < gamma + 0.02 
lny Freq. Percent Cum. 
-1.00e-07 526 99.81 99.81 
0 1 0.19 100.00 

Total 527 100.00 


Note that the dependent variables have been relabeled. This makes the
Stata code given later easier to adapt for other applications. In what follows,
y is the ambexp variable, and ln y is the lny variable. Variables y, dy, and lny
are already included in the dataset. The analysis below uses the scalar gamma,
which equals -0.0000001 for these data.


. * Code below needs gamma which is the lower limit of lny (here 0) - 0.0000001 
. scalar gamma = -0.0000001 


19.4.3 Results 


We first obtain the tobit MLE, where now log expenditures is the dependent 
variable. 


. * Now do tobit on lny
. tobit lny $xlist, ll(gamma) vce(robust)

Refining starting values:
Grid node 0:   log likelihood = -7574.942

Fitting full model:
Iteration 0:   log pseudolikelihood = -7574.942
Iteration 1:   log pseudolikelihood = -7496.5907
Iteration 2:   log pseudolikelihood = -7494.2941
Iteration 3:   log pseudolikelihood = -7494.29
Iteration 4:   log pseudolikelihood = -7494.29

Tobit regression                                Number of obs     =  3,328
                                                       Uncensored =  2,802
Limits: Lower = -0.00                               Left-censored =    526
        Upper =  +inf                              Right-censored =      0

                                                F(6, 3322)        = 156.84
                                                Prob > F          = 0.0000
Log pseudolikelihood = -7494.29                 Pseudo R2         = 0.0525

------------------------------------------------------------------------------
             |               Robust
         lny | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .3630699   .0457116     7.94   0.000     .2734441    .4526957
      female |   1.341809   .0991297    13.54   0.000     1.147447     1.53617
        educ |    .138446   .0201113     6.88   0.000     .0990142    .1778779
      blhisp |  -.8731611   .1174141    -7.44   0.000    -1.103372   -.6429499
      totchr |   1.161268    .053751    21.60   0.000     1.055879    1.266656
         ins |   .2612202   .0989029     2.64   0.008     .0673035    .4551369
       _cons |   .9237178   .3548763     2.60   0.009     .2279195    1.619516
-------------+----------------------------------------------------------------
  var(e.lny) |   7.735265   .2840619                      7.197889     8.31276
------------------------------------------------------------------------------


All estimated coefficients are statistically significant at the 0.05 level and 
have the expected signs. 


To assess the impact of using the censored regression framework instead 
of treating the zeros like observations from the same data-generating process 
as the positives, let us compare the results with those from the OLS regression 
of lny on the regressors.


. * OLS, not tobit
. regress lny $xlist, noheader vce(robust)

------------------------------------------------------------------------------
             |               Robust
         lny | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .3247317   .0388171     8.37   0.000     .2486239    .4008396
      female |   1.144695   .0831535    13.77   0.000     .9816578    1.307733
        educ |    .114108   .0165032     6.91   0.000     .0817505    .1464654
      blhisp |  -.7341754   .0974335    -7.54   0.000    -.9252112   -.5431397
      totchr |   1.059395   .0462565    22.90   0.000      .968701    1.150089
         ins |   .2078343   .0840858     2.47   0.013     .0429692    .3726995
       _cons |   1.728764   .2865494     6.03   0.000     1.166933    2.290596
------------------------------------------------------------------------------


All the OLS slope coefficients are in absolute terms smaller than those for the
ML tobit, the reduction being 10—15%, but the OLS intercept is larger. The 
impact of censoring (zeros) on the OLS results depends on the proportion of 
censored observations, which in our case is around 15%. 


19.4.4 Two-limit tobit 


In less than 1.5% of the sample (48 observations), ambexp exceeds $10,000. 
Suppose that we want to exclude these high values that contribute to the 
nonnormal kurtosis. Or suppose that the data above an upper-cutoff point are 
reported as falling in an interval. Choosing $10,000 as the upper-censoring 
point, we fit a two-limit version of the tobit model. We see that the
impact of right-censoring the 48 observations is relatively small. This is not
too surprising because only a small proportion of the sample is right-censored.


. * Now do two-limit tobit
. scalar upper = log(10000)

. display upper
9.2103404

. tobit lny $xlist, ll(gamma) ul(9.2103404) vce(robust)

Refining starting values:
Grid node 0:   log likelihood = -7544.0716

Fitting full model:
Iteration 0:   log pseudolikelihood = -7544.0716
Iteration 1:   log pseudolikelihood = -7453.8109
Iteration 2:   log pseudolikelihood = -7451.7643
Iteration 3:   log pseudolikelihood = -7451.7623
Iteration 4:   log pseudolikelihood = -7451.7623

Tobit regression                                Number of obs     =  3,328
                                                       Uncensored =  2,754
Limits: Lower = -0.00                               Left-censored =    526
        Upper =  9.21                              Right-censored =     48

                                                F(6, 3322)        = 149.00
                                                Prob > F          = 0.0000
Log pseudolikelihood = -7451.7623               Pseudo R2         = 0.0534

------------------------------------------------------------------------------
             |               Robust
         lny | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .3711061   .0463719     8.00   0.000     .2801858    .4620264
      female |   1.348768   .1003897    13.44   0.000     1.151936      1.5456
        educ |   .1402643   .0203227     6.90   0.000     .1004181    .1801105
      blhisp |  -.8759505   .1187037    -7.38   0.000     -1.10869   -.6432108
      totchr |    1.20494   .0571746    21.07   0.000     1.092839    1.317041
         ins |   .2466838   .1001165     2.46   0.014     .0503875    .4429802
       _cons |   .8638458   .3591707     2.41   0.016     .1596275    1.568064
-------------+----------------------------------------------------------------
  var(e.lny) |   7.909052   .2943484                      7.352483    8.507753
------------------------------------------------------------------------------


19.4.5 Model diagnostics 


To test the validity of the key tobit assumptions of normality and 
homoskedasticity, we need to apply some diagnostic checks. For the 
ordinary linear regression model, the sktest and estat hettest commands 
are available to test for normality and homoskedasticity. These tests are 
based on the OLS residuals. These postestimation tests are invalid for 
censored data because the fitted values and residuals from a censored model 
do not share the properties of their ordinary regression counterparts. Instead,
generalized residuals for censored regression, as discussed in Cameron and
Trivedi (2005, 630) and in Verbeek (2017, 238—240), provide the key 
component for generating test statistics for testing the null hypotheses of 
homoskedasticity and normality. 


In linear regression, tests of homoskedasticity typically use squared 
residuals, and tests of normality use residuals raised to a power of 3 or 4. 
The first step is then to construct analogous quantities for the censored 
regression. 


For uncensored observations, we use $\widehat{\varepsilon}_i = (y_i - \mathbf{x}_i'\widehat{\boldsymbol{\beta}})/\widehat{\sigma}$ raised to the
relevant powers, where $y_i$ is a generic notation for the dependent variable,
which here is ln(ambexp).


For observations truncated from above, with $\ln y_i \le \gamma$ and $d_i = 0$, we
use the truncated moment of the normal error listed in table 19.1, evaluated
at $\widehat{\boldsymbol{\beta}}$ and $\widehat{\sigma}$.


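Table 19.1. Moments of normal errors truncated from above


Moments                              Expression
$E(\varepsilon_i \mid d_i = 0)$        $-\lambda_i$
$E(\varepsilon_i^2 \mid d_i = 0)$      $1 - z_i\lambda_i$
$E(\varepsilon_i^3 \mid d_i = 0)$      $-(2 + z_i^2)\lambda_i$
$E(\varepsilon_i^4 \mid d_i = 0)$      $3 - (3z_i + z_i^3)\lambda_i$

Notes: $\lambda_i = \phi(z_i)/\Phi(z_i)$ and $z_i = (\gamma - \mathbf{x}_i'\boldsymbol{\beta})/\sigma$.


The components $\phi(\cdot)$ and $\Phi(\cdot)$ given in table 19.1 can be evaluated by using
Stata’s normalden() and normal() functions. Given these and predicted
values from the ML regression, the four “generalized” components given in
the table can be readily computed.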
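
The closed-form truncated moments can be verified numerically. The Python sketch below is an illustration under stated assumptions, not the book's code: it takes the error to be standard normal (that is, already divided by $\sigma$), uses the standard results for the first four moments of a standard normal truncated from above at $z$, and compares them with a midpoint-rule integral. The function names and the value of `z` are hypothetical.

```python
from statistics import NormalDist

N01 = NormalDist()  # standard normal

def truncated_moments(z):
    """First four moments of a standard normal error e, given e <= z."""
    lam = N01.pdf(z) / N01.cdf(z)  # lambda = phi(z)/Phi(z)
    return (
        -lam,
        1.0 - z * lam,
        -(2.0 + z ** 2) * lam,
        3.0 - (3.0 * z + z ** 3) * lam,
    )

def numeric_moment(z, power, lower=-12.0, steps=50_000):
    """Midpoint-rule approximation of E(e**power | e <= z)."""
    h = (z - lower) / steps
    total = 0.0
    for k in range(steps):
        e = lower + (k + 0.5) * h
        total += e ** power * N01.pdf(e) * h
    return total / N01.cdf(z)

z = -0.5  # hypothetical standardized threshold
closed_form = truncated_moments(z)
numeric = [numeric_moment(z, p) for p in (1, 2, 3, 4)]
for c, n in zip(closed_form, numeric):
    print(c, n)
```

Subtracting 1 from the second moment and 3 from the fourth gives the censored-observation values used for gres2 and gres4, which are centered so that they have mean zero under normality.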


19.4.6 Tests of normality and homoskedasticity 


Lagrange multiplier, or score, tests of heteroskedasticity and nonnormality 
are appealing because they require estimation only of the models under the 
hypothesis of normality and homoskedasticity. The test statistics are 
quadratic forms that can be calculated in several different ways. One way is 
by using an auxiliary regression; see section 11.5.3 and Cameron and 
Trivedi (2005, chap. 8). 


Conditional moment tests can also be performed by using a similar 
approach; see section 11.9.1, Newey (1985), and Pagan and Vella (1989). 
Such regression-based tests have been developed with generalized residuals. 
Although they are not currently available as a part of the official Stata 
package, they can be constructed from Stata output, as illustrated below. The 
key component of the auxiliary regression is the uncentered $R^2$, denoted by
$R_u^2$, from the auxiliary regression of 1 on generated regressors that are
themselves functions of generalized residuals. The specific regressors 
depend upon the alternative to the null. 


We implement two conditional moment tests below. The first is a test of 
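normality based on testing whether the uncensored error $\varepsilon_i$ has third moment
zero and fourth moment $3\sigma^4$, as is the case for the normal distribution. Thus,
we test $E\{(y_i^* - \mathbf{x}_i'\boldsymbol{\beta})^3\} = 0$ and $E\{(y_i^* - \mathbf{x}_i'\boldsymbol{\beta})^4 - 3\sigma^4\} = 0$. The
heteroskedasticity test is a test of $E[\mathbf{w}_i\{(y_i^* - \mathbf{x}_i'\boldsymbol{\beta})^2 - \sigma^2\}] = 0$, where $\mathbf{w}_i$ are
variables that $\sigma^2$ may vary with if there is heteroskedasticity.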


Generalized residuals and scores 


To implement the test, we first need to compute the inverse Mills’s ratio,
$\lambda_i$, because from table 19.1, the generalized residuals depend in part on $\lambda_i$.


. * Compute inverse Mills’s ratio
. qui tobit lny $xlist, ll(gamma) vce(robust)

. predict xb, xb                             // xb is estimate of x'b

. matrix btobit = e(b)

. scalar sigma = sqrt(btobit[1,e(df_m)+2])   // sigma is estimate of sigma

. generate threshold = (gamma-xb)/sigma      // gamma: lower censoring point

. generate lambda = normalden(threshold)/normal(threshold)


Next, we calculate generalized residuals and functions of them. For 
example, gres3, the sample analogue of $\{(y_i^* - \mathbf{x}_i'\boldsymbol{\beta})/\sigma\}^3$, equals 
$\{(y_i - \mathbf{x}_i'\boldsymbol{\beta})/\sigma\}^3$ for an uncensored observation and equals $-(2 + z_i^2)\lambda_i$ for 
a censored observation, where $z_i$ and $\lambda_i$ are defined in the table. The 
generalized residuals gres1 and gres2 can be shown to be the contributions 
to the score for, respectively, the intercept $\beta_1$ and $\sigma$, so they must sum to 
zero over the sample. The generalized residuals gres3 and gres4 satisfy the 
same zero-mean property only if the model is correctly specified. 


The generalized residuals are computed as follows: 


. * Generalized residuals: First create gres1, which has mean zero, by the f.o.c. 
. qui generate uifdyeq1 = (lny - xb)/sigma if dy == 1 
. qui generate double gres1 = uifdyeq1 
. qui replace gres1 = -lambda if dy == 0 


. summarize gres1 


    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
       gres1 |      3,328   -7.03e-10    .9877495  -3.129662   2.245604 


The zero-mean property of gres1 is thus verified. The remaining three 
variables are computed next. 


. * Generalized residuals: Next create gres2, gres3, and gres4 
. qui generate double gres2 = uifdyeq1^2 - 1 
. qui replace gres2 = -threshold*lambda if dy == 0 
. qui generate double gres3 = uifdyeq1^3 
. qui replace gres3 = -(2 + threshold^2)*lambda if dy == 0 
. qui generate double gres4 = uifdyeq1^4 - 3 
. qui replace gres4 = -(3*threshold + threshold^3)*lambda if dy == 0 
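
The zero-mean property can be checked for the remaining generalized residuals in the same way as for gres1. The following is a minimal sketch, assuming the four variables were created as in the commands above. Because gres2 is a score contribution, its mean should be numerically close to zero, whereas the means of gres3 and gres4 need not be.

```stata
* Sketch only: assumes gres1, gres2, gres3, and gres4 exist in memory
summarize gres1 gres2 gres3 gres4
```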


Test of normality 


To apply the auxiliary regression to implement the test for normality, we 
need to calculate the likelihood scores that are included as additional 
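regressors; see section 11.5.3. The components of the scores with respect to 
$\boldsymbol{\beta}$ are, up to scale, the first generalized residual times the relevant component of $\mathbf{x}$; for a 
censored observation, this is $-\lambda_i \mathbf{x}_i$. These can be 
computed by using the foreach command: 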


. * Generate the scores to use in the Lagrange multiplier test of normality 
. foreach var in $xlist { 
  2. generate score`var' = gres1*`var' 
  3. } 
. global scores score* gres1 gres2 


Recall that gres1 is the score with respect to the intercept $\beta_1$, and gres2 is 
the score with respect to $\sigma$. So the global scores contains all 
the scores with respect to $\boldsymbol{\beta}$ and $\sigma$ for the tobit model. 


To execute the regression-based test of normality, a test based on the 
third and fourth moments of $\varepsilon_i$, we regress one on gres3 and gres4, along 
with scores, and compute the $NR_u^2$ statistic. Because two moment 
conditions are being tested, the test statistic is $\chi^2(2)$ under the null 
hypothesis; the remaining variables in the regression are scores that need to 
be included for the test based on the auxiliary regression to be valid. 


. * Test of normality in tobit regression uses N R^2 from uncentered regression 
. generate one = 1 
. qui regress one gres3 gres4 $scores, noconstant 
. display "N R^2 = " e(N)*e(r2) " with p-value = " chi2tail(2,e(N)*e(r2)) 
N R^2 = 1832.128 with p-value = 0 


The outcome of the test is a very strong rejection of the normality 
hypothesis, even though the expenditure variable was transformed to 
logarithms. 


The properties of the conditional moment test as implemented here have 
been investigated by Skeels and Vella (1999), who found that using the 
asymptotic distribution of this test produces severe size distortions, even in 
moderately large samples. This is an important limitation of the test. 
Drukker (2002) developed a parametric bootstrap to correct the size 
distortion by using bootstrap critical values. His Monte Carlo results show 
that the test based on bootstrap critical values has reasonable power for 
samples larger than 500. 


The community-contributed tobcm command (Drukker 2002) 
implements this better variant of the test. The command works only after 
tobit and with the left-censoring point 0 and no right-censoring. To 
compare the above outcome of the normality test with that from the 
improved bootstrap version, the interested reader can run the tobcm 
command quite easily. 


Test of homoskedasticity 


For testing homoskedasticity, the alternative hypothesis is that the variance 
of $\varepsilon_i$ is of the form $\sigma^2 \exp(\mathbf{w}_i'\boldsymbol{\alpha})$, and we test whether 
$E[\mathbf{w}_i\{(y_i^* - \mathbf{x}_i'\boldsymbol{\beta})^2 - \sigma^2\}] = 0$. This leads to an auxiliary regression of 1 on 
$\mathbf{w}_i \times$ gres2$_i$; often, $\mathbf{w}$ is specified to be the same as $\mathbf{x}$, though it necessarily 
excludes the intercept. If $\dim(\mathbf{w}_i) = K$, then $NR_u^2 \sim \chi^2(K)$ under the null 
hypothesis. 


The following test sets $\mathbf{w} = \mathbf{x}$, aside from the intercept, so 6 moment 
conditions are tested, and inference is based on the $\chi^2(6)$ distribution; the 
remaining variables in the regression are scores that need to be included for 
the test based on the auxiliary regression to be valid. 


. * Test of homoskedasticity in tobit regression with w=x (aside from intercept) 
. foreach var in $xlist { 
  2. generate gres2by`var' = gres2*`var' 
  3. } 
. qui regress one gres2by* $scores, noconstant 
. display "N R^2 = " e(N)*e(r2) " with p-value = " chi2tail(6,e(N)*e(r2)) 
N R^2 = 596.73953 with p-value = 1.18e-125 


There is strong rejection of the null hypothesis of homoskedasticity against 
the alternative that the variance is of the form specified. If an investigator 
wants to specify different components of w, then the required modifications 
to the above commands are trivial. 
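
As an illustration of such a modification, the following is a minimal sketch that restricts w to two regressors. The choice of age and totchr and the variable prefix g2w_ are hypothetical and made only for illustration; the sketch assumes that one, gres2, and the global scores exist as created above. With two moment conditions, the statistic is compared with the chi-squared(2) distribution.

```stata
* Sketch only: homoskedasticity test with w = (age, totchr), a hypothetical choice
foreach var in age totchr {
    generate double g2w_`var' = gres2*`var'
}
quietly regress one g2w_* $scores, noconstant
* e(r2) after regress, noconstant is the uncentered R-squared
display "N R^2 = " e(N)*e(r2) " with p-value = " chi2tail(2,e(N)*e(r2))
```

The only changes relative to the commands above are the variable list in the foreach loop and the degrees of freedom passed to chi2tail().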


19.4.7 Next step? 


Despite the apparently satisfactory estimation results for the tobit model, the 
diagnostic tests reveal weaknesses. The failure of normality and 
homoskedasticity assumptions has serious consequences for censored-data 
regression that do not arise in the case of linear regression. A natural 
question that arises concerns the direction in which additional modeling 
effort might be directed to arrive at a more general model. 


Two approaches to such generalization will be considered. The two-part 
model, given in the next section, specifies one model for the censoring 
mechanism and a second distinct model for the outcome conditional on the 
outcome being observed. The sample-selection model, presented in the 
subsequent section, instead specifies a joint distribution for the censoring 
mechanism and outcome and then finds the implied distribution conditional 
on the outcome observed. 


19.5 Two-part model in logs 


The tobit regression makes a strong assumption that the same probability 
mechanism generates both censored and uncensored observations, where 
throughout we presume censoring at y = 0. It is more flexible to allow for 
the possibility that the zero and positive values are generated by different 
mechanisms. Many applications have shown that an alternative model, the 
two-part model or the hurdle model, can provide a better fit by relaxing the 
tobit model assumptions. 


This model is the natural next step in our modeling strategy. Again, we 
apply it to a model in logs rather than levels. 


19.5.1 Model structure 


The first part of the model is a binary outcome equation that models 
Pr(ambexp > 0), using any of the binary outcome models considered in 
chapter 17 (usually probit). The second part uses linear regression to model 
E (ln ambexp|ambexp > 0). The two parts are assumed to be independent 
and are usually estimated separately. 


Let y denote ambexp. Define a binary indicator, d, of positive expenditure 
such that d = 1 if y > 0 and d = 0 if y = 0. When y = 0, we model 
Pr(d = 0). For those with y > 0, let f(y|d = 1) be the conditional density 
of y. The two-part model for y is then given by 


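$$
f(y|\mathbf{x}) = \begin{cases} \Pr(d = 0|\mathbf{x}) & \text{if } y = 0 \\ \Pr(d = 1|\mathbf{x})\, f(y|d = 1, \mathbf{x}) & \text{if } y > 0 \end{cases} \tag{19.6}
$$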


The same regressors often appear in both parts of the model, but this can and 
should be relaxed if there are obvious exclusion restrictions. 


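The probit or the logit is an obvious choice for the first part. If a probit 
model is used, then $\Pr(d = 1|\mathbf{x}) = \Phi(\mathbf{x}_1'\boldsymbol{\beta}_1)$. If a lognormal model for 
$y|y > 0$ is given, then $(\ln y|d = 1, \mathbf{x}) \sim N(\mathbf{x}_2'\boldsymbol{\beta}_2, \sigma_2^2)$. Combining these, we 
have for the model in logs 

$$
E(y|\mathbf{x}_1, \mathbf{x}_2) = \Phi(\mathbf{x}_1'\boldsymbol{\beta}_1) \exp(\mathbf{x}_2'\boldsymbol{\beta}_2 + \sigma_2^2/2)
$$

where the second term uses the result that if $\ln y \sim N(\mu, \sigma^2)$, then 
$E(y) = \exp(\mu + \sigma^2/2)$. 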


ML estimation of (19.6) is straightforward because it separates the 
estimation of a discrete choice model using all observations and the 
estimation of the parameters of the density f(y|d = 1, x) using only the 
observations with y > 0. 
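
The formula for the conditional mean suggests a simple way to predict the level of expenditure from the two parts. The following is a minimal sketch, assuming the variables dy, lny, and ambexp and the global xlist used in this chapter; the names phat, xb2, s2sq, and yhat2pm are hypothetical. The prediction relies on normality and homoskedasticity of the errors in the second part, so it should be treated with caution if those assumptions fail.

```stata
* Sketch only: two-part prediction of E(y|x) under lognormality
quietly probit dy $xlist
predict double phat, pr                  // estimate of Pr(d=1|x)
quietly regress lny $xlist if dy==1
predict double xb2, xb                   // x2'b2, computed for all observations
scalar s2sq = e(rmse)^2                  // estimate of sigma_2^2
generate double yhat2pm = phat*exp(xb2 + s2sq/2)
summarize yhat2pm ambexp
```

Comparing the mean of the prediction with the sample mean of ambexp gives a rough check on how well the lognormal retransformation works.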


19.5.2 Part 1 specification 


In the example considered here, $\mathbf{x}_1 = \mathbf{x}_2$, but there is no reason why this 
should always be so. It is an advantage of the two-part model that it provides 
the flexibility to have different regressors in the two parts. In this example, 
the first part is modeled through a probit regression, and again one has the 
flexibility to change this to logit or complementary log-log regression. 
Comparing the results from the tobit, two-part, and selection models is a 
little easier if we use the probit form. 


. * Part 1 of the two-part model 
. probit dy $xlist, nolog vce(robust) 

Probit regression                               Number of obs     =      3,328 
                                                Wald chi2(6)      =     318.08 
                                                Prob > chi2       =     0.0000 
Log pseudolikelihood = -1197.6644               Pseudo R2         =     0.1754 

------------------------------------------------------------------------------ 
             |               Robust 
          dy | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
         age |    .097315   .0272976     3.56   0.000     .0438127    .1508173 
      female |   .6442089   .0610491    10.55   0.000      .524555    .7638629 
        educ |   .0701674   .0108653     6.46   0.000     .0488718    .0914631 
      blhisp |  -.3744867   .0610172    -6.14   0.000    -.4940781   -.2548952 
      totchr |   .7935208   .0739854    10.73   0.000     .6485121    .9385294 
         ins |   .1812415   .0611857     2.96   0.003     .0613198    .3011632 
       _cons |  -.7177087   .1860721    -3.86   0.000    -1.082403   -.3530141 
------------------------------------------------------------------------------ 

. scalar llprobit = e(ll) 


The probit regression indicates that all covariates are statistically significant 
determinants of the probability of positive expenditure. The standard ME 
calculations can be done for the first part, as illustrated in chapter 17. 


19.5.3 Part 2 of the two-part model 


The second part is a linear regression of lny, here ln(ambexp), on the 
regressors in the global macro xlist. 


. * Part 2 of the two-part model 
. regress lny $xlist if dy==1, vce(robust) 

Linear regression                               Number of obs     =      2,802 
                                                F(6, 2795)        =     134.62 
                                                Prob > F          =     0.0000 
                                                R-squared         =     0.1918 
                                                Root MSE          =     1.2696 

------------------------------------------------------------------------------ 
             |               Robust 
         lny | Coefficient  std. err.      t    P>|t|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
         age |   .2172327   .0220962     9.83   0.000     .1739062    .2605592 
      female |   .3793756   .0489726     7.75   0.000     .2833494    .4754018 
        educ |   .0222388   .0096629     2.30   0.021     .0032917    .0411859 
      blhisp |  -.2385321   .0560156    -4.26   0.000    -.3483682   -.1286961 
      totchr |   .5618171    .028185    19.93   0.000     .5065516    .6170826 
         ins |   -.020827   .0488003    -0.43   0.670    -.1165153    .0748614 
       _cons |   4.907825   .1717797    28.57   0.000     4.570997    5.244653 
------------------------------------------------------------------------------ 

. scalar lllognormal = e(ll) 
. predict rlambexp, residuals 


The coefficients of regressors in the second part have the same sign as those 
in the first part, aside from the ins variable, which is highly statistically 
insignificant in the second part. 


Given the assumption that the two parts are independent, the joint 
log likelihood for the two parts is the sum of the two log likelihoods, that is, 
-5,838.8. The computation is shown below. 


. * Create two-part model log likelihood 
. scalar lltwopart = llprobit + lllognormal // Two-part model log likelihood 
. display "lltwopart = " lltwopart 
lltwopart = -5838.8218 


By comparison, the log likelihood for the tobit model is -7,494.29. The 
two-part model fits the data considerably better, even if Akaike’s 
information criterion or Bayesian information criterion is used to penalize 
the two-part model for its additional parameters. 
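
The penalized comparison can be sketched with a short calculation. The parameter counts below are assumptions based on the six regressors in xlist: 7 probit parameters plus 8 lognormal parameters (7 coefficients and a variance) for the two-part model, and 8 parameters for the tobit model. The scalar names are hypothetical, and the log-likelihood values are those reported in the text.

```stata
* Sketch only: AIC = -2*lnL + 2*k, with parameter counts k as assumed above
scalar aic_twopart = -2*(-5838.8218) + 2*15   // 7 probit + 8 lognormal parameters
scalar aic_tobit   = -2*(-7494.29)   + 2*8    // 7 coefficients + 1 variance
display "AIC two-part = " aic_twopart "   AIC tobit = " aic_tobit
```

The Bayesian information criterion replaces the penalty 2*k with ln(N)*k, which with N = 3,328 is still far too small to offset the difference in log likelihoods.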


Does the two-part model eliminate the twin problems of 
heteroskedasticity and nonnormality? This is easily checked using the estat 
hettest and sktest commands. 


. * hettest and sktest commands require default standard errors 
. qui regress lny $xlist if dy==1 
. hettest 

Breusch-Pagan/Cook-Weisberg test for heteroskedasticity 
Assumption: Normal error terms 
Variable: Fitted values of lny 

H0: Constant variance 

    chi2(1) =  19.25 
Prob > chi2 = 0.0000 

. sktest rlambexp 

Skewness and kurtosis tests for normality 
                                                      ----- Joint test ----- 
    Variable |       Obs   Pr(skewness)   Pr(kurtosis)   Adj chi2(2)  Prob>chi2 
-------------+----------------------------------------------------------------- 
    rlambexp |     3,328         0.0000         0.0592        368.86     0.0000 


The tests on log expenditures for those with positive expenditures 
unambiguously reject the homoskedasticity and normality hypotheses. 
However, unlike the tobit model, neither condition is necessary for 
consistency of the estimator because the key assumption needed is that 
$E(\ln y|d = 1, \mathbf{x})$ is linear in $\mathbf{x}$. On the other hand, it is known that the OLS 
estimate of the residual variance will be biased in the presence of 
heteroskedasticity. This deficiency will extend to those predictors of $y$ that 
involve the residual variance. This point is pursued further in an 
end-of-chapter exercise. 


From the viewpoint of interpretation, the two-part model is flexible and 
attractive because it allows different covariates to have a different impact on 
the two parts of the model. For example, it allows a variable to make its 
impact entirely by changing the probability of a positive outcome, with no 
impact on the size of the outcome conditional on it being positive. In our 
example, ins has a small and statistically insignificant coefficient in the 
conditional regression but a positive and significant coefficient in the 
probit equation. 


19.6 Selection models 


The two-part model attains some of its flexibility and computational 
simplicity by assuming that the two parts—the decision to spend and the 
amount spent—are independent. Then selection is only on observables. 


This is a potential restriction on the model. If it is conceivable that, after 
one controls for regressors, those with positive expenditure levels are not 
randomly selected from the population, then the results of the second-stage 
regression suffer from selection bias. In that case, selection is on both 
observables and unobservables. 


The selection model used in this section considers the possibility of such 
bias by allowing for possible dependence across the two parts of the model. 
This new model is an example of a bivariate sample-selection model. It is 
known variously as the type-2 tobit model, the heckit model, and incidental 
truncation with a probit selection equation. 


There are several different ways that selection on unobservables may 
arise leading to different selection models. The type-2 tobit model is the 
most commonly used model and is used as one method to control for panel 
attrition in section 19.11. Other selection models include the endogenous 
switching regimes model given in section 19.9.3. 


A censored sample is one where the dependent variable is incompletely 
observed. Instead of observing $(y_i, \mathbf{x}_i)$, we observe $\{h(y_i), \mathbf{x}_i\}$, where 
examples of $h(\cdot)$ include left-censoring at 0 [so $h(y_i) = \max(0, y_i)$], 
right-censoring, interval-censoring, and observing only a binary outcome. The 
tobit, intreg, and probit commands cover these cases if $y$ is conditionally 
normally distributed. 


Sample selection is instead the case where data on some observations are 
missing. A simple example is truncation at 0 where we observe $(y_i, \mathbf{x}_i)$ only 
if $y_i > 0$. Another simple example is where data on $\mathbf{x}_i$, or a subcomponent 
of $\mathbf{x}_i$, are missing for some observations. A more complicated example is the 
selection model of this section where we may observe $y_i$ if a related variable 
crosses a threshold. 


The application in this section uses expenditures in logs. The same 
method can be applied without modification to expenditures in levels. 


19.6.1 Model structure and assumptions 


Throughout this section, an asterisk will denote a latent variable. Let $y_2^*$ 
denote the outcome of interest, here expenditure or log expenditure. In the 
standard tobit model, this outcome is observed if $y_2^* > 0$. A more general 
model introduces a second latent variable, $y_1^*$, and the outcome $y_2^*$ is 
observed if $y_1^* > 0$. In the present case, $y_1^*$ determines whether an individual 
has any ambulatory expenditure, $y_2^*$ determines the level or logarithm of 
expenditure, and $y_1^* \neq y_2^*$. 


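The two-equation model comprises a selection equation for $y_1$, where 

$$
y_1 = \begin{cases} 1 & \text{if } y_1^* > 0 \\ 0 & \text{if } y_1^* \le 0 \end{cases}
$$

and a resultant outcome equation for $y_2$, where 

$$
y_2 = \begin{cases} y_2^* & \text{if } y_1^* > 0 \\ - & \text{if } y_1^* \le 0 \end{cases}
$$

Here $y_2$ is observed only when $y_1^* > 0$, possibly taking a negative value, 
whereas $y_2$ need not take on any meaningful value when $y_1^* \le 0$. The classic 
version of the model is linear with additive errors, so 

$$
\begin{aligned}
y_1^* &= \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1 \\
y_2^* &= \mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_2
\end{aligned}
$$

with $\varepsilon_1$ and $\varepsilon_2$ possibly correlated. The tobit model is a special case 
where $y_1^* = y_2^*$. 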


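It is assumed that the correlated errors are jointly normally distributed 
and homoskedastic; that is, 

$$
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix} \right)
$$

where the normalization $\sigma_1^2 = 1$ is used because only the sign of $y_1^*$ is 
observed. Estimation by ML is straightforward. 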


The likelihood function for this model is 

$$
L = \prod_{i=1}^{N} \{\Pr(y_{1i}^* \le 0)\}^{1 - y_{1i}} \{f(y_{2i}|y_{1i}^* > 0) \times \Pr(y_{1i}^* > 0)\}^{y_{1i}}
$$

where the first term is the contribution when $y_{1i}^* \le 0$, because then $y_{1i} = 0$, 
and the second term is the contribution when $y_{1i}^* > 0$. This likelihood 
function can be specialized to models other than the linear model considered 
here. In the case of linear models with jointly normal errors, the bivariate 
density, $f^*(y_1^*, y_2^*)$, is normal, and hence the conditional density in the 
second term is univariate normal. 
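
For the linear model with bivariate normal errors, the likelihood can be written out explicitly. The following is a sketch of the resulting log likelihood, where $\rho = \sigma_{12}/\sigma_2$ denotes the error correlation; the symbol $\rho$ is introduced here for illustration and corresponds to the quantity reported as rho in the heckman output. It uses the fact that, given $\varepsilon_2$, the error $\varepsilon_1$ is normal with mean $\rho\varepsilon_2/\sigma_2$ and variance $1 - \rho^2$.

```latex
\ln L = \sum_{i=1}^{N} \Bigg[ (1 - y_{1i}) \ln \Phi(-\mathbf{x}_{1i}'\boldsymbol{\beta}_1)
  + y_{1i} \Bigg\{ \ln \Phi\!\left( \frac{\mathbf{x}_{1i}'\boldsymbol{\beta}_1
      + \rho\,(y_{2i} - \mathbf{x}_{2i}'\boldsymbol{\beta}_2)/\sigma_2}{\sqrt{1 - \rho^2}} \right)
  - \frac{1}{2} \left( \frac{y_{2i} - \mathbf{x}_{2i}'\boldsymbol{\beta}_2}{\sigma_2} \right)^{2}
  - \ln\!\left( \sqrt{2\pi}\,\sigma_2 \right) \Bigg\} \Bigg]
```

When $\rho = 0$, the second part separates into a probit term and a normal regression term, which is the log likelihood of the two-part model.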


The essential structure of the model and the ML estimation procedure are 
not affected by the decision to model positive expenditure on the log (rather 
than the linear) scale, although this does affect the conditional prediction of 
the level of expenditure. This step is taken here even though tests 
implemented in the previous two sections show that the normality and 
homoskedasticity assumptions are both questionable. 


19.6.2 ML estimation of the sample-selection model 


ML estimation of the bivariate sample-selection model with the heckman 
command is straightforward. The basic syntax for this command is 


heckman depvar [indepvars] [if] [in] [weight], 
    select([depvar_s =] varlist_s [, noconstant offset(varname_o)]) [options] 


where select() is the option for specifying the selection equation. One 
needs to specify variable lists for both the selection equation and for the 
outcome equation. In many cases, the investigator might use the same set of 
regressors in both equations. When this is done, it is often referred to as the 
case in which model identification is based solely upon the nonlinearity in 
the functional form. Because the selection equation is nonlinear, it 
potentially allows the higher powers of regressors to affect the selection 
variable. In the linear outcome equation, of course, the higher powers do not 
appear. Therefore, the nonlinearity of the selection regression automatically 
generates exclusion restrictions. That is, it allows for an independent source 
of variation in the probability of a positive outcome, hence the term 
“identification through nonlinear functional form”. 


The specification of the selection equation involves delicate 
identification issues. For example, if the nonlinearity implied by the probit 
model is slight, then the identification will be fragile. Thus, it is common in 
applied work to look for exclusion restrictions. The investigator seeks 
variables that can generate nontrivial variation in the selection variable but 
do not affect the outcome variable directly. This is exactly the same 
argument as was encountered in earlier chapters in the context of 
instrumental variables (IV). A valid exclusion restriction arises if a suitable 
instrument is available, and this may vary from case to case. We will 
illustrate the practical importance of these ideas in the examples that follow. 


An alternative to the heckman command is available within the family of 
extended regression model (ERM) commands presented in section 23.7. This 
alternative command, eregress, uses the option select() and has syntax 
similar to the heckman command as in, for example, eregress lny x, 
select(dy = x z) nolog, where z is an exogenous variable in the selection 
equation but excluded from the outcome equation. 


Sample selection extends beyond continuous outcome models. For 
example, binary outcomes may be subject to sample selection also; that is, 
the probability of observing a 0/1 outcome may be subject to sampling bias. 
Given that the probit model is based on the normal distribution, Heckman’s 
approach for handling sample selection in the standard normal regression 
can be extended also to the probit regression. Stata’s command heckprobit 
will handle such a case; its syntax is similar to the heckman command. 
Alternatively, the equivalent eprobit command can be used. A further 
extension to ordered probit models with sample selection is also feasible 
using the heckoprobit command or the equivalent eoprobit command, 
which is also a member of the ERM suite of commands based on the recursive 
structure assumption. The heckpoisson command accommodates selection 
in the Poisson model. 


19.6.3 Estimation without exclusion restrictions 


We first estimate the parameters of the selection model without exclusion 
restrictions. 


. * Heckman MLE without exclusion restrictions 
. heckman lny $xlist, select(dy = $xlist) nolog vce(robust) 

Heckman selection model                         Number of obs     =      3,328 
(regression model with sample selection)              Selected    =      2,802 
                                                      Nonselected =        526 

                                                Wald chi2(6)      =     517.92 
Log pseudolikelihood = -5838.397                Prob > chi2       =     0.0000 

------------------------------------------------------------------------------ 
             |               Robust 
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
lny          | 
         age |   .2122921   .0225168     9.43   0.000       .16816    .2564242 
      female |    .349728    .053152     6.58   0.000     .2455519     .453904 
        educ |   .0188724   .0099991     1.89   0.059    -.0007256    .0384704 
      blhisp |  -.2196042    .057581    -3.81   0.000     -.332461   -.1067474 
      totchr |   .5409537    .029986    18.04   0.000     .4821822    .5997252 
         ins |  -.0295368   .0491937    -0.60   0.548    -.1259547    .0668811 
       _cons |   5.037418   .1976607    25.49   0.000      4.65001    5.424826 
-------------+---------------------------------------------------------------- 
dy           | 
         age |   .0984482     .02728     3.61   0.000     .0449804     .151916 
      female |   .6436686   .0610712    10.54   0.000     .5239713    .7633659 
        educ |   .0702483   .0108621     6.47   0.000     .0489589    .0915377 
      blhisp |  -.3726284   .0610389    -6.10   0.000    -.4922625   -.2529944 
      totchr |   .7946708   .0738284    10.76   0.000     .6499697    .9393718 
         ins |   .1821233   .0611301     2.98   0.003     .0623106     .301936 
       _cons |  -.7244413   .1862954    -3.89   0.000    -1.089574    -.359309 
-------------+---------------------------------------------------------------- 
     /athrho |   -.124847   .0949188    -1.32   0.188    -.3108845    .0611905 
    /lnsigma |   .2395983   .0152022    15.76   0.000     .2098026    .2693941 
-------------+---------------------------------------------------------------- 
         rho |  -.1242024   .0934546                     -.3012415    .0611142 
       sigma |   1.270739    .019318                      1.233435    1.309171 
      lambda |  -.1578287   .1190885                     -.3912379    .0755805 
------------------------------------------------------------------------------ 
Wald test of indep. eqns. (rho = 0): chi2(1) = 1.73   Prob > chi2 = 0.1884 


The log likelihood for this model is very slightly higher than that for the 
two-part model: -5,838.4 compared with -5,838.8 (see section 19.5.3). 
Consistent with this small difference is the finding that $\hat{\rho} = -0.124$ with the 
95% confidence interval [-0.301, 0.061]. The Wald test has a p-value of 
0.188. 


Thus, the estimated correlation between the errors is not significantly 
different from zero, and the hypothesis that the two parts are independent 
cannot be rejected. This is an important result. For some types of censored 
data, notably, hours of work, it is necessary to control for selection, and it 
can make a big difference. For other types of data, such as medical 
expenditure data, there is often less of a selection problem, and a two-part 
model may be sufficient. 


The foregoing conclusion should be treated with caution because the 
model is based on a bivariate normality assumption that is itself suspect. The 
two-step estimation, considered next, relies on a univariate normality 
assumption and is expected to be relatively more robust. 


19.6.4 Two-step estimation 


The MLE in the sample-selection model assumes joint normality of the errors 
$\varepsilon_1$ and $\varepsilon_2$. This assumption can be relaxed. Instead, we need assume only 
that 

$$
\begin{aligned}
\varepsilon_1 &\sim N(0, 1) \\
\varepsilon_2 &= \sigma_{12}\varepsilon_1 + \eta
\end{aligned}
$$

where $\eta$ is an independent error. The outcome equation error is a 
multiple of the selection equation error, which is standard normal 
distributed, plus noise. 


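It can be shown that these assumptions imply that 

$$
E(y_2|\mathbf{x}, y_1^* > 0) = \mathbf{x}_2'\boldsymbol{\beta}_2 + \sigma_{12}\lambda(\mathbf{x}_1'\boldsymbol{\beta}_1) \tag{19.7}
$$

where $\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)$. The motivation is that $y_2^* = \mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_2$ so that, 
given selection, $E(y_2|\mathbf{x}, y_1^* > 0) = \mathbf{x}_2'\boldsymbol{\beta}_2 + E(\varepsilon_2|y_1^* > 0)$. Given the above 
assumptions, 

$$
E(\varepsilon_2|y_1^* > 0) = E(\varepsilon_2|\varepsilon_1 > -\mathbf{x}_1'\boldsymbol{\beta}_1) = \sigma_{12}E(\varepsilon_1|\varepsilon_1 > -\mathbf{x}_1'\boldsymbol{\beta}_1) = \sigma_{12}\lambda(\mathbf{x}_1'\boldsymbol{\beta}_1)
$$

where the last result uses the formula for the left-truncated mean of the 
standard normal. 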


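The second term in (19.7) can be estimated by $\lambda(\mathbf{x}_1'\widehat{\boldsymbol{\beta}}_1)$, where $\widehat{\boldsymbol{\beta}}_1$ is 
obtained by probit regression of $y_1$ on $\mathbf{x}_1$. The OLS regression of $y_2$ on $\mathbf{x}_2$ and 
the generated regressor, $\lambda(\mathbf{x}_1'\widehat{\boldsymbol{\beta}}_1)$, called the inverse of Mills’s ratio or the 
nonselection hazard, yields a semiparametric estimate of $(\boldsymbol{\beta}_2, \sigma_{12})$. The 
calculation of the standard errors, however, is complicated by the presence in 
the regression of the generated regressor, $\lambda(\mathbf{x}_1'\widehat{\boldsymbol{\beta}}_1)$. 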
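
The two steps can also be carried out by hand, which makes the mechanics of the estimator explicit. The following is a minimal sketch, assuming the variables dy and lny and the global xlist used in this chapter; the variable names xb1 and imr are hypothetical. The point estimates should agree with those from the two-step estimator, but the OLS standard errors in the second step are not valid because they ignore that imr is a generated regressor.

```stata
* Sketch only: manual two-step estimator
quietly probit dy $xlist
predict double xb1, xb                            // first-step index x1'b1
generate double imr = normalden(xb1)/normal(xb1)  // inverse of Mills's ratio
regress lny $xlist imr if dy==1                   // coefficient of imr estimates sigma_12
```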


The addition of the twostep option to heckman yields the two-step 
estimator. 


. * Heckman two-step without exclusion restrictions 
. heckman lny $xlist, select(dy = $xlist) twostep 

Heckman selection model -- two-step estimates   Number of obs     =      3,328 
(regression model with sample selection)              Selected    =      2,802 
                                                      Nonselected =        526 

                                                Wald chi2(6)      =     189.46 
                                                Prob > chi2       =     0.0000 

------------------------------------------------------------------------------ 
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
lny          | 
         age |    .202124   .0242974     8.32   0.000     .1545019    .2497462 
      female |   .2891575    .073694     3.92   0.000     .1447199    .4335951 
        educ |   .0119928   .0116839     1.03   0.305    -.0109072    .0348928 
      blhisp |  -.1810582   .0658522    -2.75   0.006    -.3101261   -.0519904 
      totchr |   .4983315   .0494699    10.07   0.000     .4013724    .5952907 
         ins |  -.0474019   .0531541    -0.89   0.373     -.151582    .0567782 
       _cons |   5.302572   .2941363    18.03   0.000     4.726076    5.879069 
-------------+---------------------------------------------------------------- 
dy           | 
         age |    .097315   .0270155     3.60   0.000     .0443656    .1502645 
      female |   .6442089   .0601499    10.71   0.000     .5263172    .7621006 
        educ |   .0701674   .0113435     6.19   0.000     .0479345    .0924003 
      blhisp |  -.3744867   .0617541    -6.06   0.000    -.4955224   -.2534509 
      totchr |   .7935208   .0711156    11.16   0.000     .6541367    .9329048 
         ins |   .1812415   .0625916     2.90   0.004     .0585642    .3039187 
       _cons |  -.7177087   .1924667    -3.73   0.000    -1.094937   -.3404809 
-------------+---------------------------------------------------------------- 
/mills       | 
      lambda |  -.4801696   .2906565    -1.65   0.099    -1.049846    .0895067 
-------------+---------------------------------------------------------------- 
         rho |   -0.37130 
       sigma |  1.2932083 
------------------------------------------------------------------------------ 


The reported standard errors for the regression coefficients, $\widehat{\boldsymbol{\beta}}_2$, control for 
the estimation error of $\lambda(\mathbf{x}_1'\widehat{\boldsymbol{\beta}}_1)$; see [R] heckman. These standard errors are 
in general larger than those from the ML estimation. Although no standard 
error is provided for rho=lambda/sigma, the hypothesis of independence of 
$\varepsilon_1$ and $\varepsilon_2$ can be tested directly by using the coefficient of lambda, because 
from (19.7), this is the error covariance $\sigma_{12}$. The coefficient of lambda has a 
larger z statistic in absolute value, -1.65, than in the ML case, and it is significantly different 
from 0 at any significance level higher than 0.099. Thus, the two-step estimator 
produces somewhat stronger evidence of selection than does the ML 
estimator. 
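
The relationships among rho, sigma, and lambda in the two sets of output can be verified with a short calculation. The following sketch uses only numbers reported in the output above; the scalar names are hypothetical. For the two-step estimator, rho equals lambda divided by sigma, whereas for the ML estimator, lambda equals rho times sigma, with rho and sigma recovered as tanh(athrho) and exp(lnsigma).

```stata
* Sketch only: reproduce derived quantities from the reported estimates
scalar rho_2step = -.4801696/1.2932083            // lambda/sigma, two-step output
scalar lambda_ml = tanh(-.124847)*exp(.2395983)   // rho*sigma, ML output
display "two-step rho = " rho_2step "   ML lambda = " lambda_ml
```

The results should agree, up to rounding, with the reported values of rho in the two-step output and lambda in the ML output.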


The heckman command standard errors are based on analytical results 
and are not robust. Robust standard errors can be obtained by a bootstrap, as 
done for this example in section 12.4.5. 
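
A minimal sketch of the bootstrap for the two-step estimator follows. The number of replications and the seed are arbitrary illustrative choices, not values taken from section 12.4.5.

```stata
* Sketch only: bootstrap standard errors for the two-step estimator
* reps(400) and seed(10101) are arbitrary illustrative choices
heckman lny $xlist, select(dy = $xlist) twostep vce(bootstrap, reps(400) seed(10101))
```

Each bootstrap replication reestimates both the probit first step and the OLS second step, so the resulting standard errors account for the estimation error in the generated regressor.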


Alternatively, we can stack the moment conditions for the two steps and 
estimate by just-identified generalized method of moments (GMM). The 
estimating equations for the first-step ML probit and the second step OLS 
estimation are, respectively, 


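$$
\sum_{i=1}^{N} \frac{\{y_{1i} - \Phi(\mathbf{x}_{1i}'\boldsymbol{\beta}_1)\}\,\phi(\mathbf{x}_{1i}'\boldsymbol{\beta}_1)}{\Phi(\mathbf{x}_{1i}'\boldsymbol{\beta}_1)\{1 - \Phi(\mathbf{x}_{1i}'\boldsymbol{\beta}_1)\}}\,\mathbf{x}_{1i} = \mathbf{0}
$$

$$
\sum_{i=1}^{N} y_{1i}\{y_{2i} - \mathbf{x}_{2i}'\boldsymbol{\beta}_2 - \sigma_{12}\lambda(\mathbf{x}_{1i}'\boldsymbol{\beta}_1)\} \begin{bmatrix} \mathbf{x}_{2i} \\ \lambda(\mathbf{x}_{1i}'\boldsymbol{\beta}_1) \end{bmatrix} = \mathbf{0}
$$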


This model can be fit using the gmm command by a method similar to that in 
section 13.3.11 for the control function estimator in the linear model. 


19.6.5 Estimation with exclusion restrictions 


The standard errors of the two-step estimator are larger than those of the ML 
estimator in part because the variable $\lambda(\mathbf{x}_1'\widehat{\boldsymbol{\beta}}_1)$ can be collinear with the 
other regressors in the outcome equation ($\mathbf{x}_2$). This is highly likely if 
$\mathbf{x}_1 = \mathbf{x}_2$, as would be the case when there are no exclusion restrictions. 
Having exclusion restrictions, so that $\mathbf{x}_1 \neq \mathbf{x}_2$, may reduce the collinearity 
problem, especially in small samples. 


For more robust identification, it is usually recommended, as has been 
explained above, that exclusion restrictions be imposed. This requires that 
the selection equation has an exogenous variable that is excluded from the 
outcome equation. Moreover, the excluded variable should have a substantial 
(nontrivial) impact on the probability of selection. Because it is often hard to 
come up with an excluded variable that does not directly affect the outcome 
and does affect the selection, the investigator should have strong justification 
for imposing the exclusion restriction. 


We repeat the ML computation of the Heckman model with an additional 
regressor, income, in the selection equation. 


. * Heckman MLE with exclusion restriction 
. heckman lny $xlist, select(dy = $xlist income) nolog vce(robust) 

Heckman selection model                         Number of obs     =      3,328 
(regression model with sample selection)              Selected    =      2,802 
                                                      Nonselected =        526 

                                                Wald chi2(6)      =     484.20 
Log pseudolikelihood = -5836.219                Prob > chi2       =     0.0000 

------------------------------------------------------------------------------ 
             |               Robust 
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
lny          | 
         age |   .2119749   .0225429     9.40   0.000     .1677915    .2561583 
      female |   .3481441   .0541676     6.43   0.000     .2419776    .4543105 
        educ |    .018716   .0099997     1.87   0.061    -.0008831     .038315 
      blhisp |  -.2185714   .0578622    -3.78   0.000    -.3319792   -.1051637 
      totchr |     .53992    .030624    17.63   0.000      .479898    .5999419 
         ins |  -.0299871   .0492054    -0.61   0.542     -.126428    .0664538 
       _cons |   5.044056   .1999372    25.23   0.000     4.652187    5.435926 
-------------+---------------------------------------------------------------- 
dy           | 
         age |   .0879359   .0278018     3.16   0.002     .0334453    .1424264 
      female |   .6626649    .061461    10.78   0.000     .5422036    .7831263 
        educ |   .0619485   .0113921     5.44   0.000     .0396203    .0842766 
      blhisp |  -.3639377   .0612186    -5.94   0.000    -.4839239   -.2439515 
      totchr |   .7969518   .0736115    10.83   0.000     .6526759    .9412276 
         ins |   .1701367    .061304     2.78   0.006     .0499831    .2902904 
      income |   .0027078   .0013265     2.04   0.041     .0001078    .0053077 
       _cons |  -.6760546   .1873167    -3.61   0.000    -1.043189   -.3089207 
-------------+---------------------------------------------------------------- 
     /athrho |  -.1313456   .1029162    -1.28   0.202    -.3330576    .0703664 
    /lnsigma |   .2398173   .0152203    15.76   0.000     .2099862    .2696485 
-------------+---------------------------------------------------------------- 
         rho |  -.1305955   .1011609                     -.3212655    .0702505 
       sigma |   1.271017   .0193452                      1.233661    1.309504 
      lambda |  -.1659891    .128968                     -.4187617    .0867836 
------------------------------------------------------------------------------ 
Wald test of indep. eqns. (rho = 0): chi2(1) = 1.63   Prob > chi2 = 0.2019 


The results are only slightly different from those reported above, although 
income appears to have significant additional explanatory power. 
Furthermore, the use of this exclusion restriction is debatable because there 
are reasons to expect that income should also appear in the outcome 
equation. 


Another command that delivers output almost identical to that produced 
by the heckman ML command is Stata’s eregress command, which has 
syntax almost identical to that of the former, as can be seen below (but 
output is suppressed): 

. * Heckman MLE with exclusion restriction 
. eregress lny $xlist, select(dy = $xlist income) nolog vce(robust) 
  (output omitted) 


19.7 Nonnormal models of selection 


The classical fully parametric approach to estimation of selection models has 
limitations. First, controlling for selection relies heavily on the distributional 
assumption of joint normality. Second, it typically requires that we have 
available a nontrivial excluded exogenous variable that serves as an 
identifying restriction. Further, its test of selection bias relies on a linear 
measure of dependence (correlation) between the unobserved factors that 
simultaneously impact both selection and the outcome. 


In principle, one can generalize the selection model to any bivariate 
distribution, but in practice there are relatively few suitable bivariate 
distributions. For example, the bivariate normal is unusual in its tractability 
and in its properties of having conditional distributions and marginal 
distributions that are normally distributed. 


19.7.1 Copula models 


These limitations can be addressed using a copula-based estimator. This 
permits choice of marginal distributions for the two parts of the selection 
model that are well-suited to the type of data being analyzed and then uses a 
copula function to introduce correlation or, more generally, dependence, 
between these two parts. 


This fully parametric approach is founded on Sklar's theorem, which 
proves that under certain conditions, marginal distributions of variables, say, 
$f(y_1|\beta_1)$ and $g(y_2|\beta_2)$, can be combined using a copula function to obtain a 
joint distribution $h(y_1, y_2|\beta_1, \beta_2, \theta)$, where $\theta$ is a scalar measure of 
dependence between $y_1$ and $y_2$. The properties of the dependence parameter 
vary with the functional form of the copula used, and there are many 
potential copulas with varying degrees of restrictiveness. Given the choice of 
functional forms of marginals and the copula, ML methods can be used to 
estimate all the parameters. 
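
For reference, the content of Sklar's theorem in the bivariate continuous case can be written out as follows. The symbols $F_1$ and $F_2$ (marginal distribution functions), $f_1$ and $f_2$ (the corresponding densities), and $c(\cdot)$ (the copula density) are introduced here only for illustration, and conditioning on the marginal parameters is suppressed. 

```latex
\begin{align*}
H(y_1, y_2) &= C\{F_1(y_1), F_2(y_2); \theta\}, \\
h(y_1, y_2) &= c\{F_1(y_1), F_2(y_2); \theta\}\, f_1(y_1)\, f_2(y_2),
\qquad
c(u_1, u_2; \theta) = \frac{\partial^2 C(u_1, u_2; \theta)}{\partial u_1\, \partial u_2}.
\end{align*}
```

With the product copula $C(u_1, u_2) = u_1 u_2$, the copula density equals 1 and the joint density reduces to the product of the marginals, which is the independence case. 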


The first step of this approach is to specify the marginal distributions 
(which will be conditioned on exogenous variables z) for the latent variable 
that determines the selection indicator and for the latent outcome variable. A 
logistic (logit) or normal (probit) is the obvious choice of distribution for the 
selection latent variable, while the distribution for the latent outcome will 
vary with the type of data. 


The second step is to use the copula function that binds these two 
densities. Let $f(y_1^*|z, \beta_1)$ and $g(y_2^*|z, \beta_2)$ denote the densities for, 
respectively, the selection equation latent variable and the outcome latent 
variable. These are combined using a specified parametric copula, denoted 
$C\{f(y_1^*|z, \beta_1), g(y_2^*|z, \beta_2), \theta\}$, where $\theta$ is a scalar-valued dependence 
parameter. This generates a joint distribution in latent variables that leads to 
a tractable joint density for the observed data $(d, y_2)$ that can be estimated by 
ML. This method separates the specification and estimation of the marginals 
from the problem of inference about the dependence parameter. 


The flexibility of the approach comes from the feasibility of varying both 
the marginal distributions and the functional form of the copula, $C(\cdot)$. 
Further, provided the margins are sufficiently flexible, this approach 
potentially can model a variety of symmetric and asymmetric dependence 
structures, not just linear dependence as under a normal distribution. A 
broader type of sample-selection effect can be captured. 


Table 19.2 presents the functional forms of four widely used bivariate 
copulas for uniform random variables $(u_1, u_2)$. Different copulas have 
different domains and place different restrictions on the dependence 
parameter. Some restrict dependence to be nonnegative. And the special case 
of independence does not always correspond to $\theta = 0$. For example, the 
Gumbel copula implies independence if $\theta = 1$. 
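
The Gumbel case can be checked directly. Writing the Gumbel copula as $C(u_1, u_2; \theta) = \exp\{-(\tilde{u}_1^{\theta} + \tilde{u}_2^{\theta})^{1/\theta}\}$ with $\tilde{u}_j = -\ln u_j$, setting $\theta = 1$ gives the product (independence) copula: 

```latex
\begin{align*}
C(u_1, u_2; \theta = 1)
  &= \exp\{-(\tilde{u}_1 + \tilde{u}_2)\} \\
  &= \exp(\ln u_1 + \ln u_2) \\
  &= u_1 u_2 .
\end{align*}
```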


Table 19.2. Some standard copula functions 


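Copula type    Function $C(u_1, u_2)$                                                       $\theta$-domain 


Gaussian       $\Phi_G\{\Phi^{-1}(u_1), \Phi^{-1}(u_2); \theta\}$                                   $-1 < \theta < +1$ 
Student's t    $t_{2,\nu}\{t_\nu^{-1}(u_1), t_\nu^{-1}(u_2); \theta\}$                               $-1 < \theta < +1$ 
Clayton        $(u_1^{-\theta} + u_2^{-\theta} - 1)^{-1/\theta}$                                      $(0, \infty)$ 
Gumbel         $\exp\{-(\tilde{u}_1^{\theta} + \tilde{u}_2^{\theta})^{1/\theta}\}$, $\tilde{u}_j = -\ln u_j$   $[1, \infty)$ 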


Having to choose the functional form of the copula is a potential 
disadvantage. In practice, one may consider several combinations of copulas 
and marginal distributions and select the “best” model according to an 
information criterion. 


19.7.2 Copula application 


The community-contributed heckmancopula command (Hasebe 2013) is 
structured to fit the classic Heckman selection model with a wide range of 
distributions. The latent variable for the selection equation may be normal 
or logistic. The latent outcome variable may be normal, logistic, or 
Student’s t. The copula may be product, gaussian, fgm, plackett, amh, 
clayton, frank, gumbel, or joe. 


We first present results for the Gaussian copula with normal marginals. 
This duplicates ML estimation of the standard selection model using the 
heckman command. 


. * Copula-based selection models with Gaussian copula (same as heckman) 
. qui use mus219mepsambexp, clear 


. global xlist age female educ blhisp totchr ins 


// Regressor list $xlist 


. heckmancopula lny $xlist, select(dy = $xlist income) copula(gaussian) 
> margsel(probit) margin1(normal) vce(robust) 


Iteration 0:  log pseudolikelihood = -5836.7639 
Iteration 1:  log pseudolikelihood = -5836.2404 
Iteration 2:  log pseudolikelihood = -5836.2193 
Iteration 3:  log pseudolikelihood = -5836.2192 


Sample Selection Model: Copula gaussian, Margins probit-normal 


Number of obs = 3,328 
Wald chi2(7) = 325.56 
Log pseudolikelihood = -5836.2192 Prob > chi2 = 0.0000 
Robust 
Coefficient std. err. z P>|z| [95% conf. interval] 
select 
age .0879359 .0278018 3.16 0.002 .0334453 .1424264 
female .6626649 .061461 10.78 0.000 .5422036 .7831263 
educ .0619485 .0113921 5.44 0.000 .0396203 .0842766 
blhisp -.3639377 .0612186 -5.94 0.000 -.4839239 -.2439515 
totchr .7969518 .0736115 10.83 0.000 .652676 .9412277 
ins .1701368 .061304 2.78 0.006 .0499831 .2902904 
income .0027078 .0013265 2.04 0.041 .0001078 .0053077 
_cons -.6760546 .1873167 -3.61 0.000 -1.043189 -.3089207 
lny 
age .2119749 .0225429 9.40 0.000 .1677915 .2561582 
female .3481439 .0541676 6.43 0.000 .2419774 .4543104 
educ .018716 .0099997 1.87 0.061 -.0008831 .038315 
blhisp -.2185714 .0578622 -3.78 0.000 -.3319791 -.1051636 
totchr .5399199 .030624 17.63 0.000 .4798979 .5999418 
ins -.0299872 .0492054 -0.61 0.542 -.1264281 .0664537 
_cons 5.044057 .1999372 25.23 0.000 4.652187 5.435927 
lnsigma 
_cons .2398173 .0152203 15.76 0.000 .2099862 .2696485 
atheta 
_cons -.1313461 .1029162 -1.28 0.202 -.3330582 .0703661 
theta -.1305959 .101161 -.321266 .0702502 
tau .0833781 .0649574 -.0447595 .2082167 
Wald test of independence : Test statistic 1.667 with p-value 0.1967 


. estimates store Gaussian_N 


The output matches that obtained earlier for the heckman example with exclusion 
restriction. The parameter theta is the Pearson correlation coefficient and is 
not significantly different from zero by either the Wald test or the Lagrange 
multiplier test. The parameter tau reports Kendall's tau, an alternative 
measure of correlation to the Pearson correlation coefficient. There is little 
evidence of significant selection. 
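
The ancillary parameters in the output are reported on a transformed scale, and the link to theta and sigma can be checked by hand. The sketch below assumes that atheta is the inverse hyperbolic tangent of theta and that lnsigma is the log of sigma, the same transformations that heckman uses for /athrho and /lnsigma. This is inferred from the matching numbers rather than taken from the heckmancopula documentation. 

```stata
* Back-transform the ancillary parameters reported above
* (assumes theta = tanh(atheta) and sigma = exp(lnsigma))
display "theta = " tanh(-.1313461)
display "sigma = " exp(.2398173)
```

Up to rounding, the two values agree with theta = -.1305959 in the output above and with sigma = 1.271017 in the earlier heckman output. 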


Next, we fit three additional nonnormal selection models. All use probit 
for the selection equation. All but the last use the normal for the outcome 
equation. The Frank, Farlie-Gumbel-Morgenstern, and Plackett copulas are 
used. 


. * Copula-based selection models with several different copulas 
. qui heckmancopula lny $xlist, select(dy = $xlist income) copula(frank) 
> margsel(probit) margin1(normal) vce(robust) 


. estimates store Frank_N 


. qui heckmancopula lny $xlist, select(dy = $xlist income) 
> copula(fgm) margsel(probit) margin1(normal) vce(robust) 


. estimates store FGM_N 


. qui heckmancopula lny $xlist, select(dy = $xlist income) 
> copula(plackett) margsel(probit) margin1(t) df(10) vce(robust) 


. estimates store Plackett_t10 


. estimates table Gaussian_N Frank_N FGM_N Plackett_t10, 
> eq(1) se b(%13.4f) stats(N ll) 


Variable Gaussian_N Frank_N FGM_N Plackett_t10 
#1 
age 0.0879 0.0868 0.0869 0.0885 
0.0278 0.0278 0.0281 0.0279 
female 0.6627 0.6635 0.6635 0.6641 
0.0615 0.0615 0.0615 0.0614 
educ 0.0619 0.0619 0.0619 0.0618 
0.0114 0.0114 0.0114 0.0114 
blhisp -0.3639 -0.3658 -0.3657 -0.3637 
0.0612 0.0612 0.0613 0.0612 
totchr 0.7970 0.7957 0.7959 0.7982 
0.0736 0.0738 0.0741 0.0736 
ins 0.1701 0.1691 0.1691 0.1700 
0.0613 0.0614 0.0614 0.0613 
income 0.0027 0.0027 0.0027 0.0027 
0.0013 0.0013 0.0013 0.0013 
_cons -0.6761 -0.6684 -0.6690 -0.6774 
0.1873 0.1872 0.1888 0.1878 
lny 
age 0.2120 0.2173 0.2171 0.2164 
0.0225 0.0221 0.0229 0.0223 
female 0.3481 0.3798 0.3786 0.3621 
0.0542 0.0489 0.0553 0.0498 
educ 0.0187 0.0223 0.0222 0.0244 
0.0100 0.0097 0.0098 0.0099 
blhisp -0.2186 -0.2388 -0.2380 -0.2259 
0.0579 0.0560 0.0584 0.0561 
totchr 0.5399 0.5621 0.5613 0.5509 
0.0306 0.0282 0.0310 0.0278 
ins -0.0300 -0.0207 -0.0210 -0.0363 
0.0492 0.0487 0.0489 0.0483 
_cons 5.0441 4.9061 4.9109 4.9589 
0.1999 0.1716 0.1993 0.1800 
lnsigma 
_cons 0.2398 0.2374 0.2374 0.1381 
0.0152 0.0151 0.0151 0.0145 
atheta 
_cons -0.1313 0.0098 -0.0089 -0.2924 
0.1029 0.0004 0.3229 0.1563 
Statistics 
N 3328 3328 3328 3328 
ll -5836.2192 -5836.6736 -5836.6729 -5827.7743 


Legend: b/se 


All models use the same number of parameters, so we can directly compare 
log likelihoods. The first three models give similar log likelihoods, which is 
not surprising, because there appears to be little evidence of sample selection 
for these data. The final model fits considerably better because we use a t 
distribution rather than the normal distribution for the outcome; the t 
distribution has heavier tails than the normal. 


19.8 Prediction from models with outcome in logs 


For the models considered in this chapter, conditional prediction is an 
important application of the estimated parameters of the model. Such an 
exercise may be of the within-sample type, or it may involve comparison of 
fitted values under alternative scenarios, as illustrated in section 4.3. 
Whether a model predicts well within the sample is obviously an important 
consideration in model comparison and selection. 


Calculation and comparison of predicted values is relatively simple in 
the levels form of the model because there is no retransformation involved. 
In the current analysis, the dependent variable is log transformed, but one 
wants predictions in levels, and hence the retransformation problem, first 
mentioned in section 4.2.3, must be confronted. 
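
The retransformation problem can be illustrated with a small simulation. The sketch below is not based on the chapter's dataset: the variable names and the data-generating process (normal, homoskedastic errors with standard deviation 1.2) are assumptions chosen for illustration. It compares the naive prediction exp(xb) with the prediction that adds the normal-theory adjustment. 

```stata
* Simulated illustration of retransformation from logs to levels
clear
set seed 10101
set obs 10000
generate x = rnormal()
generate lny = 1 + 0.5*x + rnormal(0, 1.2)   // normal homoskedastic error
generate y = exp(lny)
quietly regress lny x
predict double xb, xb
generate double ynaive  = exp(xb)                    // ignores the error term
generate double ynormal = exp(xb + 0.5*e(rmse)^2)    // normal-theory adjustment
summarize y ynaive ynormal
```

Under these assumptions, the mean of ynaive should fall well short of the mean of y, while the mean of ynormal should be much closer to it. 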


Table 19.3 provides expressions for the conditional and unconditional 
means for the three models with outcome in logs rather than levels, 
presented in sections 19.4-19.6. The predictors are functions that depend 
upon the linear-index function, $x'\beta$, and variance and covariance 
parameters, $\sigma^2$, $\sigma_2^2$, and $\sigma_{12}$. These formulas are derived under the twin 
assumptions of normality and homoskedasticity. The dependence of the 
predictor on variances estimated under the assumption of homoskedastic 
errors is potentially problematic for all three models because if that 
assumption is incorrect, then the usual estimators of variance and covariance 
parameters will be biased. 


Table 19.3. Conditional and unconditional means for models in logs 


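Moment                   Model       Prediction function 


$E(y|x, y > 0)$          Tobit       $\exp(x'\beta + \sigma^2/2)[1 - \Phi\{(\gamma - x'\beta)/\sigma\}]^{-1}$ 
                                     $\times [1 - \Phi\{(\gamma - x'\beta - \sigma^2)/\sigma\}]$ 

$E(y|x)$                 Tobit       $\exp(x'\beta + \sigma^2/2)[1 - \Phi\{(\gamma - x'\beta - \sigma^2)/\sigma\}]$ 

$E(y_2|x, y_2 > 0)$      Two-part    $\exp(x_2'\beta_2 + \sigma_2^2/2)$ 

$E(y_2|x)$               Two-part    $\exp(x_2'\beta_2 + \sigma_2^2/2)\Phi(x_1'\beta_1)$ 

$E(y_2|x, y_2 > 0)$      Selection   $\exp(x_2'\beta_2 + \sigma_2^2/2)\{1 - \Phi(-x_1'\beta_1)\}^{-1}$ 
                                     $\times \{1 - \Phi(-x_1'\beta_1 - \sigma_{12})\}$ 

$E(y_2|x)$               Selection   $\exp(x_2'\beta_2 + \sigma_2^2/2)\{1 - \Phi(-x_1'\beta_1 - \sigma_{12})\}$ 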


19.8.1 Predictions from tobit in logs 


We begin by predicting $E(y|x)$ and $E(y|x, y > 0)$ for the tobit model in logs. 


. * Prediction from tobit regression with dependent variable lny 
. qui use mus219mepsambexp, clear 

. scalar gamma = -0.0000001 

. qui tobit lny $xlist, ll(gamma) vce(robust) 

. predict xb, xb // xb is estimate of x'b 

. matrix btobit = e(b) 

. scalar sigma = sqrt(btobit[1,e(df_m)+2]) // sigma is estimate of sigma 

. generate threshold = (gamma-xb)/sigma // gamma: lower censoring point 

. generate yhat = exp(xb+0.5*sigma^2)*(1-normal((gamma-xb-sigma^2)/sigma)) 

. generate ytrunchat = yhat / (1 - normal(threshold)) if dy==1 
(526 missing values generated) 


. summarize y yhat 

Variable Obs Mean Std. dev. Min Max 

y 3,328 1386.519 2530.406 0 49960 
yhat 3,328 45805.92 273444.7 133.9768 1.09e+07 

. summarize y yhat ytrunchat if dy==1 

Variable Obs Mean Std. dev. Min Max 

y 2,802 1646.8 2678.914 1 49960 
yhat 2,802 53271.51 297386.4 283.4537 1.09e+07 
ytrunchat 2,802 53536.85 297376.6 383.6245 1.09e+07 


The estimates, denoted by yhat and ytrunchat, confirm that these 
predictors are very poor. Mean expenditure is overpredicted in both cases 
and more so in the censored case. The reported results reflect the high 
sensitivity to estimates of $\sigma^2$, as $\exp(\sigma^2/2)$ appears multiplicatively in 
table 19.3. The tobit model has $\hat{\sigma}^2 = 7.735$, whereas from section 19.5.3 the 
two-part model has a much, much lower $\hat{\sigma}^2 = 1.2696^2 = 1.612$. 


19.8.2 Predictions from two-part model in logs 


Predictions of $E(y_2|x)$ and $E(y_2|x, y_2 > 0)$ from the two-part model are 
considerably better but still biased. We first transform the fitted log values 
from the conditional part of the two-part model, assuming normality. 


. * Two-part model predictions 
. qui probit dy $xlist 

. predict dyhat, pr 

. qui regress lny $xlist if dy==1 

. predict xbpos, xb 

. generate yhatpos = exp(xbpos+0.5*e(rmse)^2) 


Next, we generate an estimate of the unconditional values, denoted by 
yhat2step, by multiplying by the fitted probability of the positive 
expenditure dyhat from the probit regression. 


. * Unconditional prediction from two-part model 
. generate yhat2step = dyhat*yhatpos 

. summarize yhat2step y 

Variable Obs Mean Std. dev. Min Max 

yhat2step 3,328 1680.978 2012.084 87.29432 40289.03 
y 3,328 1386.519 2530.406 0 49960 

. summarize yhatpos y if dy==1 

Variable Obs Mean Std. dev. Min Max 

yhatpos 2,802 1995.981 2087.072 430.8354 40289.03 
y 2,802 1646.8 2678.914 1 49960 


The means of the predicted values are considerably closer to the sample 
averages than are those for the tobit estimator, confirming the greater 
robustness of the two-part model. 


19.8.3 Predictions from selection model 


Finally, we predict E(y2|x) and E(y2|x, y2 > 0) for the selection model. 


. * Heckman model predictions 
. qui heckman lny $xlist, select(dy = $xlist) vce(robust) 

. predict probpos, psel 

. predict x1b1, xbsel 

. predict x2b2, xb 

. scalar sig2sq = e(sigma)^2 

. scalar sig12sq = e(rho)*e(sigma)^2 

. display "sigma1sq = 1" " sigma12sq = " sig12sq " sigma2sq = " sig2sq 
sigma1sq = 1 sigma12sq = -.20055906 sigma2sq = 1.6147766 

. generate yhatheck = exp(x2b2 + 0.5*(sig2sq))*(1 - normal(-x1b1-sig12sq)) 

. generate yhatposheck = yhatheck/probpos 

. summarize yhatheck y probpos dy 

Variable Obs Mean Std. dev. Min Max 
yhatheck 3,328 1659.802 1937.095 74.32413 37130.18 
y 3,328 1386.519 2530.406 0 49960 
probpos 3,328 .8415738 .1411497 .2029135 1 
dy 3,328 .8419471 .3648454 0 1 

. summarize yhatposheck probpos dy y if dy==1 

Variable Obs Mean Std. dev. Min Max 
yhatposheck 2,802 1970.923 2003.406 389.4755 37130.18 
probpos 2,802 .8661997 .1237323 .2867923 1 
dy 2,802 1 0 1 1 
y 2,802 1646.8 2678.914 1 49960 


The predictions from the selection model, denoted by yhatheck, are close to 
those from the two-part model. The main difference from the two-part model 
comes from introducing correlation in the errors, and here the correlation of 
-0.124 is low. 


The poor prediction performance of the original tobit model confirms the 
earlier conclusions about its unsuitability for modeling the current dataset. 


19.9 Endogenous regressors 


The preceding analysis applies when regressors are exogenous. When 
instead some regressors are endogenous, standard estimators such as the 
tobit MLE become inconsistent, and alternative estimators are needed. The 
following methods that control for endogeneity rely on very strong 
distributional assumptions. 


19.9.1 Tobit model with endogenous regressors 


Endogenous regressors in the tobit model can be accommodated by using a 
structural model approach that is similar to that for the probit model 
presented in section 17.9. Cameron and Trivedi (2005, chap. 16.8) and 
especially Wooldridge (2010, chap. 17.5) provide details. 


If there is only one endogenous regressor, then the setup involves two 
equations: the structural equation of interest and the reduced form for the 
endogenous regressor. The structural model is exactly the same as (17.9) 
and (17.10), except now we observe $y_{1i} = y_{1i}^*$ if, for example, $y_{1i}^* > 0$. The 
framework assumes that the endogenous regressor is continuous, so the 
method should not be used for a discrete endogenous variable. The reduced-form 
equation for this variable must include at least one exogenous 
instrumental variable that affects the outcome variable only through the 
endogenous regressor and so is excluded from the structural equation. Error 
terms are assumed to be jointly normally distributed. 


The ivtobit command is similar to the ivprobit command and 
computes both the MLE and the computationally simpler but less efficient 
two-step estimator of Newey (1987). The ivtobit command has options for 
ME, prediction, and variance estimation similar to those for the tobit 
command. 


If instruments are weakly correlated with the endogenous regressor, 
then the usual asymptotic theory may perform poorly. Then alternative 
methods may be used. Inference for weak instruments for the linear model 
is presented in section 7.7. The discussion there includes methods based on 
minimum distance estimation due to Magnusson (2010) that can be applied 
to a wide range of linear and nonlinear structural models. The 
community-contributed rivtest (Finlay and Magnusson 2009) and weakiv programs 
(Finlay, Magnusson, and Schaffer 2014) apply these methods following 
estimation using ivtobit. 


19.9.2 Richer models with endogenous regressors 


The eintreg and eregress commands provide ML estimates of a wide 
range of models with endogenous regressors or sample selection, or both. 
Furthermore, endogenous variables can be continuous, binary, or ordinal. 


These commands are members of the class of ERM commands for 
recursive models such as (17.9) and (17.10) with error terms that are jointly 
normally distributed; see section 23.7 for details. 


19.9.3 Endogenous-switching regression model 


The sample-selection model leads to an outcome that is observed if the latent 
selection variable $y_1^* > 0$ and is not observed otherwise. The endogenous-switching 
regression model instead specifies that one value of the outcome is 
observed if $y_1^* > 0$ and a different value is observed if $y_1^* \le 0$. For example, 
the wage may vary according to whether a worker is a union worker. 


Then we observe 


$$y_2 = x_2'\beta_2 + \varepsilon_2, \quad \text{if } y_1^* > 0$$
$$y_3 = x_3'\beta_3 + \varepsilon_3, \quad \text{if } y_1^* \le 0$$


where $\varepsilon_2$ and $\varepsilon_3$ are possibly correlated with the selection equation error 
$\varepsilon_1$. 


Several community-contributed commands fit this model, including the 
command heckmancopula. 


19.10 Missing data 


Truncated regression models and sample-selection models are examples of 
models developed to handle missing data. (Censored regression such as for 
top-coded data, by contrast, has data that are incompletely observed rather 
than missing.) 


In this section, we provide a general discussion of various ways that 
data may be missing, the consequences (if any) of this missingness, and 
how to control for missing data. Wooldridge (2010, chap. 19) covers 
various selection mechanisms and methods to control for selection, most 
notably, inverse-probability weighting (IPW) and Heckman-type selection 
models. Cameron and Trivedi (2005, chap. 27) include coverage of multiple 
imputation and the statistics literature on missing-data mechanisms. Seaman 
and White (2013) provide a useful survey of inverse-probability weighting 
and contrast it with multiple imputation. 


This section focuses on cross-sectional data, while the following section 
considers application to panel-data attrition. Missing-data methods are also 
used in the treatment evaluation chapters 24—25 and in sections 30.5—30.6. 


19.10.1 Missing-data mechanisms 


The simplest case of missing data is where the cause of missing data is 
unrelated to either observed data or missing data. This case is referred to as 
data being missing completely at random (MCAR). In this case, there is no 
problem, and we can do the usual statistical analysis on just those 
observations for which complete data are available. For example, we drop 
from the analysis any observations for which data are incomplete. The only 
downside is a loss of estimator precision due to less data. 


If data are instead not MCAR, there are many different missing-data 
mechanisms and remedies and related terminology that differs across 
various branches of statistics. 


One set of terminology breaks data that are not MCAR into either missing 
at random (MAR) or missing not at random (MNAR). Missingness is MAR if 
the cause of the missing data is unrelated to the missing data but is related 
to the observed data. Missingness is MNAR if the cause of the missing data is 
related to the missing data. 


For regression analysis, where distinction is made between endogenous 
variables and exogenous variables, a natural distinction is that between 
selection on observables only and selection on unobservables. 


Under selection on observables only, the missingness mechanism is 
purely random, after appropriate adjustment for exogenous variables. This 
is a conditional form of MAR that is also described as exogenous sample 
selection, ignorable selection (conditional on exogenous variables), or, in 
the treatment-effects literature, unconfoundedness. 


Under selection on unobservables, even after one controls for 
exogenous variables, the missingness mechanism depends on unobservables 
that are not independent of the outcome variable y. Then data are MNAR, a 
situation also described as endogenous sample selection or nonignorable 
selection (even after conditioning on exogenous variables). Much stronger 
assumptions are then needed to obtain consistent estimates given the 
selected sample. 


19.10.2 Complete-case analysis 


Consider the linear regression model $y_i = x_i'\beta + u_i$ and the OLS estimator $\hat{\beta}$ 
with first-order conditions $\sum_{i=1}^N x_i(y_i - x_i'\hat{\beta}) = 0$. The essential condition 
for consistency is that $E(x_i u_i) = 0$. A sufficient, though not necessary, 
condition for $E(x_i u_i) = 0$ is that $E(u_i|x_i) = 0$. 


Complete-case analysis, also called case deletion or listwise deletion, 
performs OLS estimation using only those observations for which complete 
data are available. 


Introduce the selection indicator $s_i$, which equals 1 if complete data on 
$(y_i, x_i)$ are available and equals 0 otherwise. The OLS estimator $\hat{\beta}_{CC}$ using 
only available data (complete-case analysis) solves the first-order 
conditions 


$$\sum_{i=1}^N s_i x_i (y_i - x_i'\hat{\beta}_{CC}) = 0$$


because only nonmissing data (with $s_i = 1$) are included in this regression. 
The key assumption for consistency of $\hat{\beta}_{CC}$ is then 


$$E(s_i x_i u_i) = 0 \tag{19.8}$$


Condition (19.8) holds under MCAR because if missingness is purely 
random, then $E(s_i x_i u_i) = E(s_i) \times E(x_i u_i) = E(s_i) \times 0 = 0$, under the 
standard OLS assumption that $E(x_i u_i) = 0$. 


Condition (19.8) also holds under a conditional mean version of MAR 
that missingness is related only to the regressors $x_i$ and not to the error term 
$u_i$. Specifically, we suppose that $E(u_i|x_i, s_i) = 0$. Then 
$E(s_i x_i u_i) = E_{x,s}\{s_i x_i E(u_i|x_i, s_i)\} = E_{x,s}(0) = 0$. A sufficient condition for 
$E(u_i|x_i, s_i) = 0$ is that 1) $E(u_i|x_i) = 0$, a stronger condition than 
$\mathrm{Cov}(x_i, u_i) = 0$; and 2) $s_i = h(x_i)$ for some function $h(\cdot)$. Then 
$E(u_i|x_i, s_i) = E\{u_i|x_i, h(x_i)\} = E(u_i|x_i) = 0$. 


This analysis carries over more generally to the usual m-estimators such 
as the MLE and to IV estimators; in the latter case, sufficient conditions are 
that $E(u_i|z_i) = 0$ and $s_i = h(z_i)$, where $z_i$ are the instruments. 


In summary, the commonly used practice of restricting analysis to only 
those observations with complete data is fine if 1) missingness is 
completely at random; or 2) missingness is related only to included 
exogenous regressors and the model is correctly specified, for example, 
$E(u_i|x_i) = 0$ in the case of OLS. 
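
A small simulation can make the two cases concrete. The sketch below uses made-up variable names and an assumed data-generating process with true intercept and slope both equal to 1. It compares complete-case OLS when missingness is a function of the regressor only with complete-case OLS when missingness depends on the outcome itself. 

```stata
* Complete-case OLS under two missingness mechanisms (simulated data)
clear
set seed 10101
set obs 20000
generate x = rnormal()
generate y = 1 + x + rnormal()        // E(u|x) = 0 holds in the full sample
generate s_x = x > -0.5               // selection s = h(x): depends on x only
generate s_y = y > 0.5                // selection depends on the outcome y
regress y x if s_x, vce(robust)       // complete cases under selection on x
regress y x if s_y, vce(robust)       // complete cases under selection on y
```

In this design the slope estimate should be close to 1 in the first regression and noticeably below 1 in the second, in line with the summary above. 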


19.10.3 Inverse probability weighting 


Complete-case analysis leads to inconsistent OLS estimation if 
$E(u_i|x_i, s_i) \neq 0$ or, equivalently, if $E(y_i|x_i, s_i) \neq x_i'\beta$. Under some 
assumptions, a weighted least-squares (LS) estimator is consistent. 


We present results for the more general case of an m-estimator $\hat{\theta}$ that 
solves 


$$\sum_{i=1}^N q(w_i, \theta) = 0$$


where $w_i = (y_i, x_i)$ and, for example, $q(w_i, \theta) = x_i(y_i - x_i'\theta)$ for OLS. 
The essential condition for consistency is that $E\{q(w_i, \theta)\} = 0$. 


The complete-case estimator $\hat{\theta}_{CC}$ solves $\sum_{i=1}^N s_i q(w_i, \theta) = 0$. The 
essential condition for consistency is that $E\{s_i q(w_i, \theta)\} = 0$. We are 
concerned with situations where this condition does not hold, because 
selection $s_i$ is not completely random or is not only determined by the 
exogenous regressors $x_i$. 


The key is to control for selection by introducing additional exogenous 
variables beyond the included regressors $x_i$. Let $z_i$ be a set of variables that 
does not include $y_i$ but can include variables in $x_i$, and let the probability 
that data are nonmissing be 


$$p(z_i) = \Pr(s_i = 1|z_i)$$


where $0 < p(z_i) < 1$. Then the IPW estimator $\hat{\theta}_{IPW}$ solves 


$$\sum_{i=1}^N \frac{s_i}{p(z_i)} q(w_i, \theta) = 0$$

For $\hat{\theta}_{IPW}$ to be consistent, we need $E\{s_i q(w_i, \theta)/p(z_i)\} = 0$. In 
general, 


$$E_{s,w,z}\{s q(w)/p(z)\} = E_{w,z}[E_{s|w,z}\{s q(w)/p(z)|w, z\}]$$
$$= E_{w,z}\{E_{s|w,z}(s|w, z) \times q(w)/p(z)\}$$
$$= E_{w,z}\{\Pr(s = 1|w, z) \times q(w)/p(z)\}$$


where the last line uses the fact that for a (0, 1) random variable, 
$E(s) = \Pr(s = 1)$. It follows that a sufficient condition is that 


$$\Pr(s_i = 1|w_i, z_i) = \Pr(s_i = 1|z_i)$$


because then 


$$E_{w,z}\{\Pr(s = 1|w, z) q(w)/p(z)\} = E_{w,z}\{p(z) q(w)/p(z)\}$$
$$= E_{w,z}\{q(w)\}$$
$$= 0$$


where the last line uses the original assumption that $E\{q(w_i, \theta)\} = 0$. 


Implementation of the IPW estimator requires the weights $p(z_i)$. Sample 
survey data may directly provide the weights $p(z_i)$; see section 6.9. 


More often, the weights need to be estimated. A common approach is to 
specify and fit a logit (or probit) model for whether $s_i = 1$. The weights 
then are $p(z_i, \hat{\alpha})$, where $\hat{\alpha}$ are logit parameter estimates. Strictly speaking, 
inference on $\hat{\theta}_{IPW}$ should then control for the first-step estimation of $\alpha$; for 
example, one could bootstrap the two-step estimator. It has been shown that 
it is okay to ignore this complication and use the usual robust standard 
errors (heteroskedasticity-robust for independent data or cluster-robust for 
clustered data) because this leads, surprisingly, to overestimation of the 
standard error of $\hat{\theta}_{IPW}$ and consequent conservative inference. 
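
The two-step procedure can be sketched with simulated data. Everything in the example below is an assumption made for illustration: the variable names, the auxiliary variable v that drives selection but is excluded from the outcome model, and the logit selection rule. The true intercept and slope are both 1, and selection depends on (x, v) only, so the selection-on-observables condition holds with z = (x, v). 

```stata
* IPW for OLS with estimated selection probabilities (simulated data)
clear
set seed 10101
set obs 20000
generate x = rnormal()
generate v = rnormal()                       // auxiliary variable, not in model for y
generate y = 1 + x + 0.8*v + rnormal()       // error is 0.8*v + noise, so E(u|x) = 0
generate s = runiform() < invlogit(0.5 + x + v)   // Pr(s=1|y,x,v) = Pr(s=1|x,v)
replace y = . if s == 0                      // y is missing when s = 0
* Complete-case OLS: selection is related to the error through v
regress y x, vce(robust)
* Step 1: logit model for the probability that data are nonmissing
logit s x v
predict double phat, pr
generate double ipw = 1/phat
* Step 2: weighted LS using inverse-probability weights
regress y x [pweight=ipw], vce(robust)
```

In this design the complete-case estimates should drift away from the true values of 1, while the weighted estimates should be close to them. The robust standard errors in the second step ignore the first-step estimation, which, as noted above, is expected to be conservative. 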


The big practical issue is determining the variables to include in $z_i$. In 
some cases, the only missing data are those on $y_i$ for some observations, 
with $x_i$ always observed. Then we let $z_i = (x_i, v_i)$, where we desire 
additional variables $v_i$ that do not belong in the original model of interest 
but do make more credible the assumption that 
$\Pr(s_i = 1|y_i, x_i, v_i) = \Pr(s_i = 1|x_i, v_i)$. For panel attrition, the pair 
$(y_{it}, x_{it})$ may be missing for some observations. Then for pooled estimators 
such as pooled OLS we need $z_{it}$ such that 
$\Pr(s_{it} = 1|y_{it}, x_{it}, z_{it}) = \Pr(s_{it} = 1|z_{it})$. Such assumptions are called 
selection on observables only; once we condition on a large enough set of 
exogenous variables, selection is no longer related to the endogenous 
variable $y$. 


19.10.4 Endogenous sample selection 


The IPW estimator expands the set of exogenous variables, so that 
conditional on this expanded set, the data satisfy the MAR assumption. 


Endogenous sample selection arises when data on some observations 
are missing in part because of values of endogenous variables even after 
controlling for exogenous variables. A simple example is truncation at zero 
where we observe $(y_i, x_i)$ only if $y_i > 0$; see section 19.3.5. 


A more complicated example is the selection model of section 19.6, 
where we may observe $y_i$ if a related variable crosses a threshold. We apply 
this model to panel data with attrition in section 19.11. Other endogenous 
sample-selection models include the endogenous-switching regression 
model of section 19.9.3. 


Endogenous selection models require much stronger parametric 
assumptions about error terms than those necessary to hold for exogenous 
selection. Thus, wherever justifiable, the assumption of selection on 
observables only is made. This allows use of methods that, conditional on 
this strong assumption, rely on fewer parametric assumptions. 


19.10.5 Imputation 


Complete-case analysis and IPW lead to complete loss of an observation 
whenever any part of $(y_i, x_i)$ is missing. When only data on some variables 
for some observations are missing, it is tempting to impute the missing data 
to create a complete observation. 


Care is needed, however. If the case-deletion estimate is inconsistent, 
then so will be estimates obtained using the simplest imputation methods. 
For example, suppose that we simply want to estimate $E(y)$, data on some 
observations are missing, and the case-deletion estimate $\bar{y}_{CC}$ is inconsistent 
for $E(y)$ because of nonrandom selection. Then replacing each missing 
observation on $y_i$ with the mean $\bar{y}_{CC}$ yields a sample average that again 
equals the inconsistent estimate $\bar{y}_{CC}$. Furthermore, even if there was no 
selection problem, the apparent increase in sample size will lead to an 
artificially lowered standard error of the estimate. 
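
Both points can be seen in a short simulation. The sketch below uses hypothetical variable names and, for simplicity, assumes values are missing completely at random, so that only the standard-error problem is at issue. 

```stata
* Mean imputation: same point estimate, artificially small standard error
clear
set seed 10101
set obs 1000
generate y = rnormal(10, 2)
generate ycc = y
replace ycc = . if runiform() < 0.4          // about 40% of values set to missing
quietly summarize ycc
generate double yimp = cond(missing(ycc), r(mean), ycc)   // fill in with the mean
mean ycc      // complete-case mean and standard error
mean yimp     // identical mean, smaller reported standard error
```

The imputed series has the same mean as the complete cases by construction, while its reported standard error is smaller because the sample size is inflated and the imputed values add no variation. 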


Abrevaya and Donald (2017) propose a GMM method to obtain more 
efficient estimates for linear regression with independent heteroskedastic 
errors when there are missing observations on exogenous or endogenous 
regressors, under a variation of MAR suitable to their application. While this 
is a quite specialized setting, it arises often in applied microeconometrics 
studies. 


Multiple-imputation methods specify a stochastic model for the missing 
data as a function of the observed data. Random draws of the missing data 
are generated, rather than a deterministic value such as a predicted mean. 
Estimation is then based on several completed datasets composed of 
nonmissing data where available and imputed data otherwise. For details, 
see section 30.5. 


19.11 Panel attrition 


We present panel attrition here rather than in the panel chapters because it 
uses methods presented in sections 19.6 and 19.10. Readers already familiar 
with this material can tackle the current section as an immediate extension of 
chapter 8. Going the other way, readers of the current chapter may need to 
revisit chapter 8 to refresh familiarity with panel commands. 


For a variety of reasons, the participants in a survey may choose to not 
participate at all (survey nonresponse) or may choose to not answer particular 
questions in a survey (item nonresponse), sometimes because the questions 
address issues that respondents may consider sensitive. The resulting sample 
is said to be unbalanced because it leads to a different number of 
observations across sampled respondents; see Baltagi and Song (2006). 
Unbalanced panel samples are quite common. 


Panel attrition refers to loss of data in successive panel surveys that 
occurs because some of the original survey participants drop out of the 
survey at various stages. In some cases, the nonrespondent may permanently 
exit from the survey; in other cases, the nonrespondent may return to the 
sample in a later period. 


This section considers the consequences of ignoring sample attrition and 
the statistical remedies available to deal with the problem. The simplest 
approach is to assume that attrition is a completely random and ignorable 
process unrelated to either observed or unobserved characteristics of the 
respondents. More complicated approaches add a regression model that 
characterizes the attrition (or selection) process as a function of observable 
characteristics of respondents and, if observables alone cannot control for 
attrition, unobservable characteristics of respondents. This leads to a two- 
equation model like the one considered in section 19.6. 


Panel data add an additional dimension that is not present in the standard 
missing-data model. In multiperiod panels, the extent of missing data may 
grow as additional waves are added. For example, for the Michigan Panel 
Study of Income Dynamics, nearly 50% of the initial sample of 1969 was lost 
through attrition by 1989. One approach to deal with this issue is to use
sample refreshment, which adds additional survey respondents in later waves
of the sample survey to adjust for the missing data from earlier waves.


We focus on implementation of methods for panel attrition, along with a 
brief exposition of the underlying theory. A careful statement of assumptions 
used for a range of corrections for panel attrition is given in 
Wooldridge (2010, chap. 19.9). 


19.11.1 Attrition empirical example 


We use data from the first three waves of the Australian Medicine in 
Australia Balancing Employment and Life survey. This survey, introduced in 
section 4.6.2, collects data on doctors’ earnings. 


Specifically, we analyze log annual earnings (lyearn) of female general
practitioners as a function of log annual hours worked, years of experience,
squared years of experience, and an indicator of hospital work, which are
variables labeled lyhrs, expr, exprsq, and hospwork, respectively. Some IPW
estimates given below use an indicator variable for self-employment
(selfempl) as an auxiliary variable. A much more complete set of regressors
for these data is used in the panel attrition study by Cheng and
Trivedi (2015).


The data are stored as one line per (individual, wave) pair. If an 
individual is not interviewed in a given wave, then no observation appears for 
that wave. Note that an alternative way that the dataset could be configured is 
to include an observation even when there is no interview. Then a 
noninterview observation will provide data on key survey-specific variables 
such as the individual identifier and the wave, but all interview-specific 
variables would be set to missing. 


The extent of attrition 


We first drop any (individual, wave) pair for which an interview occurred but 
data are missing on variables used in this analysis. The xtdescribe 
command is then used to obtain the patterns of missing data due to 
noninterview. 


. * Pattern of missing observations in panel dataset 

. qui use mus219mabelunbalsmall, clear 

. drop if lyearn==.|lyhrs==.|expr==.|exprsq==.|hospwork==.|selfempl==. 
(5 observations deleted) 


. xtset 


Panel variable: id (unbalanced) 
Time variable: wave, 1 to 3, but with gaps 
Delta: 1 unit 


. xtdescribe

      id:  1100001, 1100003, ..., 3100399                    n =       1933
    wave:  1, 2, ..., 3                                      T =          3
           Delta(wave) = 1 unit
           Span(wave)  = 3 periods
           (id*wave uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       1         2         3       3       3

     Freq.  Percent    Cum. |  Pattern
      732     37.87   37.87 |  111
      338     17.49   55.35 |  1..
      242     12.52   67.87 |  11.
      204     10.55   78.43 |  ..1
      166      8.59   87.02 |  .11
      164      8.48   95.50 |  .1.
       87      4.50  100.00 |  1.1
     1933    100.00         |  XXX


Many individuals have missing data for at least one of the three waves. In 
some cases, an individual with data missing in one wave reappears in the 
next wave. And in some cases, individuals enter the sample after the first 
wave because the survey included refreshment sampling. In the survey, 732 
individuals have data available for all three waves, 495 (= 242 + 166 + 87) 
have data for 2 waves, and 706 (= 338 + 204 + 164) have data for 1 wave. 
In total, there are 3,892 (= 3 × 732 + 2 × 495 + 1 × 706) observations on
1,933 individuals.


The following gives summary statistics on the key variables. 


. * Summarize the data 
. summarize, sep(0) 


    Variable |        Obs        Mean    Std. dev.        Min        Max
          id |      3,892     1263677    454508.2     1100001    3100399
        wave |      3,892    1.946043    .8137644           1          3
       yearn |      3,892    134626.3    88295.98        1000    1002000
        yhrs |      3,892    1684.091    711.9675          52       5200
      female |      3,892           1           0           1          1
        expr |      3,892    18.43358    9.836534           0         58
    hospwork |      3,892    .1682939    .3741752           0          1
    selfempl |      3,892    .2546249    .4357061           0          1
      lyearn |      3,892    11.62464    .6317774    6.907755   13.81751
       lyhrs |      3,892    7.317732    .5202118    3.951244   8.556414
      exprsq |      3,892    436.5295     403.556           0       3364


We create selection indicators s1, s2, and s3 that record, on every
observation for an individual, the waves in which that individual appears in
the dataset. For example, an individual with missing pattern 1.1 has
s1=1, s2=0, and s3=1 in both the wave==1 and wave==3 observations, while
there is no observation in the dataset for wave==2.


. * Selection indicators for each (id,wave) indicate which waves are available
. generate s1 = 0

. qui replace s1 = 1 if wave==1

. qui replace s1 = 1 if (wave==2 & !missing(L.lyearn))

. qui replace s1 = 1 if (wave==3 & !missing(L2.lyearn))

. generate s2 = 0

. qui replace s2 = 1 if wave==2

. qui replace s2 = 1 if (wave==1 & !missing(F.lyearn))

. qui replace s2 = 1 if (wave==3 & !missing(L.lyearn))

. generate s3 = 0

. qui replace s3 = 1 if wave==3

. qui replace s3 = 1 if (wave==1 & !missing(F2.lyearn))

. qui replace s3 = 1 if (wave==2 & !missing(F.lyearn))

. generate balanced = (s1==1 & s2==1 & s3==1)

. save mus219mabelfinal, replace
(file mus219mabelfinal.dta not found)
file mus219mabelfinal.dta saved


mus219mabelfinal.dta with these indicators is saved for subsequent 
analysis. It includes the indicator balanced, which equals one for individuals 
with data available in all three waves. 
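The same kind of indicators can be built more compactly by taking, within each individual, the maximum of a wave dummy. The following is a minimal sketch on a made-up six-observation panel; the ids and earnings values are purely illustrative and are not taken from the MABEL data. It uses only egen's max() function with the by() option, so it does not rely on lag and lead operators or on the data being xtset.

```stata
* Toy unbalanced panel (hypothetical values):
*   id 1 has pattern 111, id 2 has pattern 1.1, id 3 has pattern .1.
clear
input long id byte wave double lyearn
1 1 11.2
1 2 11.3
1 3 11.4
2 1 10.9
2 3 11.0
3 2 11.6
end

* s`w' = 1 on every row of an individual who has an observation in wave `w'
forvalues w = 1/3 {
    egen byte s`w' = max(wave == `w'), by(id)
}

* Individuals observed in all three waves
generate byte balanced = (s1 == 1 & s2 == 1 & s3 == 1)

list, sepby(id) noobs
```

On these toy data, only id 1 should be flagged as balanced. Like the code in the text, this treats an individual as present in a wave whenever a row exists for that wave, so rows with missing analysis variables should be dropped first.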


As an example, the following tabulates counts of s2 for each of the three 
waves. 


. * Count of how many wave 2 also available in wave 1 and wave 3
. tabulate s2 wave

           |           Survey wave
        s2 |         1          2          3 |     Total
         0 |       425          0        291 |       716
         1 |       974      1,304        898 |     3,176
     Total |     1,399      1,304      1,189 |     3,892


In the survey, 974 of the 1,399 present in wave 1 were also present in wave 2, 
and 898 of the 1,189 present in wave 3 were also present in wave 2. 


For those present at wave 1, we compare the wave 1 earnings of those
who were still present in wave 3 (s3==1) with the wave 1 earnings of those
who were not (s3==0). This compares 819 individuals with pattern 111 or
pattern 1.1 to 580 individuals with missing pattern 1.. or 11. .


. * Compare wave 1 earnings by wave 3 attrition status
. capture drop attrit*

. twoway (kdensity lyearn if (wave==1 & s3==1), clstyle(p1))
>     (kdensity lyearn if (wave==1 & s3==0), clstyle(p2)),
>     title("Wave 1 earnings by wave 3 attrition status")
>     legend(label(1 "No attrition") label(2 "Attrition"))
>     legend(pos(10) ring(0) col(1))


The first panel of figure 19.1 shows little difference in wave 1 earnings by 
attrition status in wave 3. 
Figure 19.1. Log earnings by attrition status for wave 1 and for wave 3


We next compare, for those in the original wave 1 sample, the wave 3 
earnings for the 732 individuals in wave 2 with the 87 individuals who 
missed wave 2. 


. * Compare wave 3 earnings by whether missed wave 2
. generate wave2skip = .
(3,892 missing values generated)

. qui replace wave2skip = 1 if (s1==1 & s2==0)

. qui replace wave2skip = 0 if (s1==1 & s2==1)

. twoway (kdensity lyearn if (wave==3 & wave2skip==0), clstyle(p1))
>     (kdensity lyearn if (wave==3 & wave2skip==1), clstyle(p2)),
>     title("Wave 3 earnings by missing wave 2")
>     legend(label(1 "No attrition") label(2 "Attrition"))
>     legend(pos(10) ring(0) col(1))


The second panel of figure 19.1 shows that the wave 3 log-earnings 
distribution for those who skipped wave 2 appears to be around 0.10 to the 
left of that for those who did not, corresponding to a 10% lower level of 
earnings. 


We perform a difference in means test. 


. * Difference in means test of wave 3 earnings by whether missed wave 2 
. ttest lyearn if wave==3, by(wave2skip) unequal 


Two-sample t test with unequal variances

   Group |    Obs        Mean   Std. err.   Std. dev.   [95% conf. interval]
       0 |    732    11.67776    .0229297    .6203751    11.63274    11.72278
       1 |     87    11.53076    .0693067    .6464502    11.39299    11.66854
Combined |    819    11.66215    .0218196    .6244371    11.61932    11.70497
    diff |           .1469954    .0730013                .0022585    .2917322

    diff = mean(0) - mean(1)                                      t =   2.0136
H0: diff = 0                     Satterthwaite's degrees of freedom =  105.708

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9767         Pr(|T| > |t|) = 0.0466          Pr(T > t) = 0.0233


The difference in means of 0.147 is statistically significant at level 0.05. The
command regress lyearn wave2skip if wave==3, vce(robust) leads to
equivalent results, aside from a slight difference in the degrees-of-freedom
correction.
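The correspondence between an unequal-variance t test and a regression on a group dummy can be checked on any dataset. The sketch below uses the auto dataset that ships with Stata rather than the MABEL data, so the variables price and foreign are stand-ins chosen only for illustration.

```stata
* Illustration on Stata's shipped auto dataset (not the MABEL data)
sysuse auto, clear

* Unequal-variance t test; ttest reports mean(group 0) - mean(group 1)
ttest price, by(foreign) unequal

* Same comparison as a regression on the group dummy. The slope equals
* mean(group 1) - mean(group 0), so its sign is reversed relative to ttest
regress price foreign, vce(robust)

* With a single binary regressor, the HC2 standard error of the slope
* coincides with the unequal-variance standard error of the difference;
* the degrees of freedom used for inference still differ
regress price foreign, vce(hc2)
```

The default vce(robust) applies a different small-sample scaling, which is the source of the slight discrepancy noted in the text.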


19.11.2 Unbalanced and balanced panel analysis 


The most common procedure when there is panel attrition is to assume that 
attrition does not affect the consistency of the usual estimators. This requires 
the exogenous sampling assumption that data are MAR conditional on 
regressors. In the case of pooled OLS, for example, we assume that
$E(u_{it}|\mathbf{x}_{it}) = 0$ and that $s_{it} = h(\mathbf{x}_{it})$ for some function $h(\cdot)$.


In this situation, we can use either the original unbalanced sample or the 
balanced sample. The unbalanced sample has the advantage of having more 
observations. The balanced sample can have the advantage that some 
estimation methods may apply only to balanced samples, though the basic 
panel commands can be implemented on either unbalanced or balanced 
samples. 


Creating a balanced dataset 


One way to create a balanced dataset is to drop any individual who is not 
observed in all three waves. We earlier defined variable balanced = (s1==1 
& s2==1 & s3==1). 


. * Manually create balanced dataset
. keep if balanced == 1
(1,696 observations deleted)

. xtdescribe

      id:  1100001, 1100003, ..., 1103899                    n =        732
    wave:  1, 2, ..., 3                                      T =          3
           Delta(wave) = 1 unit
           Span(wave)  = 3 periods
           (id*wave uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         3       3       3         3         3       3       3

     Freq.  Percent    Cum. |  Pattern
      732    100.00  100.00 |  111
      732    100.00         |  XXX


Of the original 3,892 observations, 1,696 were dropped, leaving 2,196
(= 3,892 - 1,696) observations on 732 (= 2,196/3) individuals.


An alternative is to use the community-contributed xtbalance command
(Yujun 2009). This command has a required option range() that gives the
range of the time identifier to be used. Here we want to use wave equal to 1,
2, and 3, so we use option range(1 3). Again reading in the original dataset,
we use the miss() option to additionally drop observations that are missing
data on those variables to be used in the analysis below.


. * xtbalance command creates a balanced dataset
. qui use mus219mabelunbalsmall, clear

. global xlist lyhrs expr exprsq hospwork

. xtbalance, range(1 3) miss(lyearn $xlist)
(5 observations deleted due to missing)
(1696 observations deleted due to discontinues)

. xtdescribe

      id:  1100001, 1100003, ..., 1103899                    n =        732
    wave:  1, 2, ..., 3                                      T =          3
           Delta(wave) = 1 unit
           Span(wave)  = 3 periods
           (id*wave uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         3       3       3         3         3       3       3

     Freq.  Percent    Cum. |  Pattern
      732    100.00  100.00 |  111
      732    100.00         |  XXX


OLS, random-effects, and fixed-effects estimates


We obtain OLS, random-effects (RE), and fixed-effects (FE) estimators for 
unbalanced and balanced samples. 


. * OLS, RE, and FE regressions for unbalanced sample
. * and for balanced sample
. qui use mus219mabelfinal, clear

. qui regress lyearn $xlist, vce(cluster id)                // Unbalanced sample

. estimates store OLS_Unbal

. qui xtreg lyearn $xlist, re vce(robust)

. estimates store RE_Unbal

. qui xtreg lyearn $xlist, fe vce(robust)

. estimates store FE_Unbal

. qui regress lyearn $xlist if balanced==1, vce(cluster id) // Balanced sample

. estimates store OLS_Bal

. qui xtreg lyearn $xlist if balanced==1, re vce(robust)

. estimates store RE_Bal

. qui xtreg lyearn $xlist if balanced==1, fe vce(robust)

. estimates store FE_Bal


The following table compares the results. 


. * Table comparing OLS, RE, and FE results for balanced 

. * and unbalanced data 

. estimates table OLS_Unbal OLS_Bal RE_Unbal RE_Bal FE_Unbal FE_Bal,
>     b(%7.3f) se stats(N r2)

    Variable | OLS_U~l   OLS_Bal   RE_Un~l    RE_Bal   FE_Un~l    FE_Bal
       lyhrs |   0.758     0.832     0.609     0.617     0.294     0.305
             |   0.027     0.037     0.031     0.041     0.040     0.052
        expr |   0.013    -0.003     0.016    -0.000    -0.142    -0.133
             |   0.003     0.005     0.003     0.006     0.033     0.039
      exprsq |  -0.000     0.000    -0.000    -0.000     0.002     0.002
             |   0.000     0.000     0.000     0.000     0.001     0.001
    hospwork |   0.044     0.063     0.023     0.058    -0.006     0.028
             |   0.027     0.040     0.024     0.033     0.038     0.042
       _cons |   5.958     5.598     7.018     7.132    11.116    11.152
             |   0.198     0.261     0.225     0.296     0.401     0.546
           N |    3892      2196      3892      2196      3892      2196
          r2 |   0.405     0.435                         0.079     0.081

                                                            Legend: b/se


Despite the large difference in the number of observations in unbalanced 
versus balanced datasets (3,892 versus 2,196), for a given estimator, there is 
little difference in the coefficient of the highly statistically significant 
regressor lyhrs. Much more pronounced is the difference across the OLS, RE, 
and FE estimates. As expected, the unbalanced estimates are more precise 
because they are based on more observations. 


First-difference estimates 


The individual-specific FE model has the attraction that additionally 
conditioning on the fixed effect makes the conditional MAR assumption more 
reasonable. 


Given $y_{it} = \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}$, we assume that $E(u_{it}|\alpha_i, \mathbf{x}_{it}) = 0$ and that
$s_{it} = h(\mathbf{x}_{it})$ for some function $h(\cdot)$. So we assume that after controlling for a
time-invariant individual-specific fixed effect, attrition depends only on the
included regressors. Fixed-effects or first-difference estimators then eliminate
these individual-specific fixed effects.


The first-difference estimates are obtained as follows: 


. * Compare first-difference results for balanced and unbalanced data 
. global FDxlist D.lyhrs D.expr D.exprsq D.hospwork 


. sort id wave 

. qui regress D.lyearn $FDxlist, cluster(id) 

. estimates store FD_Unbal 

. qui regress D.lyearn $FDxlist if balanced==1, cluster(id) // Balanced sample 


. estimates store FD_Bal 


. estimates table FD_Unbal FD_Bal, b(%7.4f) se stats(N r2) 


    Variable | FD_Un~l    FD_Bal
       lyhrs |
         D1. |  0.2766    0.3034
             |  0.0440    0.0562
        expr |
         D1. | -0.0970   -0.1142
             |  0.0372    0.0455
      exprsq |
         D1. |  0.0021    0.0020
             |  0.0008    0.0010
    hospwork |
         D1. |  0.0051    0.0316
             |  0.0421    0.0453
       _cons |  0.0321    0.0171
             |  0.0135    0.0161
           N |    1872      1464
          r2 |  0.0600    0.0676

                    Legend: b/se


The results are similar across the two samples and are similar to the 
preceding FE estimates. For the larger unbalanced dataset, the elasticity of 
annual earnings with respect to hours is 0.277 and is very precisely estimated 
with t = 6.29. The experience variables are jointly statistically significant at 
level 0.05. 


19.11.3 Inverse-probability weighting 


We now suppose that conditioning on regressors included in the model is 
insufficient to control for bias induced by attrition. 


The IPW method overcomes this by adding additional variables as controls
for selection. For brevity, we illustrate the IPW method for OLS estimation in
first differences. Analysis for OLS estimation in levels, possible when it is
reasonable to use a model without individual-specific effects, is qualitatively
similar.


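Given $y_{it} = \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta} + \mathbf{x}_{2i}'\boldsymbol{\delta} + u_{it}$, where $\mathbf{x}_{it}$ are time-varying
variables and $\mathbf{x}_{2i}$ are time-invariant variables, first differencing yields

$$\Delta y_{it} = \Delta\mathbf{x}_{it}'\boldsymbol{\beta} + \Delta u_{it}$$

Note that the time-invariant variables have disappeared.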


Let the selection indicator $s_{it} = 1$ if data on individual $i$ are available at
time $t$ and $s_{it} = 0$ otherwise. Then data for the first-difference regression in
time $t$ are available if $s_{it} = 1$ and $s_{i,t-1} = 1$. The simplest approach is to
assume that attrition in any period $t$ depends only on base period variables
$\mathbf{z}_{i1}$; later analysis relaxes this specification. We assume that these variables
alone explain any attrition, so that

$$\Pr(s_{it} = 1|\mathbf{z}_{i1}, s_{i,t-1} = 1, \Delta y_{it}, \Delta\mathbf{x}_{it}) = \Pr(s_{it} = 1|\mathbf{z}_{i1}) = \Phi(\mathbf{z}_{i1}'\boldsymbol{\gamma}_t) \qquad (19.9)$$


We specify a probit model because the sample-selection model presented 
later uses the probit. The logit model is more commonly used for IPW; in
practice, the weights obtained using either logit or probit will be similar.


The key decision is what variables to include in $\mathbf{z}_{i1}$. It can include at least
$\mathbf{x}_{it}$ at $t = 1$ and the time-invariant regressors ($\mathbf{x}_{2i}$) that were first-differenced
out.


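Given estimates $\hat{\boldsymbol{\gamma}}_t$ of the probit model, the IPW estimator solves the
weighted LS first-order conditions

$$\sum_{i=1}^{N}\sum_{t=2}^{T} \frac{s_{it}}{\Phi(\mathbf{z}_{i1}'\hat{\boldsymbol{\gamma}}_t)}\, \Delta\mathbf{x}_{it}\,(\Delta y_{it} - \Delta\mathbf{x}_{it}'\boldsymbol{\beta}) = \mathbf{0}$$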

This can be implemented in Stata using the regress command with
dependent variable $\Delta y_{it}$, regressors $\Delta\mathbf{x}_{it}$, and command qualifier
[pweight=weight], where weight equals $s_{it}/\Phi(\mathbf{z}_{i1}'\hat{\boldsymbol{\gamma}}_t)$. This weight equals 0
for missing observations and $1/\Phi(\mathbf{z}_{i1}'\hat{\boldsymbol{\gamma}}_t)$ for nonmissing observations.
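The mechanics of this weighting can be sketched on simulated data. Everything in the block below is an assumption made for illustration: the variable names (z, dx, dy, s, phat, ipw), the sample size, and the data-generating process, in which selection depends only on the observable z. It is not the MABEL analysis, and because selection here is unrelated to the outcome error, the weighted and unweighted estimates should both be close to the true slope of 0.3.

```stata
* Simulated cross-section of first differences (hypothetical design)
clear
set seed 12345
set obs 2000
generate double z  = rnormal()                 // base-period selection variable
generate double dx = rnormal()                 // differenced regressor
generate double dy = 0.3*dx + rnormal()        // differenced outcome, true slope 0.3

* Selection indicator: observed when s = 1; depends on z only
generate byte s = (0.5 + 0.4*z + rnormal() > 0)
replace dy = . if s == 0

* Step 1: probit for selection and predicted selection probability
probit s z
predict double phat, pr

* Step 2: weight = s/Phi(z'gamma-hat), then weighted LS on the selected sample
generate double ipw = s/phat
regress dy dx [pweight = ipw] if s == 1, vce(robust)
```

The if s == 1 qualifier is redundant given that the weight is 0 for unselected observations, but it makes the estimation sample explicit.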


Filled-in panel dataset 


The current dataset includes only observations with $s_{it} = 1$. The subsequent
analysis requires a balanced dataset with all three waves per individual; in
waves where the individual was not interviewed, all variables will then be set
to missing.


To do so, we use the fillin command, which adds observations with
missing data so that the dataset includes all interactions of the variables
provided as arguments of the command, here id and wave. We include a
listing of some observations to make clear that the observations added have
missing values for all variables but id and wave. We also redefine the
selection indicators s1, s2, and s3.


. * Expand dataset to include missing waves 2 and 3 given present in wave 1
. qui use mus219mabelfinal, clear

. fillin id wave

. sort id wave

. list id wave s1 s2 s3 lyearn _fillin in 13/24, clean

            id   wave   s1   s2   s3     lyearn   _fillin
 13.   1100014      1    1    1    1   10.81978         0
 14.   1100014      2    1    1    1   10.59663         0
 15.   1100014      3    1    1    1   10.50507         0
 16.   1100015      1    1    0    0   11.60824         0
 17.   1100015      2    .    .    .          .         1
 18.   1100015      3    .    .    .          .         1
 19.   1100016      1    1    0    0   12.25486         0
 20.   1100016      2    .    .    .          .         1
 21.   1100016      3    .    .    .          .         1
 22.   1100018      1    1    1    0   11.28978         0
 23.   1100018      2    1    1    0   11.28978         0
 24.   1100018      3    .    .    .          .         1

. generate swave = s1
(1,907 missing values generated)

. replace swave = . if s1==0
(700 real changes made, 700 to missing)

. drop s1 s2 s3

. generate s1 = 0
. qui replace s1 = 1 if (wave==1 & !missing(lyearn))
. qui replace s1 = 1 if (wave==2 & !missing(L.lyearn))
. qui replace s1 = 1 if (wave==3 & !missing(L2.lyearn))
. generate s2 = 0
. qui replace s2 = 1 if (wave==2 & !missing(lyearn))
. qui replace s2 = 1 if (wave==1 & !missing(F.lyearn))
. qui replace s2 = 1 if (wave==3 & !missing(L.lyearn))
. generate s3 = 0
. qui replace s3 = 1 if (wave==3 & !missing(lyearn))
. qui replace s3 = 1 if (wave==1 & !missing(F2.lyearn))
. qui replace s3 = 1 if (wave==2 & !missing(F.lyearn))

. sum id wave s1 s2 s3 lyearn swave, sep(0)

    Variable |        Obs        Mean    Std. dev.        Min        Max
          id |      5,799     1373326    585159.2     1100001    3100399
        wave |      5,799           2     .816567           1          3
          s1 |      5,799    .7237455    .4471828           0          1
          s2 |      5,799    .6745991    .4685649           0          1
          s3 |      5,799    .6151061    .4866122           0          1
      lyearn |      3,892    11.62464    .6317774    6.907755   13.81751
       swave |      3,192           1           0           1          1

The added observations have missing values for all variables aside from id 
and wave. The dataset is now a square dataset with 5,799 observations on 
1,933 individuals (= 5799/3), but nonmissing data are available for only 
3,892 observations. The s3 variable, for example, indicates that there are 
0.6151 x 5799/3 = 1189 observations with nonmissing wave-3 data. 


IPW estimates with selection based on wave-1 data 


We first consider estimation of the first-difference equation for log-earnings 
using only wave-2 data. 


To control for attrition, we fit a probit selection model for $s_{i2} = 1$, where
we use as $\mathbf{z}_{i1}$ the value of the level of the log-earnings equation variables in
wave 1. We have


. * IPW: (1) Probit selection equation for wave 2 given in wave 1
. global IPWxlist L.lyhrs L.expr L.exprsq L.hospwork

. probit s2 $IPWxlist if wave==2

Iteration 0:  log likelihood = -859.04215
Iteration 1:  log likelihood = -852.88607
Iteration 2:  log likelihood = -852.88502
Iteration 3:  log likelihood = -852.88502

Probit regression                                  Number of obs =  1,399
                                                   LR chi2(4)    =  12.31
                                                   Prob > chi2   = 0.0152
Log likelihood = -852.88502                        Pseudo R2     = 0.0072

          s2 | Coefficient  Std. err.      z    P>|z|    [95% conf. interval]
       lyhrs |
         L1. |   .0276565    .067977     0.41   0.684    -.1055759     .160889
        expr |
         L1. |   .0356527   .0133619     2.67   0.008     .0094638    .0618416
      exprsq |
         L1. |  -.0006163   .0003051    -2.02   0.043    -.0012143   -.0000183
    hospwork |
         L1. |  -.0429673   .0960163    -0.45   0.655    -.2311558    .1452212
       _cons |  -.0872095     .49163    -0.18   0.859    -1.050787    .8763676

. qui predict s2prob if e(sample)==1

. tabulate s2 if e(sample)==1

         s2 |      Freq.     Percent        Cum.
          0 |        425       30.38       30.38
          1 |        974       69.62      100.00
      Total |      1,399      100.00

. tabstat s2prob if e(sample)==1, stats(count min p10 p50 p90 max)

    Variable |         N       Min       p10       p50       p90       Max
      s2prob |      1399  .5250311  .6233922   .714309  .7368833  .7437167


This estimate is based on 1,399 individuals, 974 individuals observed in both
waves 1 and 2 (so wave-2 first-differences data are available), and 425
present in wave 1 but not in wave 2. The predicted probabilities range from
0.525 to 0.744. The probability of continuing to the wave-2 survey at first
increases with years of experience and then decreases after 29 years
[= 0.0356/(2 × 0.000616)].
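The turning point quoted above comes from maximizing the probit index with respect to experience, holding the other regressors fixed. In the worked step below, $\hat\gamma_1$ and $\hat\gamma_2$ are labels introduced here for the reported probit coefficients on lagged expr and lagged exprsq; because $\Phi(\cdot)$ is increasing, the selection probability peaks where the index does.

```latex
\frac{\partial}{\partial\,\mathrm{expr}}
  \left(\hat\gamma_1\,\mathrm{expr} + \hat\gamma_2\,\mathrm{expr}^2\right)
  = \hat\gamma_1 + 2\hat\gamma_2\,\mathrm{expr} = 0
\quad\Longrightarrow\quad
\mathrm{expr}^{*} = -\frac{\hat\gamma_1}{2\hat\gamma_2}
  = \frac{0.0356527}{2 \times 0.0006163} \approx 28.9
```

Since $\hat\gamma_2 < 0$, this stationary point is a maximum.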


We then obtain the IPW estimates by weighted OLS regression for the 974 
individuals observed in both waves | and 2. 


. * IPW: (2) Generate weights and do weighted and unweighted first-difference OLS
. qui generate s2weight = s2/s2prob if wave==2

. tabstat s2weight if wave==2 & s2==1, stats(count min p10 p50 p90 max)

    Variable |         N       Min       p10       p50       p90       Max
    s2weight |       974  1.344598  1.356602  1.397984  1.590194  1.824038

. regress D.lyearn $FDxlist [pweight=s2weight] if wave==2, vce(robust) // Weighted
(sum of wgt is 1,398.77602756023)

Linear regression                               Number of obs     =        974
                                                F(4, 969)         =      11.33
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0764
                                                Root MSE          =     .43776

             |               Robust
    D.lyearn | Coefficient  std. err.      t    P>|t|    [95% conf. interval]
       lyhrs |
         D1. |   .3036735   .0499019     6.09   0.000     .2057453    .4016017
        expr |
         D1. |   -.079066   .0930013    -0.85   0.395    -.2615731    .1034412
      exprsq |
         D1. |   .0022219   .0008784     2.53   0.012     .0004981    .0039457
    hospwork |
         D1. |   .0528654   .0589415     0.90   0.370    -.0628022     .168533
       _cons |   .0617818   .0846183     0.73   0.465    -.1042745     .227838

. estimates store IPW2


The weights range from 1.345 to 1.824. Observations are multiplied by the
square root of these weights, so there is at most a 16% difference (because
$\sqrt{1.824/1.345} = 1.16$) in the weights on any single observation, a modest
difference. The estimates are similar to the unweighted first-difference OLS
estimates obtained previously.


The following code repeats the exercise for OLS estimation of the first-difference
equation using only wave-3 data. We fit a probit selection model
for $s_{i3} = 1$, where we again use as $\mathbf{z}_{i1}$ the value of variables in wave 1. It is
important that the probit model estimation be restricted to respondents with
data also available in wave 2 because subsequent weighted first-difference LS
estimation in wave 3 requires wave-2 data. We have


. * IPW: Now do wave 3 with selection on wave 1 variables
. global IPWxlist3 L2.lyhrs L2.expr L2.exprsq L2.hospwork

. qui probit s3 $IPWxlist3 if wave==3 & s2==1

. qui predict s3prob if e(sample)

. tabulate s3 if e(sample)==1

         s3 |      Freq.     Percent        Cum.
          0 |        242       24.85       24.85
          1 |        732       75.15      100.00
      Total |        974      100.00

. qui generate s3weight = s3/s3prob if wave==3

. tabstat s3prob s3weight if e(sample)==1, stats(count min p10 p50 p90 max) col(s)

    Variable |         N       Min       p10       p50       p90       Max
      s3prob |       974  .4396037  .6424823  .7758189  .8131418  .9041119
    s3weight |       974         0         0  1.259265  1.454074  2.100374

. qui regress D.lyearn $FDxlist [pweight=s3weight] if wave==3, vce(robust)

. estimates store IPW3


The probit estimates are based on 974 individuals, 732 observed in waves 2 
and 3 and 242 observed in wave 2 but not wave 3. The final weighted 
estimates use the 732 observed in waves 2 and 3. The estimates will be 
presented in a table below. 


We then combine the two waves and do pooled weighted LS estimation of
the first-difference equation using waves 2 and 3, using the previously
obtained weights. We have


. * IPW: Estimate waves 2 and waves 3 together
. qui generate sweight = .

. qui replace sweight = s2weight if wave==2

. qui replace sweight = s3weight if wave==3

. tabstat sweight if sweight!=0, stats(count min p10 p50 p90 max) col(s)

    Variable |         N       Min       p10       p50       p90       Max
     sweight |      1706  1.106058  1.248824  1.372499  1.555892  2.100374

. qui regress D.lyearn $FDxlist [pweight=sweight], vce(robust)

. estimates store IPWboth


The estimates will be presented in a table below. 


IPW estimates with selection based on time-varying data 


The previous analysis uses only data available at wave 1 to estimate the
selection probabilities. We could instead use more recent data as they become
available.


Thus, in the probit selection equation for wave 3, we could use wave-2
data. In that case, subsequent OLS estimation should use a weight of
$s_{i3}/\{\Phi(\mathbf{z}_{i1}'\hat{\boldsymbol{\gamma}}_2) \times \Phi(\mathbf{z}_{i2}'\hat{\boldsymbol{\gamma}}_3)\}$. This requires only a minor change to the code
given previously. It does require stronger assumptions to ensure that the
IPW estimator is consistent, however; see Wooldridge (2010, 842).
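The product form of this weight can be sketched on simulated data. The block below is only an illustration under assumed names and an assumed design: z1 and z2 are hypothetical wave-1 and wave-2 covariates, attrition is absorbing (nobody returns after leaving), and the coefficients are arbitrary. It is not the code used for the MABEL estimates.

```stata
* Simulated three-wave attrition with selection on the most recent covariate
clear
set seed 2022
set obs 1500
generate long id = _n
generate double z1 = rnormal()                     // wave-1 covariate
generate double z2 = 0.5*z1 + rnormal()            // wave-2 covariate

* Present in wave 2 given wave 1; present in wave 3 only if present in wave 2
generate byte s2 = (0.6 + 0.3*z1 + rnormal() > 0)
generate byte s3 = s2*(0.6 + 0.3*z2 + rnormal() > 0)

* Wave-2 selection probability from wave-1 data
probit s2 z1
predict double p2, pr

* Wave-3 selection probability from wave-2 data, among wave-2 respondents
probit s3 z2 if s2 == 1
predict double p3 if s2 == 1, pr

* Wave-3 weight: s3 divided by the product of the two fitted probabilities
generate double w3 = s3/(p2*p3) if s2 == 1
summarize w3 if s3 == 1
```

The second probit is fit only for those still present at wave 2, because z2 is unobserved for individuals who have already left.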


IPW based on auxiliary regressors 


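A separate issue is that the IPW estimator places high weight on observations
with the probability that $s_{it} = 1$ predicted to be close to zero. This can
potentially generate instability in the weighted regressions.


An alternative IPW estimator reduces the likelihood of this by estimating
two different binary outcome models. One model leads to predicted selection
probability $\Phi(\mathbf{z}_{i1}'\hat{\boldsymbol{\gamma}}_t)$, as before. A second model adds auxiliary variables $\mathbf{v}_{i1}$,
say, leading to selection probability $\Phi(\mathbf{z}_{i1}'\tilde{\boldsymbol{\gamma}}_t + \mathbf{v}_{i1}'\tilde{\boldsymbol{\delta}}_t)$. The IPW estimator then
uses weight $s_{it} \times \Phi(\mathbf{z}_{i1}'\hat{\boldsymbol{\gamma}}_t)/\Phi(\mathbf{z}_{i1}'\tilde{\boldsymbol{\gamma}}_t + \mathbf{v}_{i1}'\tilde{\boldsymbol{\delta}}_t)$.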


The following example implements this alternative weighting for wave-2
first-difference OLS. We add selfempl, an indicator variable equal to one if
self-employed, as an auxiliary variable. Then,


. * IPW: Alternative weights based on auxiliary variables wave 2 versus wave 1
. qui probit s2 $IPWxlist L.selfempl if wave==2

. qui predict s2probaugment if e(sample)

. qui generate s2relweight=s2prob/s2probaugment

. tabstat s2relweight if e(sample)==1, stats(count min p10 p50 p90 max) col(s)

    Variable |         N       Min       p10       p50       p90       Max
 s2relweight |      1399  .9745781   .980992  .9900049  1.031482  1.074542

. sum s2prob s2probaugment s2relweight

    Variable |        Obs        Mean    Std. dev.        Min        Max
      s2prob |      1,399    .6961762    .0435256    .5250311   .7437167
s2probaugm~t |      1,399    .6961827    .0460661    .4950356   .7599604
 s2relweight |      1,399    1.000462     .021652    .9745781   1.074542

. qui regress D.lyearn $FDxlist [pweight=s2relweight] if wave==2, vce(robust)

. qui estimates store IPW2Aug


Comparison of IPW estimates 


The following table compares the unweighted first-difference OLS estimates
for wave 2 with the various IPW estimates for waves 2 and 3.


. * Unweighted, IPW, and sample-selection estimates: First-difference in wave 2
. qui regress D.lyearn $FDxlist if wave==2, vce(robust) noheader // Unweighted

. estimates store FD_UNW

. estimates table FD_UNW IPW2 IPW3 IPWboth IPW2Aug, b(%7.4f) eq(1) se stats(N r2)

    Variable |  FD_UNW      IPW2      IPW3    IPWboth   IPW2Aug
       lyhrs |
         D1. |  0.3034    0.3037    0.3083    0.3054    0.3019
             |  0.0505    0.0499    0.0776    0.0424    0.0504
        expr |
         D1. | -0.0677   -0.0791    0.0121   -0.1148   -0.0713
             |  0.0913    0.0930    0.1496    0.0394    0.0924
      exprsq |
         D1. |  0.0022    0.0022    0.0000    0.0021    0.0022
             |  0.0009    0.0009    0.0033    0.0008    0.0009
    hospwork |
         D1. |  0.0598    0.0529   -0.0235    0.0198    0.0591
             |  0.0601    0.0589    0.0526    0.0406    0.0598
       _cons |  0.0710    0.0618    0.0161    0.0185    0.0677
             |  0.0829    0.0846    0.0164    0.0162    0.0840
           N |     974       974       732      1706       974
          r2 |  0.0737    0.0764    0.0643    0.0722    0.0728

                                                   Legend: b/se


There is little difference in the coefficient of the highly statistically
significant regressor D.lyhrs across the various models.


19.11.4 Selection-based panel attrition model 


We continue with the example of first-difference OLS with attrition. The IPW
estimator can provide consistent estimates in situations where balanced or
unbalanced estimation is inconsistent. It requires the conditional MAR or
selection-on-observables assumption (19.9); that is, that

$$\Pr(s_{it} = 1|\mathbf{z}_{i1}, s_{i,t-1} = 1, \Delta y_{it}, \Delta\mathbf{x}_{it}) = \Pr(s_{it} = 1|\mathbf{z}_{i1})$$


If this assumption fails, then we are in a setting of MNAR or selection on
unobservables. Then we use a two-equation system for $\Delta y_{it}$ and $s_{it}$ with
errors that are correlated across equations. This model is a panel extension of
the selection model of section 19.6.


We first consider a model without fixed effects. Define latent variables
for the outcome ($y$) and selection indicator ($s$),

$$y_{it}^{*} = \mathbf{x}_{it}'\boldsymbol{\beta} + \mathbf{x}_{2i}'\boldsymbol{\beta}_2 + \varepsilon_{1it}$$

$$s_{it}^{*} = \mathbf{z}_{it}'\boldsymbol{\gamma} + \varepsilon_{2it}$$

where the equation errors $(\varepsilon_{1it}, \varepsilon_{2it})$ may be correlated. In the two-component
vector $(\mathbf{x}_{it}, \mathbf{x}_{2i})$, the first component $\mathbf{x}_{it}$ consists of time-varying
regressors, and the second component $\mathbf{x}_{2i}$ consists of time-invariant
regressors. We observe

$$y_{it} = s_{it} \times y_{it}^{*}$$

$$s_{it} = 1(s_{it}^{*} > 0)$$

In this selection model, the random shock $\varepsilon_{2it}$, which affects the
probability of attrition, is potentially correlated with the shock $\varepsilon_{1it}$, which
affects the outcome. Ignoring this correlation, as when the outcome equation
is estimated under conditional MAR assumptions, results in selection bias.


A number of panel-data estimators are available for estimating the 
selection model; see Wooldridge (2010, chap. 19.9). As in the case of the 
classic selection model for cross-sectional data, robust identification of the 
outcome parameter requires that the attrition equation contain some 
nontrivial regressors that do not directly affect the outcome. One potential 
difference from the cross-sectional case, however, comes from the possibility
that these variables can vary over $t$. Furthermore, with attrition, not only $y_{it}$
but also $\mathbf{x}_{it}$ is missing when $s_{it} = 0$.


First-difference model 


Rather than illustrate sample-selection methods for the preceding model in 
levels, we analyze the qualitatively similar first-difference model that has the 
advantage of being robust to the presence of time-invariant individual- 
specific effects. 


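The selection model equations are then

$$\Delta y_{it}^{*} = \Delta\mathbf{x}_{it}'\boldsymbol{\beta} + \Delta\varepsilon_{1it}$$

$$s_{it}^{*} = \mathbf{z}_{it}'\boldsymbol{\gamma} + \varepsilon_{2it}$$

where the equation errors $(\Delta\varepsilon_{1it}, \varepsilon_{2it})$ may be correlated. We observe

$$\Delta y_{it} = s_{it} \times \Delta y_{it}^{*}$$

$$s_{it} = 1(s_{it}^{*} > 0)$$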


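We first consider the selection probability. Assume that, conditional on
$s_{i,t-1} = 1$ and $\mathbf{z}_{it}$, the selection error $\varepsilon_{2it} \sim N(0, 1)$. Then,

$$\Pr(s_{it} = 1|s_{i,t-1} = 1, \mathbf{z}_{it}) = \Phi(\mathbf{z}_{it}'\boldsymbol{\gamma})$$

a probit model.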


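Second, assume that the error in the outcome equation is a multiple of the
selection error plus independent noise, so

$$\Delta\varepsilon_{1it} = \rho\,\varepsilon_{2it} + \eta_{it}$$

where $\eta_{it}$ is an independent error and we again condition on $s_{i,t-1} = 1$. By
previous results in section 19.6, it follows that, for the observations with $s_{it} = 1$,

$$E(\Delta y_{it}|\Delta\mathbf{x}_{it}, \mathbf{z}_{it}, s_{i,t-1} = 1, s_{it} = 1) = \Delta\mathbf{x}_{it}'\boldsymbol{\beta} + \rho\,\lambda(\mathbf{z}_{it}'\boldsymbol{\gamma})$$

where $\lambda(z) = \phi(z)/\Phi(z)$ is the inverse Mills ratio.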
A consistent estimator of $\lambda(\mathbf{z}_{it}'\boldsymbol{\gamma})$, denoted $\hat{\lambda}_{it}$, is generated by the probit
equation for the attrition event. Then, estimate by pooled OLS using only the
nonmissing observations the equation

$$\Delta y_{it} = \Delta\mathbf{x}_{it}'\boldsymbol{\beta} + \rho\,\hat{\lambda}_{it} + \{\eta_{it} + \rho(\varepsilon_{2it} - \lambda_{it}) + \rho(\lambda_{it} - \hat{\lambda}_{it})\}$$

where the three terms inside the curly brackets define the composite error on
the outcome equation. Under the assumption that all elements of $\Delta\mathbf{x}_{it}$ are
uncorrelated with the composite error term, the LS estimator is a consistent
estimator. However, because $\hat{\lambda}_{it}$ is a generated regressor, inference should be
based on, for example, a cluster bootstrap that is a cluster variant of the two-step
bootstrap given in section 12.4.5.
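A minimal sketch of such a two-step estimator with a cluster bootstrap follows, using simulated data. The program name fdtwostep, the variable names, and the data-generating process (true slope 0.3, coefficient 0.5 on the selection error) are all assumptions made for illustration; this is not the code of section 12.4.5. The program posts only the coefficient vector, so e(sample) is not set and bootstrap resamples all observations, including those lost to attrition, which the first-step probit needs.

```stata
* Simulated first-difference data with selection on unobservables
clear all
set seed 10101
set obs 1000
generate long id = _n
generate double z  = rnormal()                     // selection-equation regressor
generate double dx = rnormal()                     // differenced regressor
generate double e2 = rnormal()                     // selection error
generate double dy = 0.3*dx + 0.5*e2 + rnormal()   // outcome error = 0.5*e2 + noise
generate byte   s  = (0.5 + 0.8*z + e2 > 0)        // observed when s = 1
replace dy = . if s == 0

* Two-step estimator: probit, inverse Mills ratio, then OLS on selected sample
capture program drop fdtwostep
program define fdtwostep, eclass
    tempname b
    capture drop xbhat
    capture drop lambdahat
    probit s z
    predict double xbhat, xb
    generate double lambdahat = normalden(xbhat)/normal(xbhat)
    regress dy dx lambdahat if s == 1
    matrix `b' = e(b)
    ereturn post `b'
end

* Cluster bootstrap of both steps, resampling whole individuals
bootstrap _b, reps(200) seed(10101) cluster(id) nodots: fdtwostep
```

In this toy design there is one first-difference row per individual, so clustering on id is equivalent to the ordinary bootstrap; with several rows per individual, the same cluster(id) option resamples all of an individual's rows together. The coefficient on lambdahat estimates the multiple of the selection error entering the outcome error.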


A test of the null hypothesis of conditional MAR against the alternative of
selection bias may be based on $H_0: \rho = 0$ versus $H_1: \rho \neq 0$. Given quite
strong assumptions involved in its implementation and the complexity of the
robust variance estimator, the outcome of the test should be treated with
caution.


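More generally, we may estimate the selection equation separately in
each time period. So assume that $\Pr(s_{it} = 1|s_{i,t-1} = 1, \mathbf{z}_{it}) = \Phi(\mathbf{z}_{it}'\boldsymbol{\gamma}_t)$ and
$\Delta\varepsilon_{1it} = \rho_t\,\varepsilon_{2it} + \eta_{it}$, which in turn generates the time-varying Mills ratio
term $\lambda(\mathbf{z}_{it}'\boldsymbol{\gamma}_t)$. After obtaining $\hat{\lambda}_{it} = \lambda(\mathbf{z}_{it}'\hat{\boldsymbol{\gamma}}_t)$ by separate probit estimation for
each time period, we then do pooled OLS estimation for all periods of the
model

$$\Delta y_{it} = \Delta\mathbf{x}_{it}'\boldsymbol{\beta} + \rho_2 d_{2t}\hat{\lambda}_{it} + \cdots + \rho_T d_{Tt}\hat{\lambda}_{it} + v_{it}$$

where $d_{jt} = 1$ if $t = j$ and $d_{jt} = 0$ otherwise. Again, inference should be
based on a cluster bootstrap for a two-step estimator.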


An alternative specification is that in which one or more elements of $\mathbf{x}_{it}$
are endogenous. Then an IV or GMM-type estimator would be preferred. The
usual caveats regarding the choice of instruments will apply, and note that the
presence of serially correlated errors will affect both the selection of valid
instruments and the appropriate variance estimator.


Selection model estimates 


Here we present the MLE for the pooled data, although there are reasons to 
prefer the two-step estimator. Two-step estimation is left as an exercise for 
the interested reader. 


We use as the variables $\mathbf{z}_{it}$ in the selection equation the lagged-level
values of the variables in the outcome equation, as well as the lagged value of
the self-employment indicator variable selfemp1.


We first fit a pooled version of the Heckman selection model, jointly
estimating both the selection equation and the (first-differenced) outcome
equation over waves 2 and 3. We use ML estimates under the additional
assumption of normality because this can provide cluster-robust standard
errors.


. * Sample-selection model: Pooled first-difference OLS using wave 2 and 3 data
. generate dsel = .
(5,799 missing values generated)

. replace dsel = s2 if wave==2
(1,933 real changes made)

. replace dsel = s3 if wave==3
(1,933 real changes made)

. tabulate dsel wave, missing

           |           Survey wave
      dsel |         1          2          3 |     Total
-----------+---------------------------------+----------
         0 |         0        629        744 |     1,373
         1 |         0      1,304      1,189 |     2,493
         . |     1,933          0          0 |     1,933
-----------+---------------------------------+----------
     Total |     1,933      1,933      1,933 |     5,799

. heckman D.lyearn $FDxlist, select(dsel = $IPWxlist L.selfemp1)
> nolog vce(cluster id)

Heckman selection model                         Number of obs     =      2,703
(regression model with sample selection)              Selected    =      1,872
                                                      Nonselected =        831

                                                Wald chi2(4)      =      46.44
Log pseudolikelihood = -2672.479                Prob > chi2       =     0.0000

                          (Std. err. adjusted for 1,729 clusters in id)
-------------------------------------------------------------------------
             |               Robust
             | Coefficient   std. err.    P>|z|     [95% conf. interval]
-------------+-----------------------------------------------------------
D_lyearn     |
       lyhrs |
         D1. |   .2774803    .0438924    0.000      .1914528    .3635078
        expr |
         D1. |  -.0919311    .0392842    0.019     -.1689266   -.0149355
      exprsq |
         D1. |    .001918    .0008987    0.033      .0001565    .0036795
    hospwork |
         D1. |   .0059664    .0422144    0.888     -.0767723     .088705
       _cons |   .0107341    .0396295    0.786     -.0669384    .0884066
-------------+-----------------------------------------------------------
dsel         |
       lyhrs |
         L1. |  -.0349352    .0586778    0.552     -.1499415    .0800711
        expr |
         L1. |   .0563963    .0097657    0.000      .0372558    .0755367
      exprsq |
         L1. |  -.0010023    .0002369    0.000     -.0014667   -.0005379
    hospwork |
         L1. |  -.0592832    .0709145    0.403     -.1982731    .0797067
    selfempl |
         L1. |  -.0607765    .0695311    0.382     -.1970549    .0755019
       _cons |   .1795994    .4322929    0.678     -.6676791    1.026878
-------------+-----------------------------------------------------------
     /athrho |   .1005674     .175148    0.566     -.2427165    .4438512
    /lnsigma |  -.8637342    .0398811    0.000     -.9418997   -.7855686
-------------+-----------------------------------------------------------
         rho |   .1002297    .1733885               -.2380599    .4168316
       sigma |   .4215849    .0168133                .3898865    .4558604
      lambda |   .0422553    .0733096               -.1014288    .1859394
-------------------------------------------------------------------------
Wald test of indep. eqns. (rho = 0): chi2(1) = 0.33   Prob > chi2 = 0.5658


. estimates store Heckpool 


The formal test of independent equations has p = 0.57, indicating that 
there is no selection bias problem. We could just estimate the first-difference 
equation by OLS. 


We next fit the selection model separately for wave 2 and wave 3 and 
generate separate inverse-Mills ratio terms for each wave. 


Wave 2 estimation leads to 


. * Sample-selection model: First-difference OLS for wave 2 given wave 1 
. qui heckman D.lyearn $FDxlist if wave==2, select(s2 = $IPWxlist L.selfempl) 
> mills(mills2) vce(robust) 


. display "N = " e(N) " Independence test chi2(1) = " e(chi2_c) " p=" e(p_c) 
N = 1399 Independence test chi2(1) = 23.096628 p = 1.541e-06 


. estimates store Heck2 


Estimates for the outcome equation are presented in a table below. 


And wave-3 estimation leads to 


. * Sample-selection model: First-difference OLS for wave 3 given wave 1 and 
> wave 2 

. qui heckman D.lyearn $FDxlist if wave==3, select(s3 = $IPWxlist L.selfemp1) 
> mills(mills3) vce(robust) 


. display "N = " e(N) " Independence test chi2(1) = " e(chi2_c) " p =" e(p_c) 
N = 1304 Independence test chi2(1) = 2.5440936 p =.11070743


. estimates store Heck3 


Estimates for the outcome equation are presented in a table below. 


Pooled OLS estimation with separate inverse-Mills ratio terms for each 
wave yields 


. * Sample-selection model: Pooled first-difference OLS with different lambda for
> waves 2 and 3
. generate millsp = mills2
(4,400 missing values generated)

. replace millsp = mills3 if wave==3
(1,304 real changes made)

. regress D.lyearn $FDxlist i.wave#c.millsp, vce(cluster id)

Linear regression                               Number of obs     =      1,872
                                                F(6, 1139)        =       8.96
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0610
                                                Root MSE          =      .4211

                                  (Std. err. adjusted for 1,140 clusters in id)
---------------------------------------------------------------------------------
                |               Robust
       D.lyearn | Coefficient   std. err.      t    P>|t|    [95% conf. interval]
----------------+----------------------------------------------------------------
          lyhrs |
            D1. |   .2784121    .0438532     6.35   0.000       .19237    .3644541
           expr |
            D1. |  -.0509594    .0646665    -0.79   0.431    -.1778382    .0759194
         exprsq |
            D1. |   .0015953    .0009852     1.62   0.106    -.0003376    .0035283
       hospwork |
            D1. |   .0075139    .0422797     0.18   0.859    -.0754408    .0904687
  wave#c.millsp |
              2 |   .1706688    .1442039     1.18   0.237    -.1122662    .4536039
              3 |   .1057655    .1061797     1.00   0.319    -.1025642    .3140952
          _cons |  -.0232218    .0560811    -0.41   0.679    -.1332557    .0868121
---------------------------------------------------------------------------------

. test 2b.wave#c.millsp 3.wave#c.millsp

 ( 1)  2b.wave#c.millsp = 0
 ( 2)  3.wave#c.millsp = 0

       F(  2,  1139) =    0.70
            Prob > F =    0.4948


. estimates store Heckboth 


The standard errors correct for clustering but do not correct for the first-stage 
estimation of the inverse-Mills ratio terms. 


The unweighted pooled OLS estimates and the various Heckman selection 
model estimates are given in the following table. 


. * Unweighted, IPW and sample-selection estimates: First-difference in wave 2
. qui regress D.lyearn $FDxlist, vce(robust) noheader 


. estimates store FD_UNW 


. estimates table FD_UNW Heckpool Heck2 Heck3 Heckboth, b(%7.4f) eq(1) 
> se stats(N r2) keep($FDxlist) 


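----------------------------------------------------------------------
    Variable |   FD_UNW    Heckp~l      Heck2      Heck3    Heckb~h
-------------+--------------------------------------------------------
       lyhrs |
         D1. |   0.2766     0.2775     0.3015     0.2580     0.2784
             |   0.0398     0.0439     0.0499     0.0632     0.0439
        expr |
         D1. |  -0.0970    -0.0919    -0.0948     0.0373    -0.0510
             |   0.0379     0.0393     0.0878     0.1443     0.0647
      exprsq |
         D1. |   0.0021     0.0019     0.0031    -0.0001     0.0016
             |   0.0008     0.0009     0.0009     0.0032     0.0010
    hospwork |
         D1. |   0.0051     0.0060     0.0658    -0.0433     0.0075
             |   0.0377     0.0422     0.0617     0.0454     0.0423
-------------+--------------------------------------------------------
           N |     1872       2703       1399       1304       1872
          r2 |   0.0600                                      0.0610
----------------------------------------------------------------------
                                                        Legend: b/se


19.11.5 Sample refreshment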


In practice, panel surveys sometimes anticipate attrition and add an additional 
(refreshment) sample group to compensate for the loss of observations. This 
is another alternative to weighting and model-based methods. Correctly 
implemented, this approach solves the loss of estimation efficiency due to 
sample reduction in case the MAR assumption holds. If, however, attrition is 
selective and not random, then replacement by a random refreshment sample 
does not solve the problem. Whether sample refreshment reduces or 
eliminates the effects of sample attrition depends upon the design of the 
additional survey. 


Refreshment samples can provide additional information about the
attrition process, allowing for more robust and precise estimation than relying
solely on conventional methods. However, in practice, the motivation behind
refreshment sampling may be to obtain additional information on a
subpopulation of special interest; see Cheng and Trivedi (2015) for an
empirical example.


19.12 Additional resources 


For tobit estimation, the relevant entries are [R] tobit, [R] tobit 
postestimation, [R] ivtobit, and [R] intreg. Useful community-contributed 
commands are clad and tobcm. Various MEs can be computed by using 
margins with several different predict options. For tobit panel estimation, 
the relevant command is [XT] xttobit, whose application is covered in 
chapter 22. Normal selection models can be fit using the heckman 
command. For details of fitting nonnormal selection models using copulas 
with the heckmancopula command, see Hasebe (2013). 


The methods in this chapter are highly parametric. Less parametric 
methods lead to partially identified models with bounds being placed using 
moment inequalities. For a recent survey, see Molinari (2020). 


The eintreg and eregress commands, members of the class of ERM 
commands for recursive models, provide ML estimates of a wide range of 
models with endogenous regressors and sample selection; see section 23.7 
for details. 


19.13 Exercises 


1. Consider the “linear version” of the tobit model used in this chapter.
Using tests of homoskedasticity and normality, compare the outcome
of the tests with those for the log version of the model.


2. Using the linear form of the tobit model in the preceding exercise,
compare average predicted expenditure levels for those with insurance
and those without insurance (ins=0). Compare these results with those
from the tobit model for log(ambexp).


3. Suppose we want to study the sensitivity of the predicted expenditure
from the log form of the tobit model to neglected heteroskedasticity.
Observe from the table in section 19.8 that the prediction formula
involves the variance parameter, $\sigma^2$, that will be replaced by its
estimate. Using the censoring threshold 0, draw a simulated
heteroskedastic sample from a lognormal regression model with a
single exogenous variable. Consider two levels of heteroskedasticity,
low and high. By considering variations in the estimated $\sigma^2$, show how
the resulting biases in the estimate of $\sigma^2$ from the homoskedastic tobit
model lead to biases in the mean prediction.


4. Repeat the simulation exercise using regression errors that are drawn
from a $\chi^2(5)$ distribution. Recenter the simulated draws by subtracting
the mean so that the recentered errors have a zero mean. Summarize
the results of the prediction exercise for this case.


5. A conditional predictor for levels $E(y|\mathbf{x}, y > 0)$ mentioned in
section 4.2, given parameters of a model fit in logs, is
$\exp(\mathbf{x}'\hat{\boldsymbol{\beta}})\, N^{-1}\sum_i \exp(\hat{\varepsilon}_i)$. This expression is based on the assumption
that $\varepsilon_i$ are independent and identically distributed but normality is not
assumed. Apply this conditional predictor to both the parameters of the
two-part and selection models fit by the two-step procedure, and obtain
estimates of $E(y|\mathbf{x}, y > 0)$ and $E(y|\mathbf{x})$. Compare the results with
those given in section 19.8.


6. Repeat the calculations of scores, gres1, and gres2 reported in
section 19.4.6. Test that the calculations are done correctly; both score
components should have a zero-mean property.


7. Make an unbalanced panel dataset by using the data of section 8.4 but
then typing set seed 10101 and drop if runiform() < 0.2. This will
randomly drop 20% of the individual-year observations. Type
xtdescribe. Do you obtain the expected patterns of missing data? Use
xtsum to describe the variation in id, t, wage, ed, and south. How do
the results compare with those from the full panel? Use xttab and
xttrans to provide interpretations of how south changes for
individuals over time. Compare the within estimator with that in
section 8.5 using the balanced panel.


8. Return to section 19.11.1, which analyzes the effect of IPW on the
estimates. Only a subset of results is reproduced. Verify that the main
conclusions of the exercise also apply to the estimators for which the
results are not reported. The code is already provided.


Chapter 20 
Count-data models 


20.1 Introduction 


In many contexts, the outcome of interest is a nonnegative integer, or a
count, denoted by $y$, $y \in \mathbb{N}_0 = \{0, 1, 2, \ldots\}$. Examples can be found in
demography, economics, ecology, environmental studies, insurance, and
finance, to mention just a few of the areas of application.


The objective is to analyze $y$ in a regression setting, given a vector of $K$
covariates, $\mathbf{x}$. Because the response variable is discrete, its distribution
places probability mass at nonnegative integer values only. Fully parametric
formulations of count models accommodate this property of the
distribution. Some semiparametric regression models only accommodate
$y \geq 0$ but not discreteness. Count regressions are nonlinear; $E(y|\mathbf{x})$ is
usually a nonlinear function, most commonly a single-index function like
$\exp(\mathbf{x}'\boldsymbol{\beta})$. Several special features of count regression models are
intimately connected to discreteness and nonlinearity.


If interest lies solely in modeling the conditional mean of $y$, and it is felt
that a good model for this is $\exp(\mathbf{x}'\boldsymbol{\beta})$, then Poisson regression with
inference based on appropriate robust standard errors is more than adequate.
Furthermore, Poisson regression for modeling the conditional mean can be
applied to dependent variables that are continuous and nonnegative, not just
to counts. In particular, for right-skewed nonnegative data, Poisson
regression can avoid the complications that can arise with ordinary least-squares
(OLS) regression in logs when some observations take value zero.
This use of Poisson regression is also called exponential regression.


Standard complications in analyzing count data include the following: 
the presence of unobserved heterogeneity akin to omitted variables; the 
small-mean property of $y$ as manifested in the presence of many zeros,
sometimes an “excess” of zeros; truncation in the observed distribution of $y$;
and endogenous regressors. To deal with these topics, one may have to use
commands other than the poisson command. For models with fixed effects 
(FE) the Poisson FE model has the special property that, like the linear FE 
model, it does not suffer from the incidental parameters problem; see 
section 22.6.6. 


The chapter begins with the basic Poisson and negative binomial (NB) 
models, using the poisson and nbreg commands, and then details some 
empirically important extensions, including the (double) hurdle or two-part, 
finite-mixture, and zero-inflated (or point-mass) models. The last part of the 
chapter deals with complications arising from endogenous regressors, 
clustered observations, and quantile regression. 


20.2 Modeling strategies for count data 


The natural starting point for analyses of counts is the Poisson distribution,
with mean $\mu$ that is necessarily greater than zero because counts are
nonnegative. For Poisson regression, the standard mean parameterization is
$\mu = \exp(\mathbf{x}'\boldsymbol{\beta})$ to ensure that $\mu > 0$.


As explained below, the Poisson maximum-likelihood estimator (MLE)
for $\boldsymbol{\beta}$ has the robustness property that its consistency requires only that the
conditional mean be correctly specified; that is, that $E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})$.
The data need not be Poisson distributed. This robustness property, enjoyed
by few MLEs, is analogous to consistency of the MLE in the linear model with
independent homoskedastic normally distributed errors requiring only that
$E(y|\mathbf{x}) = \mathbf{x}'\boldsymbol{\beta}$.


This leads to two distinct modeling strategies for count data.


The first approach, adequate for many analyses, is a quasi-ML approach 
that performs Poisson regression using the poisson command. This is 
analogous to simply performing OLS regression for a continuous dependent 
variable. One complication, however, is that robust standard errors should 
always be used, as explained below. 


The second approach is a fully parametric approach. This is necessary 
given complications such as predicting the probability that counts lie in a 
certain range or modeling count data that are truncated or censored. In that 
case, the Poisson distribution is inadequate, and one should use more general 
parametric models such as the NB or finite mixture models. 


We first present the Poisson and NB distributions before providing more 
detail on the modeling strategies. 


20.2.1 Generated Poisson data 


The univariate Poisson distribution, denoted by Poisson$(y|\mu)$, for the
number of occurrences of the event $y$ over a fixed exposure period has the
probability mass function


$$\Pr(Y = y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \tag{20.1}$$


where $\mu$ is the intensity or rate parameter. The first two moments are


$$E(Y) = \mu, \qquad \mathrm{Var}(Y) = \mu$$


This shows the well-known equality of mean and variance property, also
called the equidispersion property, of the Poisson distribution. And it implies
that in a Poisson regression $\mathrm{Var}(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})$, so the model is
intrinsically heteroskedastic.


To illustrate some features of Poisson-distributed data, we use the
rpoisson() function to make draws from the Poisson$(y|\mu = 1)$ distribution.


. * Poisson (mu=1) generated data 
. qui set obs 10000 


. set seed 10101 // Set the seed ! 
. generate xpois = rpoisson(1) // Draw from Poisson(mu=1) 
. qui histogram xpois, discrete xtitle("Poisson") saving(histpois.gph, replace) 
. summarize xpois

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       xpois |     10,000       .9989    1.004689          0          7


. tabulate xpois 


xpois Freq. Percent Cum. 
0 3,699 36.99 36.99 
1 3,666 36.66 73.65 
2 1,828 18.28 91.93 
3 609 6.09 98.02 
4 155 1.55 99.57 
5 39 0.39 99.96 
6 3 0.03 99.99 
7 1 0.01 100.00 

Total 10,000 100.00 


The expected frequency of zeros from (20.1) is
$\Pr(Y = 0|\mu = 1) = e^{-1} = 0.368$. The simulated sample has nearly 37%
zeros. Clearly, the larger is $\mu$, the smaller will be the proportion of zeros; for
example, $\Pr(Y = 0|\mu = 5) = e^{-5} = 0.0067$.
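The zero probabilities quoted above follow directly from the Poisson probability mass function in (20.1). As a check, here is a minimal Python sketch; the function name `poisson_pmf` is a hypothetical helper introduced for illustration, not a Stata function.

```python
import math


def poisson_pmf(y: int, mu: float) -> float:
    """Poisson probability Pr(Y = y | mu) = exp(-mu) * mu**y / y!."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


if __name__ == "__main__":
    # Probability of a zero count for a small mean and a larger mean
    print(round(poisson_pmf(0, 1.0), 3))
    print(round(poisson_pmf(0, 5.0), 4))
    # Share of probability mass on the values 0 to 3 when mu = 1
    print(round(sum(poisson_pmf(y, 1.0) for y in range(4)), 4))
```

With a mean of 1, the values 0 to 3 carry about 98% of the probability mass, which matches the clustering seen in the simulated sample.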


For data with a small mean, as for example in the case of the number of
children born in a family (or individual data on annual number of accidents
or hospitalizations), zero observations are an important feature of the data.
Further, when the mean is small, a high proportion of the sample will cluster
on relatively few distinct values. In this example, about 98% of the
observations cluster on just four distinct values.


The generated data also reflect the equidispersion property, that is, 
equality of mean and variance of Y , because the standard deviation and 
hence variance are close to 1. 


20.2.2 Overdispersion and NB data 


The equidispersion property of the Poisson is commonly violated in applied 
work because overdispersion is common. Then the (conditional) variance 
exceeds the (conditional) mean. Such additional dispersion can be accounted 
for in many ways, of which the presence of unobserved heterogeneity is one 
of the most common. 


Unobserved heterogeneity, which generates additional variability in $y$,
can be generated by introducing multiplicative randomness. We replace $\mu$
with $\mu v$, where $v$ is a random variable, hence $y \sim$ Poisson$(y|\mu v)$. Suppose
we specify $v$ such that $E(v) = 1$ and $\mathrm{Var}(v) = \sigma^2$. Then it is
straightforward to show that $v$ preserves the mean but increases dispersion.
Specifically, $E(y) = \mu$ and $\mathrm{Var}(y) = \mu(1 + \mu\sigma^2) > E(y) = \mu$. The term
“overdispersion” describes the feature $\mathrm{Var}(y) > E(y)$ or, more precisely,
$\mathrm{Var}(y|\mathbf{x}) > E(y|\mathbf{x})$ in a regression model.


In the well-known special case that $v \sim \mathrm{Gamma}(1, \alpha)$, where $\alpha$ is the
variance parameter of the gamma distribution, the marginal distribution of $y$
is a Poisson-gamma mixture with a closed form, the NB distribution
denoted by NB$(\mu, \alpha)$, whose probability mass function is


$$\Pr(Y = y \mid \mu, \alpha) = \frac{\Gamma(\alpha^{-1} + y)}{\Gamma(\alpha^{-1})\,\Gamma(y + 1)} \left(\frac{\alpha^{-1}}{\alpha^{-1} + \mu}\right)^{\alpha^{-1}} \left(\frac{\mu}{\mu + \alpha^{-1}}\right)^{y} \tag{20.2}$$


where $\Gamma(\cdot)$ denotes the gamma integral that specializes to a factorial for an
integer argument. The NB model is more general than the Poisson model
because it accommodates overdispersion and it reduces to the Poisson model
as $\alpha \to 0$. The moments of the NB2 are $E(y|\mu, \alpha) = \mu$ and
$\mathrm{Var}(y|\mu, \alpha) = \mu(1 + \alpha\mu)$. Empirically, the quadratic variance function is a
versatile approximation in a wide variety of cases of overdispersed data.


The NB regression model lets $\mu = \exp(\mathbf{x}'\boldsymbol{\beta})$ and leaves $\alpha$ as a constant.
The default option for the NB regression in Stata is the version with a
quadratic variance (NB2). Another variant of NB in the literature has a linear
variance function, $\mathrm{Var}(y|\mu, \alpha) = (1 + \alpha)\mu$, and is called the NB1 model. See
Cameron and Trivedi (2005, chap. 20.4).


Using the mixture interpretation of the NB model, we simulate a sample
from the NB$(\mu = 1, \alpha = 1)$ distribution. We first use the rgamma(1,1)
function to obtain the gamma draw, $v$, with a mean of $1 \times 1 = 1$ and a
variance of $\alpha = 1 \times 1^2 = 1$; see section 5.2.4. We then obtain Poisson draws
with $\mu v = 1 \times v = v$, using the rpoisson() function with the argument $v$.


. * Negative binomial (mu=1 var=2) generated data 
. set seed 10101 // Set the seed ! 


. generate xg = rgamma(1,1) 
. generate xnegbin = rpoisson(xg) // NB generated as a Poisson-gamma mixture 


. qui histogram xnegbin, discrete xtitle("Negative binomial")
> saving(histnb2.gph, replace)


. summarize xnegbin 


Variable Obs Mean Std. dev. Min Max 


xnegbin 10,000 1.0044 1.422949 0 15 


. tabulate xnegbin 


xnegbin Freq. Percent Cum. 
0 4,975 49.75 49.75 
1 2,501 25.01 74.76 
2 1,282 12.82 87.58 
3 625 6.25 93.83 
4 319 3.19 97.02 
5 143 1.43 98.45 
6 75 0.75 99.20 
7 33 0.33 99.53 
8 22 0.22 99.75 
9 9 0.09 99.84 
10 9 0.09 99.93 
11 4 0.04 99.97 
13 2 0.02 99.99 
15 1 0.01 100.00 
Total 10,000 100.00 


As expected, the mean is close to 1 and the variance of $1.42^2 \approx 2.02$ is close
to $(1 + 1) \times 1 = 2$. Relative to the Poisson(1) pseudorandom draws, this
sample has more zeros, a longer right tail, and a variance-to-mean ratio in
excess of 1. These features are a consequence of introducing the
multiplicative heterogeneity term.
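The same Poisson-gamma mixture can be sketched outside Stata. The Python code below is an illustration under stated assumptions: the helper names are hypothetical, the Poisson draws use Knuth's multiplication method (adequate for the small means used here), and because Python's random-number generator differs from Stata's, the draws will not reproduce the Stata output above even with the same seed.

```python
import math
import random


def draw_poisson(rng: random.Random, mu: float) -> int:
    """Draw one Poisson(mu) variate by Knuth's multiplication method (small mu only)."""
    limit = math.exp(-mu)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_nb2(n: int, mu: float, alpha: float, seed: int) -> list:
    """Simulate NB2(mu, alpha) counts as Poisson(mu * v), with v ~ gamma, E(v)=1, Var(v)=alpha."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # gammavariate(shape, scale): mean = shape*scale = 1, variance = shape*scale^2 = alpha
        v = rng.gammavariate(1.0 / alpha, alpha)
        draws.append(draw_poisson(rng, mu * v))
    return draws


def mean_and_variance(values: list) -> tuple:
    """Sample mean and (n - 1)-denominator sample variance."""
    n = len(values)
    m = sum(values) / n
    return m, sum((x - m) ** 2 for x in values) / (n - 1)


if __name__ == "__main__":
    sample = simulate_nb2(n=20000, mu=1.0, alpha=1.0, seed=10101)
    m, v = mean_and_variance(sample)
    print("mean:", m, "variance:", v, "theoretical variance:", 1.0 * (1 + 1.0 * 1.0))
```

The sample mean should be near $\mu = 1$ and the sample variance near $\mu(1 + \alpha\mu) = 2$, up to simulation noise.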


The rnbinomial() function can instead be used to make direct draws
from the NB distribution, but because it uses an alternative parameterization
of the NB distribution, it is easier to use the above Poisson-gamma mixture.


20.2.3 Conditional mean modeling strategies 


Given (20.1), $\mu = \exp(\mathbf{x}'\boldsymbol{\beta})$, and the assumption that the observations
$(y_i|\mathbf{x}_i)$ are independent, Poisson ML is often the starting point of a modeling
exercise, especially if the entire distribution and not just the conditional
mean is the object of interest.


Count data are often overdispersed. One approach is to maintain the
conditional mean assumption $E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})$. Then one can continue to
use the Poisson MLE, which retains its consistency. However, one must then
relax the equidispersion assumption and obtain a robust estimate of the
variance-covariance matrix of the estimator. For independent data, one
obtains heteroskedastic-robust standard errors using the vce(robust) option.
For within-cluster correlation, one instead obtains cluster-robust standard
errors using the vce(cluster clustvar) option.


Generalized linear models are a class of models for which the 
conditional mean modeling approach can be applied. The Poisson model is a 
member of this class, so Poisson regression can also be performed as a 
special case of the glm command. And the conditional mean approach is the 
basis for generalized method of moments (GMM) estimation of models with 
endogenous regressors. 


The NB2 model, the default for the nbreg command, also has the property
that consistency requires only $E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})$. Thus, either the Poisson
model or the NB2 model could be used if $E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})$. While the NB2
model estimator may be more efficient, in practice, the efficiency gains can
be slight, and it is common to simply use the Poisson model if interest lies in
modeling only the conditional mean and not conditional probabilities.


20.2.4 Poisson regression for continuous nonnegative dependent variable 


Poisson regression is not restricted to count data and is applicable to
continuous nonnegative data where a good model for the conditional mean is
$E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})$. Thus, Poisson regression is often called exponential
regression.


In particular, the lognormal model is often used for continuous right-skewed
positive data, with $E(\ln y|\mathbf{x}) = \mathbf{x}'\boldsymbol{\beta}$. This has the limitation of not
being applicable if some values of $y$ equal 0. One approach is to use the two-
part model of section 19.5. A simpler approach is to perform Poisson
regression in a model with $E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})$, basing inference on robust
standard errors. This requires the assumption that zero observations come
from the same process as positive observations. This model also permits
direct prediction of $E(y|\mathbf{x})$ without the complication of retransformation
bias.


A leading application is to model trade volume between countries using 
the gravity model of trade; see Santos Silva and Tenreyro (2006). 


20.2.5 Fully parametric modeling strategies 


A fully parametric approach begins with the NB2 model. For example, this is 
a starting point if predicted probabilities are desired. 


But complications such as censoring or truncation or different treatment
of zero counts lead to $E(y|\mathbf{x}) \neq \exp(\mathbf{x}'\boldsymbol{\beta})$. The empirical examples in
sections 20.4 to 20.6 illustrate several extensions of simple Poisson or NB2
regression.


20.3 Poisson and negative binomial models 


In this section, we fit Poisson and NB2 count-data models for the annual number 
of doctor visits (docvis). 


20.3.1 Data summary 


The data are a cross-sectional sample from the U.S. Medical Expenditure Panel 
Survey for 2003. We model the annual number of doctor visits (docvis) using a 
sample of the Medicare population aged 65 and higher. 


The covariates in the regressions are age (age), squared age (age2), years of
education (educyr), presence of activity limitation (actlim), number of chronic
conditions (totchr), having private insurance that supplements Medicare 
(private), and having public Medicaid insurance for low-income individuals 
that supplements Medicare insurance (medicaid). 


Summary statistics for the dependent variable and regressors are as follows: 


. * Summary statistics for doctor-visits data 
. qui use mus220mepsdocvis, clear 


. global xlist private medicaid age age2 educyr actlim totchr 


. summarize docvis $xlist 


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      docvis |      3,677    6.822682    7.394937          0        144
     private |      3,677    .4966005    .5000564          0          1
    medicaid |      3,677     .166712    .3727692          0          1
         age |      3,677    74.24476    6.376638         65         90
        age2 |      3,677    5552.936    958.9996       4225       8100
-------------+---------------------------------------------------------
      educyr |      3,677    11.18031    3.827676          0         17
      actlim |      3,677     .333152    .4714045          0          1
      totchr |      3,677    1.843351    1.350026          0          8


The sampled individuals are aged 65-90 years, and a considerable portion has an
activity limitation or chronic condition. The sample mean of docvis is 6.82, and
the sample variance is $7.39^2 = 54.61$, so there is great overdispersion.


For count data, one should always obtain a frequency distribution or
histogram. To reduce output, we create a variable, dvrange, with counts of 11-40
recoded as 40 and counts of 41-143 recoded as 143. We have


. * Tabulate docvis after recoding values > 10 to ranges 11-40 or 41-143 


. generate dvrange = docvis 


. recode dvrange (11/40 = 40) (41/143 = 143) 


(786 changes made to dvrange) 


. tabulate dvrange 


    dvrange |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        401       10.91       10.91
          1 |        314        8.54       19.45
          2 |        358        9.74       29.18
          3 |        334        9.08       38.26
          4 |        339        9.22       47.48
          5 |        266        7.23       54.72
          6 |        231        6.28       61.00
          7 |        202        5.49       66.49
          8 |        179        4.87       71.36
          9 |        154        4.19       75.55
         10 |        108        2.94       78.49
         40 |        774       21.05       99.54
        143 |         16        0.44       99.97
        144 |          1        0.03      100.00
------------+-----------------------------------
      Total |      3,677      100.00


The distribution has a long right tail: 22% of observations exceed 10, and the
maximum is 144. More than 99% of the values are under 40. The proportion of
zeros is 10.9%. This is relatively low for these types of data, partly because the
data pertain to the elderly population. Samples of the younger and usually
healthier population often have as many as 90% zero observations for some
health outcomes.


20.3.2 Poisson model 


For the Poisson model, the probability mass function is the Poisson distribution
given in (20.1), and the default is the exponential mean parameterization


$$\mu_i = \exp(\mathbf{x}_i'\boldsymbol{\beta}), \qquad i = 1, \ldots, N \tag{20.3}$$


where by assumption there are $K$ linearly independent covariates, usually
including a constant. This specification restricts the conditional mean to be
positive.


The Poisson MLE, denoted by $\hat{\boldsymbol{\beta}}_P$, is the solution to $K$ nonlinear equations
corresponding to the ML first-order conditions


$$\sum_{i=1}^{N} \{y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\}\,\mathbf{x}_i = \mathbf{0} \tag{20.4}$$


If $\mathbf{x}_i$ includes a constant term, then the residuals $y_i - \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}}_P)$ sum to zero
based on (20.4). Because the log-likelihood function is globally concave, the
iterative solution algorithm, usually the Newton-Raphson (see section 16.2),
converges fast to a unique global maximum.
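To make the first-order conditions (20.4) concrete, the following Python sketch applies Newton-Raphson to a Poisson model with a constant and a single regressor. It is a minimal illustration, not Stata's implementation: the function name, the starting values, and the tiny dataset in the demo are all assumptions made for the example.

```python
import math


def fit_poisson_one_regressor(x, y, max_iter=50, tol=1e-12):
    """Newton-Raphson for E(y|x) = exp(b0 + b1*x); returns (b0, b1).

    Assumes sum(y) > 0 and that x is not constant.
    """
    n = len(y)
    b0 = math.log(sum(y) / n)  # start at the intercept-only MLE
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: sum_i (y_i - mu_i) * (1, x_i)'
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian: sum_i mu_i * (1, x_i)'(1, x_i)
        a00 = sum(mu)
        a01 = sum(mi * xi for xi, mi in zip(x, mu))
        a11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = a00 * a11 - a01 * a01
        # Newton step: (negative Hessian)^{-1} * score, using the 2x2 inverse
        d0 = (a11 * g0 - a01 * g1) / det
        d1 = (a00 * g1 - a01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


if __name__ == "__main__":
    x = [0, 1, 2, 3, 4, 5]   # made-up regressor values
    y = [1, 1, 2, 3, 5, 8]   # made-up counts
    b0, b1 = fit_poisson_one_regressor(x, y)
    resid = [yi - math.exp(b0 + b1 * xi) for xi, yi in zip(x, y)]
    print("b0, b1:", b0, b1)
    print("sum of residuals:", sum(resid))
```

Because the model includes a constant, the residuals at the solution sum to zero up to rounding error, which is the first of the two equations in (20.4).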


An informal approach to consistency is that it requires that the sample
moment conditions defining an estimator should hold in expectation in the
population. Thus, consistency requires that the left-hand side of (20.4) have
expected value 0. A sufficient condition is that $E(y_i|\mathbf{x}_i) = \exp(\mathbf{x}_i'\boldsymbol{\beta})$. So
consistency of the Poisson MLE requires only that the functional form for the
conditional mean be correctly specified.


By standard ML theory, if the Poisson model is parametrically correctly
specified, the estimator $\hat{\boldsymbol{\beta}}_P$ is consistent for $\boldsymbol{\beta}$, with a covariance matrix
estimated by


$$\hat{V}(\hat{\boldsymbol{\beta}}_P) = \left(\sum_{i=1}^{N} \hat{\mu}_i\,\mathbf{x}_i\mathbf{x}_i'\right)^{-1} \tag{20.5}$$


where $\hat{\mu}_i = \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}}_P)$. We show below that it is usually very misleading to use
(20.5) to estimate the variance-covariance matrix of the estimator (VCE) of $\hat{\boldsymbol{\beta}}_P$,
and it is better to use robust standard errors given in (20.6).


The Poisson MLE is implemented with the poisson command, and the default
estimate of the VCE is that in (20.5). The syntax for poisson, similar to that for
regress, is

poisson depvar [indepvars] [if] [in] [weight] [, options]


The vce(robust) option yields a robust estimate of the VCE.


Two commonly used options are offset() and exposure(). Suppose
$z$ is an exposure variable, such as time. Then, as $z$ doubles, we expect
the count, $y$, to double. Then $E(y|z, \mathbf{x}_2) = z\exp(\mathbf{x}_2'\boldsymbol{\beta}_2) = \exp(\ln z + \mathbf{x}_2'\boldsymbol{\beta}_2)$, so
$\ln z$ enters the index with its coefficient constrained to equal 1. If the dataset
contains the variable z, this constraint is imposed by using the exposure(z)
option, which adds ln(z) to the index. If instead the dataset contains the variable
lnz = ln(z), this constraint is imposed by using the offset(lnz) option.


Poisson model results 


We first obtain and discuss the results for Poisson ML estimation. 


. * Poisson with default ML standard errors 
. poisson docvis $xlist, nolog 


Poisson regression                              Number of obs     =      3,677
                                                LR chi2(7)        =    4477.98
                                                Prob > chi2       =     0.0000
Log likelihood = -15019.64                      Pseudo R2         =     0.1297

------------------------------------------------------------------------------
      docvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .1422324   .0143311     9.92   0.000      .114144    .1703208
    medicaid |   .0970005   .0189307     5.12   0.000     .0598969     .134104
         age |   .2936722   .0259563    11.31   0.000     .2427988    .3445457
        age2 |  -.0019311   .0001724   -11.20   0.000    -.0022691   -.0015931
      educyr |   .0295562    .001882    15.70   0.000     .0258676    .0332449
      actlim |   .1864213    .014566    12.80   0.000     .1578726    .2149701
      totchr |   .2483898   .0046447    53.48   0.000     .2392864    .2574933
       _cons |  -10.18221   .9720115   -10.48   0.000    -12.08732   -8.277101
------------------------------------------------------------------------------


The sign of the coefficients gives the sign of marginal effects (MEs). So on 
average, docvis is increasing in education, number of chronic conditions, being 
limited in activity, and having either type of supplementary health insurance. 
These results are consistent with a priori expectations. The effect of age is more 
complicated because it appears quadratically. 


Exactly the same coefficient estimates are obtained using the glm command;
see section 13.3.8.


The top part of the poisson output lists sample size, the likelihood-ratio (LR)
test for the joint significance of the seven regressors, the p-value associated with
the test, and the pseudo-$R^2$ statistic, which is intended to serve as a measure of
the goodness of fit of the model (see section 13.8.1).


An alternative measure of the fit of the model is the squared coefficient of 
correlation between the fitted and observed values of the dependent variable. 
This is not provided by poisson but is easily computed as follows: 


. * Poisson: Squared correlation between y and yhat
. predict yphat, n

. qui correlate docvis yphat

. display "Squared correlation between y and yhat = " r(rho)^2
Squared correlation between y and yhat = .1530784


The squared correlation coefficient is low but reasonable for cross-sectional data. 


The variables in the Poisson model appear to be highly statistically 
significant, but this is partly due to great underestimation of the standard errors, 
as we explain next. 


Heteroskedastic-robust estimate of VCE for Poisson MLE 


As explained after (20.4), the Poisson MLE retains consistency provided that the
conditional mean function in (20.3) is correctly specified. The dependent
variable y need not actually be Poisson distributed and need not be a count
variable.


When the dependent variable is not Poisson distributed, but the conditional 
mean function is specified by (20.3), we can use the pseudo-ML or quasi-ML 
approach (see section 13.3.1), which maximizes the Poisson likelihood function 
but uses the robust estimate of the VCE, 


$$\hat{V}_{Rob}(\hat\beta_P) = \left(\sum_{i=1}^N \hat\mu_i x_i x_i'\right)^{-1} \left(\sum_{i=1}^N (y_i - \hat\mu_i)^2 x_i x_i'\right) \left(\sum_{i=1}^N \hat\mu_i x_i x_i'\right)^{-1} \tag{20.6}$$


where $\hat\mu_i = \exp(x_i'\hat\beta_P)$. That is, we use the Poisson MLE to obtain our point
estimates, but we obtain robust estimates of the VCE. With overdispersion, the
variances will be larger using (20.6) than (20.5) because with equidispersion
(20.6) reduces to (20.5), but with overdispersion, $(y_i - \hat\mu_i)^2 > \hat\mu_i$, on average. In
the rare case of underdispersion, this ordering is reversed.


This preferred estimate of the VCE is obtained by using the vce(robust)
option of poisson. We obtain


. * Poisson with robust standard errors
. poisson docvis $xlist, vce(robust) nolog // Poisson robust SEs

Poisson regression                              Number of obs =   3,677
                                                Wald chi2(7)  =  720.43
                                                Prob > chi2   =  0.0000
Log pseudolikelihood = -15019.64                Pseudo R2     =  0.1297

------------------------------------------------------------------------------
             |               Robust
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .1422324    .036356     3.91   0.000      .070976    .2134889
    medicaid |   .0970005   .0568264     1.71   0.088    -.0143773    .2083783
         age |   .2936722   .0629776     4.66   0.000     .1702383    .4171061
        age2 |  -.0019311   .0004166    -4.64   0.000    -.0027475   -.0011147
      educyr |   .0295562   .0048454     6.10   0.000     .0200594     .039053
      actlim |   .1864213   .0396569     4.70   0.000     .1086953    .2641474
      totchr |   .2483898   .0125786    19.75   0.000     .2237361    .2730435
       _cons |  -10.18221   2.369212    -4.30   0.000    -14.82578   -5.538638
------------------------------------------------------------------------------


Compared with the Poisson MLE standard errors, the robust standard errors are
roughly 2.4 to 3.0 times as large. This is a very common feature of results for
Poisson regression applied to overdispersed data, and microeconometric count
data are often considerably overdispersed.
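

The inflation of robust relative to default standard errors can be reproduced
on simulated data. The sketch below is illustrative only: the variable names
and parameter values are hypothetical, and overdispersion is induced by a
multiplicative gamma heterogeneity term with mean 1.

```stata
* Sketch: default versus robust standard errors with overdispersed counts
clear
set obs 5000
set seed 10101
generate x  = rnormal()                // hypothetical regressor
generate nu = rgamma(1, 1)             // gamma heterogeneity: mean 1, variance 1
generate y  = rpoisson(nu*exp(1 + 0.5*x))
quietly poisson y x
estimates store default
quietly poisson y x, vce(robust)
estimates store robust
* Same coefficients in both columns; only the standard errors differ
estimates table default robust, b(%9.4f) se(%9.4f)
```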


Cluster-robust estimate of VCE for Poisson MLE


For observations correlated within cluster and independent across clusters, we
instead obtain cluster-robust standard errors using the vce(cluster clustvar)
option. This corrects for both overdispersion and clustering.


Test of overdispersion 


For completeness, we present a formal test of equidispersion, even though 
common practice for Poisson regression is to always obtain robust standard 
errors that are valid even if data are overdispersed or underdispersed. (Similarly, 
it is common to always use robust standard errors after OLS regression, rather 
than first testing whether errors are homoskedastic.) 


A formal test of the null hypothesis of equidispersion, $\mathrm{Var}(y|x) = E(y|x)$,
against the alternative of overdispersion can be based on the equation


$$\mathrm{Var}(y|x) = E(y|x) + \alpha \{E(y|x)\}^2$$


which is the variance function for the NB2 model. We test $H_0: \alpha = 0$ against
$H_1: \alpha > 0$.


The test can be implemented by an auxiliary regression of the generated
dependent variable, $\{(y - \hat\mu)^2 - y\}/\hat\mu$, on $\hat\mu$, without an intercept term, and
performing a t test of whether the coefficient of $\hat\mu$ is zero; see Cameron and
Trivedi (2005, 670-671) for details of this and other specifications of
overdispersion.
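

The logic of the auxiliary regression can be seen in two lines. The sketch
below assumes the NB2 variance function $\mathrm{Var}(y|x) = \mu + \alpha\mu^2$ and treats
$\mu = E(y|x)$ as known; in practice the fitted value $\hat\mu$ replaces it.

```latex
\begin{aligned}
E\{(y-\mu)^2 - y \mid x\} &= \mathrm{Var}(y|x) - E(y|x) = \alpha\mu^2 , \\
E\left[\frac{(y-\mu)^2 - y}{\mu} \,\Big|\, x\right] &= \alpha\mu .
\end{aligned}
```

So a regression of the generated variable on $\hat\mu$ without an intercept has slope
$\alpha$, which is zero under equidispersion.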


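. * Overdispersion test against V(y|x) = E(y|x) + a*{E(y|x)^2}
. qui poisson docvis $xlist, vce(robust)

. predict muhat, n

. qui generate ystar = ((docvis-muhat)^2 - docvis)/muhat

. regress ystar muhat, noconstant noheader
------------------------------------------------------------------------------
       ystar | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       muhat |   .7047319   .1035926     6.80   0.000     .5016273    .9078365
------------------------------------------------------------------------------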


The outcome indicates the presence of significant overdispersion. One way to
model this feature of the data is to use the NB model. More common is to simply
use poisson with the vce(robust) option.


Pearson and deviance goodness-of-fit tests 


The Stata postestimation command estat gof provides two goodness-of-fit tests 
after the poisson command, though not after other count regression commands. 


Pearson's test statistic is for testing variance-mean equality and equals
$\sum_{i=1}^N (y_i - \hat\mu_i)^2/\hat\mu_i$, where $\hat\mu_i = \exp(x_i'\hat\beta_P)$. The deviance statistic is a measure
used in the generalized linear models literature that is defined in section 13.8.3.


We apply the command to the Poisson regression model reported above. 


. * Pearson and deviance measures after Poisson 
. qui poisson docvis $xlist, nolog 


. estat gof 
Deviance goodness-of-fit = 18395.14 
Prob > chi2(3669) = 0.0000 
Pearson goodness-of-fit = 23147.38 
Prob > chi2(3669) = 0.0000 


Both statistics strongly reject the Poisson model as a model of the conditional 
distribution. 


Nonetheless, the Poisson model is fine for modeling the conditional mean, 
provided the conditional mean is correctly specified. But inference needs to be 
based on robust standard errors. 


Coefficient interpretation and MEs 


Section 13.7 discusses coefficient interpretation and ME estimation, both in
general and for the exponential conditional mean, $\exp(x'\beta)$.


The coefficients can be interpreted as semielasticities because the conditional
mean function is of exponential form; see section 13.7.3. Thus, the coefficient of
educyr of 0.030 can be interpreted as one more year of education being
associated with a 3.0% increase in the number of doctor visits. The irr option of
poisson produces exponentiated coefficients, $e^{\hat\beta}$, that can be given a
multiplicative interpretation. Thus, one more year of education is associated with
doctor visits increasing by the multiple $e^{0.030} \simeq 1.030$.


The ME of a unit change in a continuous regressor, $x_j$, equals
$\partial E(y|x)/\partial x_j = \beta_j \exp(x'\beta)$, which depends on the evaluation point, x. From
section 13.7.2, there are three standard ME measures.


It can be shown that the average marginal effect (AME) for the Poisson model
with an intercept equals $\hat\beta_j \bar{y}$. For example, one more year of education is
associated with 0.02956 × 6.823 = 0.2017 additional doctor visits. The same
result, along with a confidence interval, can be obtained by using the margins,
dydx() command.
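

The result that the AME equals the coefficient times the sample mean of the
dependent variable follows from the Poisson first-order conditions. A short
derivation, assuming the model includes an intercept and that $x_j$ is a
continuous regressor entering the index linearly:

```latex
\begin{aligned}
\widehat{\mathrm{AME}}_j &= \frac{1}{N}\sum_{i=1}^N \hat\beta_j \exp(x_i'\hat\beta)
  = \hat\beta_j \, \frac{1}{N}\sum_{i=1}^N \hat\mu_i , \\
\sum_{i=1}^N (y_i - \hat\mu_i) = 0
  \;&\Rightarrow\; \frac{1}{N}\sum_{i=1}^N \hat\mu_i = \bar{y}
  \;\Rightarrow\; \widehat{\mathrm{AME}}_j = \hat\beta_j \bar{y} .
\end{aligned}
```

The zero-sum condition on the residuals is the first-order condition for the
intercept, so the result does not hold for a model fit without a constant.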


To ensure that MEs for binary regressors are calculated using the
finite-difference method, we use the factor-variable i. operator in defining the
model to be fit by the poisson command. And to obtain the ME with respect to age,
which enters quadratically, we use the factor-variable ## operator. We obtain


. * Poisson AMEs
. qui poisson docvis i.private i.medicaid c.age##c.age educyr i.actlim
>     totchr, vce(robust)

. margins, dydx(*)

Average marginal effects                                 Number of obs = 3,677
Model VCE: Robust

Expression: Predicted number of events, predict()
dy/dx wrt:  1.private 1.medicaid age educyr 1.actlim totchr

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   1.private |   .9701906   .2473149     3.92   0.000     .4854622    1.454919
  1.medicaid |   .6830664   .4153252     1.64   0.100     -.130956    1.497089
         age |   .0385842   .0172075     2.24   0.025     .0048581    .0723103
      educyr |   .2016526   .0337805     5.97   0.000     .1354441    .2678612
    1.actlim |   1.295942   .2850588     4.55   0.000     .7372367    1.854647
      totchr |   1.694685   .0908883    18.65   0.000     1.516547    1.872823
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.


For example, one more year of education is associated with 0.202 additional
doctor visits. The output also provides confidence intervals for the ME. The ME at
the mean (MEM) is calculated with the atmeans option of margins, and the ME at a
representative value (MER) is calculated with the at() option.
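

The three ME measures can be compared side by side. The following is a
minimal sketch on simulated data; the variables x, d, and y and the parameter
values are hypothetical. The final line checks the AME of the continuous
regressor against the coefficient times the sample mean of y.

```stata
* Sketch: AME, MEM, and MER after Poisson regression (simulated data)
clear
set obs 2000
set seed 10101
generate x = rnormal()                       // hypothetical continuous regressor
generate byte d = runiform() < 0.4           // hypothetical binary regressor
generate y = rpoisson(exp(0.5 + 0.3*x + 0.2*d))
quietly poisson y x i.d, vce(robust)
margins, dydx(*)                             // AME
margins, dydx(*) atmeans                     // MEM
margins, dydx(*) at(x=1)                     // MER: x fixed at 1, d as observed
summarize y, meanonly
display "AME of x check: b[x]*ybar = " _b[x]*r(mean)
```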


20.3.3 NB2 model 


The NB2 model with a quadratic variance function is consistent with
overdispersion generated by a Poisson-gamma mixture (see section 20.2.2), but
it can also be considered simply as a more flexible functional form for
overdispersed count data.


The NB2 model MLE, denoted by $\hat\beta_{NB2}$, maximizes the log likelihood based on
the probability mass function (20.2), where again $\mu = \exp(x'\beta)$, whereas $\alpha$ is
simply a constant parameter. The estimators $\hat\beta_{NB2}$ and $\hat\alpha_{NB2}$ are the solution to
the K + 1 nonlinear equations corresponding to the ML first-order conditions


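$$\begin{aligned}
\sum_{i=1}^N \frac{y_i - \mu_i}{1 + \alpha\mu_i}\, x_i &= 0 \\
\sum_{i=1}^N \left[ \frac{1}{\alpha^2}\left\{ \ln(1 + \alpha\mu_i) - \sum_{j=0}^{y_i - 1} \frac{1}{j + \alpha^{-1}} \right\} + \frac{y_i - \mu_i}{\alpha(1 + \alpha\mu_i)} \right] &= 0
\end{aligned} \tag{20.7}$$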
The K-element $\beta$ equations, the first line in (20.7), are in general different from
(20.4) and are sometimes harder to solve using the iterative algorithms. Very
large or small values of $\alpha$ can generate numerical instability, and convergence of
the algorithm is not guaranteed.


Analogous to the discussion for the Poisson MLE, consistency of the NB2 MLE
requires that the left-hand side of the first equation in (20.7) have expected
value 0. A sufficient condition is again that $E(y_i|x_i) = \exp(x_i'\beta)$. So
consistency of the NB2 MLE requires only that the functional form for the
conditional mean be correctly specified.


While the NB2 and Poisson models have the same functional form for the 
conditional mean, the fitted probability distribution of the NB2 can be quite 
different from that of the Poisson because of different models for the conditional 
variance. If the data are indeed overdispersed, then the NB2 model is preferred if 
the goal is to model the probability distribution and not just the conditional 
mean. 


The NB2 model is not a panacea. There are other reasons for overdispersion,
including misspecification due to restriction to an exponential conditional mean.
Alternative models are presented in sections 20.4-20.6.


The partial syntax for the MLE for the NB model is similar to that for the
poisson command:


nbreg depvar [indepvars] [if] [in] [weight] [, options]


The default fits an NB2 model, and the dispersion(constant) option fits an NB1
model.


As noted in section 13.3.1, we use robust standard errors unless there is
reason not to do so. In practice, with independent data, the difference between
default standard errors and those obtained using the vce(robust) option is not
great, because overdispersion is reasonably modeled, and is much less than the
difference following Poisson regression. If data are clustered, then the
vce(cluster clustvar) option should be used.


NB2 model results 


Given the presence of considerable overdispersion in our data, the NB2 model 
should be considered. We obtain 


. * NB2: Standard negative binomial with robust standard errors
. nbreg docvis $xlist, vce(robust) nolog

Negative binomial regression                    Number of obs =   3,677
                                                Wald chi2(7)  =  725.69
Dispersion: mean                                Prob > chi2   =  0.0000
Log pseudolikelihood = -10589.339               Pseudo R2     =  0.0352

------------------------------------------------------------------------------
             |               Robust
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .1640928   .0368851     4.45   0.000     .0917993    .2363864
    medicaid |    .100337   .0567414     1.77   0.077    -.0108742    .2115482
         age |   .2941294   .0646326     4.55   0.000     .1674518    .4208069
        age2 |  -.0019282   .0004283    -4.50   0.000    -.0027677   -.0010886
      educyr |   .0286947   .0049211     5.83   0.000     .0190494      .03834
      actlim |   .1895376   .0393922     4.81   0.000     .1123304    .2667449
      totchr |   .2776441    .013244    20.96   0.000     .2516864    .3036019
       _cons |  -10.29749   2.424128    -4.25   0.000    -15.04869   -5.546284
-------------+----------------------------------------------------------------
    /lnalpha |  -.4452773   .0377998                     -.5193636    -.371191
-------------+----------------------------------------------------------------
       alpha |   .6406466   .0242163                       .594899    .6899121
------------------------------------------------------------------------------


The parameter estimates are all within 15% of those for the Poisson MLE and are 
often much closer than this. The robust standard errors are very similar to those 
from Poisson regression; in practice, there is often little efficiency improvement 
in moving from Poisson to NB estimation of overdispersed count data. The 
parameters and MEs are interpreted in the same way as for a Poisson model 
because both models have the same conditional mean. 


The NB2 estimate of the overdispersion parameter of 0.64 is similar to the
0.70 from the auxiliary regression used in testing for overdispersion. Because
$\alpha > 0$ necessarily, a test of overdispersion is a one-sided test that strongly rejects
$H_0$ because the Wald statistic equals 0.6406/0.0242 = 26.5 > 1.645 at level
0.05. If default standard errors are used, then the computer output also includes
an LR test of $H_0: \alpha = 0$; an NB example is given in section 11.4.1.


The pseudo-$R^2$ is 0.035 compared with the 0.130 for the Poisson model. This
difference, a seemingly worse fit than for the Poisson model, is because the
pseudo-$R^2$ is not directly comparable across classes of models, here NB2 and
Poisson.


More directly comparable is the squared correlation between fitted and actual 
counts. We obtain 


. * NB2: Squared correlation between y and yhat 
. predict ynbhat, n 


. qui correlate docvis ynbhat 


. display "Squared correlation between y and yhat = " r(rho)^2 
Squared correlation between y and yhat = .14979846 


This is similar to the 0.153 for the Poisson model, so the two models provide a 
similar fit for the conditional mean. The real advantage of the NB2 model is in 
fitting probabilities because this relies on specification of the density and not just 
the conditional mean. 


20.3.4 Fitted probabilities for Poisson and NB2 models 


To get more insight into the improvement in the fit, we should compare what the 
parameter estimates from the Poisson and the NB2 models imply for the fitted 
probability distribution of docvis. 


Fitted probabilities for each observation and average fitted probabilities 


Given an estimated parametric model, such as the Poisson or NB2, fitted
probabilities for each individual and for a particular count value k, denoted
$\hat{p}_i(k)$, can be generated using the pr() option of the predict postestimation
command. The same option can give predicted probabilities for a range of count
values.


Average predicted probabilities


$$\bar{p}(k) = \frac{1}{N} \sum_{i=1}^N \hat{p}_i(k), \quad k = 0, 1, 2, \ldots \tag{20.8}$$


can be obtained using the margins command with the predict(pr()) option.


For example, command predict phat2, pr(2) generates variable $\hat{p}_i(2)$ for
each observation, and command margins, predict(pr(2)) reports $\bar{p}(2)$ along
with a standard error and 95% confidence interval.
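

A minimal sketch of these two commands on simulated data follows. The data
and variable names are hypothetical; the range form pr(0,5) is included to show
how a probability over a range of counts is requested.

```stata
* Sketch: fitted probabilities after nbreg (simulated data)
clear
set obs 5000
set seed 10101
generate x  = rnormal()                      // hypothetical regressor
generate nu = rgamma(2, 0.5)                 // gamma heterogeneity: mean 1, variance 0.5
generate y  = rpoisson(nu*exp(1 + 0.5*x))
quietly nbreg y x
predict phat2, pr(2)                         // Pr(y = 2 | x_i) for each observation
predict phat0to5, pr(0,5)                    // Pr(0 <= y <= 5 | x_i)
summarize phat2 phat0to5
margins, predict(pr(2))                      // average of phat2, with standard error
```

The mean of phat2 reported by summarize should match the margin reported by
margins, which additionally supplies the standard error.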


We additionally present some convenient community-contributed commands. 


The countfit command to compare fitted and actual cell frequencies 


The community-contributed countfit command (Long and Freese 2014)
compares the average predicted probabilities $\bar{p}(k)$ defined in (20.8) with the
actual cell frequencies, for a range of values of k. The prm option fits the Poisson
model, and the nbreg option fits the NB2 model. Additional options control the
amount of output produced by the command. In particular, the maxcount(#)
option sets the maximum count for which predicted probabilities are evaluated;
the default is maxcount(9).


For the Poisson model, we obtain 


. * Poisson: Sample versus avg predicted probabilities of y = 0, 1, ..., 5
. countfit docvis $xlist, maxcount(5) prm nograph noestimates nofit
Comparison of Mean Observed and Predicted Count

           Maximum       At      Mean
Model     Difference    Value   |Diff|
---------------------------------------------
PRM          0.102        0      0.045

PRM: Predicted and actual probabilities

Count    Actual    Predicted    |Diff|     Pearson
--------------------------------------------------
0         0.109      0.007      0.102     5168.233
1         0.085      0.030      0.056      387.868
2         0.097      0.063      0.034       69.000
3         0.091      0.095      0.005        0.789
4         0.092      0.116      0.024       17.861
5         0.072      0.121      0.049       72.441
--------------------------------------------------
Sum       0.547      0.432      0.269     5716.192


The Poisson model seriously underestimates the probability mass at low counts. 
In particular, the predicted probabilities at 0 and 1 counts are 0.007 and 0.030 
compared with sample frequencies of 0.109 and 0.085. 


For the NB2 model, which allows for overdispersion, we obtain 


. * NB2: Sample versus average predicted probabilities of y = 0, 1, ..., 5
. countfit docvis $xlist, maxcount(5) nbreg nograph noestimates nofit
Comparison of Mean Observed and Predicted Count

           Maximum       At      Mean
Model     Difference    Value   |Diff|
---------------------------------------------
NBRM        -0.023        1      0.010

NBRM: Predicted and actual probabilities

Count    Actual    Predicted    |Diff|     Pearson
--------------------------------------------------
0         0.109      0.091      0.018       12.708
1         0.085      0.108      0.023       17.288
2         0.097      0.105      0.008        2.270
3         0.091      0.096      0.005        1.086
4         0.092      0.085      0.007        2.333
5         0.072      0.074      0.001        0.072
--------------------------------------------------
Sum       0.547      0.559      0.062       35.757


The fit is now much better. The greatest discrepancy is for y = 1, with a 
predicted probability of 0.108, which exceeds the sample frequency of 0.085. 


The comparison confirms that the NB2 model provides a much better fit of the
probabilities than the Poisson model (even though for the conditional mean, the
MEs are similar for the two models).


The final column, marked Pearson, gives N times $(\mathrm{Diff})^2/\mathrm{Predicted}$,
where Diff is the difference between average fitted and empirical frequencies,
for each value of docvis up to that given by the maxcount() option. Although
these values are a good rough indicator of goodness of fit, caution should be
exercised in using these numbers as the basis of a Pearson chi-squared
goodness-of-fit test because the fitted probabilities are functions of estimated
coefficients; see Cameron and Trivedi (2005, 266).


The chi2gof command to test difference between fitted and actual cell frequencies 


The community-contributed chi2gof command of Manjon and Martinez (2014) 
performs the correct chi-squared goodness-of-fit test that allows for estimation 
error in obtaining the fitted probabilities. 


Applying this to the preceding NB2 example, we obtain 


. * NB2: Formal chisquare goodness of fit test
. qui nbreg docvis $xlist

. chi2gof, cells(0, 1, 2, 3, 4, 5) table
Chi-square Goodness-of-Fit Test for NegBin Model:

Chi-square chi2(6) = 41.82
Prob>chi2          = 0.00

                                         Fitted
Cells        Abs. Freq.   Rel. Freq.   Rel. Freq.   Abs. Dif.
-------------------------------------------------------------
[0 to 0]         401        .1091        .0913        .0178
[1 to 1]         314        .0854        .1079        .0225
[2 to 2]         358        .0974        .1054        .0081
[3 to 3]         334        .0908        .0962        .0053
[4 to 4]         339        .0922        .0849        .0073
[5 to 5]         266        .0723        .0735        .0012
6 or more       1665        .4528        .4408         .012


The same actual and fitted relative frequencies are given, as expected. 
Controlling for estimation error has led to a modest increase in the chi-squared 
test statistic from 35.76 to 41.82. The NB2 model is rejected, though as already 
noted it is a substantial improvement on the Poisson model. 


The prcounts command to compute average fitted probabilities and cumulative probabilities 


The community-contributed prcounts command (Long and Freese 2014)
computes cumulative average fitted probabilities in addition to the average fitted
predicted probabilities that are given by the community-contributed countfit
command.


We use the max(3) option to reduce the amount of output.


. * NB2: Cumulative fitted probabilities and ave fitted probs of y = 0 to max()
. qui nbreg docvis $xlist, vce(robust)

. prcounts erpr, max(3)

. summarize erp*

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    erprrate |      3,677    6.890034    3.486562   2.078925   41.31503
     erprpr0 |      3,677    .0912936    .0446796   .0056767   .2667141
     erprpr1 |      3,677    .1079217    .0436249   .0085383   .2377842
     erprpr2 |      3,677    .1054288    .0343211   .0105349   .1739022
     erprpr3 |      3,677    .0961651     .024424   .0120494   .1266245
-------------+---------------------------------------------------------
     erprcu0 |      3,677    .0912936    .0446796   .0056767   .2667141
     erprcu1 |      3,677    .1992154    .0881782   .0142149   .5044982
     erprcu2 |      3,677    .3046441    .1220802   .0247498   .6784004
     erprcu3 |      3,677    .4008093    .1455686   .0367992   .7962972
    erprprgt |      3,677    .5991907    .1455686   .2037028   .9632007


The output begins with the erprrate variable, which is the fitted mean and has
an average value of 6.890, close to the sample mean of 6.823. The
erprpr0-erprpr3 variables are predictions of $\Pr(y_i = j)$, $j = 0, 1, 2, 3$, that have
averages of 0.091, 0.108, 0.105, and 0.096, the same as given by the countfit
command. The erprcu0-erprcu3 variables are the corresponding cumulative
predicted probabilities, with averages of 0.091, 0.199, 0.305, and 0.401.


The prvalue command to predict probabilities at a given value of the regressors 


The community-contributed prvalue command (Long and Freese 2014) predicts
probabilities for given values of the regressors, computed using $\hat{p}(k|x = x^*)$.


As an example, we obtain predicted probabilities for a person with private
insurance and access to Medicaid, with other regressors set to their sample
means. The prvalue command, with options used to minimize the length of
output, following the nbreg command, yields


. * NB2: Predicted NB2 probabilities at x = x* of y = 0, 1, ..., 5
. qui nbreg docvis $xlist, vce(robust)

. prvalue, x(private=1 medicaid=1) max(5) brief

nbreg: Predictions for docvis

                          95% Conf. Interval
  Rate:            7.34   [ 6.5004,   8.1796]
  Pr(y=0|x):     0.0660   [ 0.0563,   0.0758]
  Pr(y=1|x):     0.0850   [ 0.0742,   0.0958]
  Pr(y=2|x):     0.0898   [ 0.0802,   0.0994]
  Pr(y=3|x):     0.0879   [ 0.0802,   0.0955]
  Pr(y=4|x):     0.0826   [ 0.0771,   0.0882]
  Pr(y=5|x):     0.0758   [ 0.0722,   0.0793]


These predicted probabilities at a specific value of the regressors are within 30%
of the average predicted probabilities for the NB2 model previously computed by
using the countfit command.


Predicted probabilities and their MEs using the margins command 


Predicted probabilities, with standard errors and confidence intervals, can be 
obtained using the margins command. For example, 


. * NB2: Predicted probabilities and their MEs using margins
. margins, predict(pr(0)) predict(pr(1)) at(private=1 medicaid=1) atmeans 


gives the preceding predicted probabilities at y = 0 and y = 1. Adding the 
dydx() option will give the marginal effects. 


Discussion 


The assumption of gamma heterogeneity underlying the mixture interpretation of 
the NB2 model is very convenient, but there are other alternatives. For example, 
one can assume that heterogeneity is lognormally distributed. This specification 
does not lead to an analytical expression for the mixture distribution and requires 
an estimation method involving one-dimensional numerical integration. 
Estimation using the gsem command is given in section 23.6.4. 


20.3.5 Generalized NB model 


The generalized NB model is an extension of the NB2 model that permits
additional parameterization of the overdispersion parameter, $\alpha$, in (20.2), where
it is simply a positive constant in the NB2 model. The overdispersion parameter
can then vary across individuals, and the same variable can affect both the
location and the scale parameters of the distribution, complicating the
computation of MEs. Alternatively, the model may be specified such that different
variables may separately affect the location and scale of the distribution.


Even though, in principle, flexibility is desirable, such models are currently
not widely used. The parameters of the model can be fit using the gnbreg
command, which has a syntax similar to that of nbreg, with the addition of the
lnalpha() option to specify the variables in the model for $\ln(\alpha)$.


We parameterize $\ln(\alpha)$ for the dummy variables female and bh
(black/Hispanic).


. * Generalized negative binomial with alpha parameterized
. gnbreg docvis $xlist, lnalpha(female bh) vce(robust) nolog

Generalized negative binomial regression        Number of obs =   3,677
                                                Wald chi2(7)  =  703.04
                                                Prob > chi2   =  0.0000
Log pseudolikelihood = -10576.261               Pseudo R2     =  0.0347

------------------------------------------------------------------------------
             |               Robust
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
docvis       |
     private |   .1571795   .0367377     4.28   0.000      .085175     .229184
    medicaid |   .0860199   .0539415     1.59   0.111    -.0197035    .1917433
         age |     .30188   .0638297     4.73   0.000     .1767761    .4269838
        age2 |  -.0019838   .0004233    -4.69   0.000    -.0028135   -.0011542
      educyr |   .0284782   .0049346     5.77   0.000     .0188066    .0381498
      actlim |   .1875403   .0387357     4.84   0.000     .1116198    .2634607
      totchr |   .2761519   .0132766    20.80   0.000     .2501303    .3021735
       _cons |  -10.54756    2.39369    -4.41   0.000    -15.23911   -5.856013
-------------+----------------------------------------------------------------
lnalpha      |
      female |  -.1871933   .0732959    -2.55   0.011    -.3308506   -.0435359
          bh |   .3103148   .0949842     3.27   0.001     .1241493    .4964803
       _cons |  -.4119142   .0582034    -7.08   0.000    -.5259907   -.2978377
------------------------------------------------------------------------------


There is some improvement in the log likelihood relative to the NB2 model. The
dispersion is greater for blacks and Hispanics and smaller for females. However,
these two variables could also have been introduced into the conditional mean
function. The decision to let a variable affect $\alpha$ rather than $\mu$ can be difficult to
justify.


20.3.6 Nonlinear least-squares estimation 


Suppose one wants to avoid any parametric specification of the conditional
variance function. Instead, one may fit the exponential mean model by nonlinear
least squares (NLS) and use a robust estimate of the VCE. For count data, this
estimator is likely to be less efficient than the Poisson MLE because the Poisson
MLE explicitly models the intrinsic heteroskedasticity of count data, whereas the
NLS estimator is based on homoskedastic errors.


The NLS objective function is


$$Q(\beta) = \sum_{i=1}^N \{y_i - \exp(x_i'\beta)\}^2$$


Section 13.3.6 provides an NLS application, using the nl command, for doctor
visits in a related dataset.


A practical complication not mentioned in section 13.3.6 is that if most
observations are 0, then the NLS estimator can encounter numerical problems.
The NLS estimator can be shown to solve


$$\sum_{i=1}^N \{y_i - \exp(x_i'\beta)\} \exp(x_i'\beta) x_i = 0$$


Compared with (20.4) for the Poisson MLE, there is an extra multiple, $\exp(x_i'\beta)$,
which can lead to numerical problems if most counts are 0. NLS estimation using
the nl command yields


. * NLS with exponential conditional mean
. nl (docvis = exp({xb: $xlist} + {constant})), vce(robust) nolog

Nonlinear regression                            Number of obs =      3,677
                                                R-squared     =     0.5436
                                                Adj R-squared =     0.5426
                                                Root MSE      =   6.804007
                                                Res. dev.     =   24528.25

------------------------------------------------------------------------------
             |               Robust
      docvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
 /xb_private |   .1235144   .0395179     3.13   0.002     .0460351    .2009937
/xb_medicaid |   .0856747   .0649936     1.32   0.188    -.0417525    .2131018
     /xb_age |   .2951153   .0720509     4.10   0.000     .1538516    .4363789
    /xb_age2 |  -.0019481   .0004771    -4.08   0.000    -.0028836   -.0010127
  /xb_educyr |   .0309924   .0051192     6.05   0.000     .0209557    .0410291
  /xb_actlim |   .1916735   .0413705     4.63   0.000      .110562    .2727851
  /xb_totchr |   .2191967   .0151021    14.51   0.000     .1895874     .248806
   /constant |  -10.12438   2.713159    -3.73   0.000    -15.44383   -4.804931
------------------------------------------------------------------------------


The NLS coefficient estimates are within about 25% of the Poisson and NB2 ML
estimates, with similar differences for the implied MEs. The robust standard
errors for the NLS estimates are between 4% and 20% higher than those for the
Poisson MLE, confirming the expected efficiency loss.


Unless there is good reason to do otherwise, for count data, it is better to use 
Poisson or NB2 MLEs than to use the NLS estimator. 
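

The comparison between the two estimators can be repeated on simulated data.
The sketch below is illustrative only: the variable names, the parameter names
b0 and b1, and the data-generating values are hypothetical. Both commands
estimate the same exponential conditional mean, so the coefficient estimates
should be close, with the NLS robust standard errors typically somewhat larger.

```stata
* Sketch: NLS versus Poisson for an exponential conditional mean (simulated data)
clear
set obs 5000
set seed 10101
generate x  = rnormal()                      // hypothetical regressor
generate nu = rgamma(1, 1)                   // gamma heterogeneity induces overdispersion
generate y  = rpoisson(nu*exp(1 + 0.5*x))
poisson y x, vce(robust) nolog               // Poisson quasi-MLE
nl (y = exp({b1}*x + {b0})), vce(robust) nolog   // NLS with the same mean function
```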


20.3.7 Censored and truncated count regression 


Counts are censored if data on the dependent variable are missing for some
specific count values, though data on regressors are still available even for
observations with missing counts. The leading example is cases where counts are
top coded, say, at 10 or more. A second common example is when counts are
collected in ranges such as 0, 1, 2, 3-5, 6-10, and more than 10.


The cpoisson command provides ML estimates of the Poisson model with
left-censoring or right-censoring. A simple example of right-censored (at 10 or
more) Poisson regression with regressor x is cpoisson docvis x, ul(10).
Consistency requires that the underlying complete conditional distribution of y is
the Poisson. At the time of writing, no similar command is available for the NB.


Counts are truncated if data on the dependent variable are missing for some
specific count values, and additionally data on regressors are also unavailable for
those observations. The leading example is left-truncation at zero. For example,
only individuals who participate in an activity may be surveyed to find out how
many times they participated.


The tpoisson command, which supersedes the ztp command, provides ML
estimates of the Poisson model with left-truncation or right-truncation. And the
tnbreg command, which supersedes the ztnb command, provides ML estimates
of the NB2 model with left-truncation. The syntax and options for these
commands are the same as those for the poisson and nbreg commands. In
particular, the default for tnbreg is to estimate the parameters of a zero-truncated
NB2 model; an example with left-truncation is given in section 20.4.
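

A minimal sketch of both cases on simulated data follows. The variables x, y,
and ycens and the parameter values are hypothetical. The counts are Poisson by
construction, so both estimators should recover the coefficients used to
generate the data.

```stata
* Sketch: right-censored and zero-truncated Poisson regression (simulated data)
clear
set obs 5000
set seed 10101
generate x = rnormal()                       // hypothetical regressor
generate y = rpoisson(exp(1 + 0.5*x))
* Right-censoring: counts of 10 or more are top coded at 10
generate ycens = min(y, 10)
cpoisson ycens x, ul(10) nolog
* Left-truncation at zero: observations with y = 0 are not in the sample
tpoisson y x if y > 0, ll(0) nolog
```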


20.4 Hurdle model 


In this section and the next, we consider two types of mixture models that 
involve new specifications of both the conditional mean and variance of the 
distributions. 


20.4.1 Hurdle model 


The hurdle model, or two-part model, relaxes the assumption that the zeros
and the positives come from the same data-generating process. The zeros are
determined by the density $f_1(\cdot)$, so that $\Pr(y = 0) = f_1(0)$ and
$\Pr(y > 0) = 1 - f_1(0)$. The positive counts come from the truncated
density $f_2(y|y > 0) = f_2(y)/\{1 - f_2(0)\}$, which is multiplied by
$\Pr(y > 0)$ to ensure that probabilities sum to 1. Thus, suppressing regressors
for notational simplicity, we see


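$$f(y) = \begin{cases} f_1(0) & \text{if } y = 0 \\[4pt] \dfrac{1 - f_1(0)}{1 - f_2(0)}\, f_2(y) & \text{if } y \geq 1 \end{cases}$$


This specializes to the standard model only if $f_1(\cdot) = f_2(\cdot)$. Although the
motivation for this model is to handle excess zeros, it is also capable of
modeling too few zeros.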


A hurdle model has the interpretation that it reflects a two-stage 
decision-making process, each part being a model of one decision. The two 
parts of the model are functionally independent. Therefore, ML estimation of 
the hurdle model can be achieved by separately maximizing the two terms in 
the likelihood, one corresponding to the zeros and the other to the positives. 
This is straightforward. The first part uses the full sample, but the second 
part uses only the positive count observations. 
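

The two-step estimation described above can be sketched with official Stata
commands on simulated data. Everything in the example is hypothetical: the
variable names, the logit and truncated-Poisson parameter values, and the use
of redrawing to generate zero-truncated Poisson counts.

```stata
* Sketch: two-part estimation of a logit / zero-truncated Poisson hurdle model
clear
set obs 5000
set seed 10101
generate x  = rnormal()                                  // hypothetical regressor
generate byte pos = runiform() < invlogit(0.2 + 0.5*x)   // 1 if the hurdle is crossed
generate mu = exp(0.5 + 0.3*x)
generate ypos = rpoisson(mu)
quietly count if ypos == 0
while r(N) > 0 {                             // redraw zeros until none remain
    quietly replace ypos = rpoisson(mu) if ypos == 0
    quietly count if ypos == 0
}
generate y = pos*ypos                        // zeros from part 1, positives from part 2
logit pos x, nolog                           // part 1: full sample
tpoisson y x if y > 0, ll(0) nolog           // part 2: positive counts only
```

Because the two parts are functionally independent, the hurdle log likelihood
is the sum of the log likelihoods reported by the two commands.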


For certain types of activities, such a specification is easy to rationalize.
For example, in a model that explains the number of cigarettes smoked per
day, the survey may include both smokers and nonsmokers. One model
determines whether one smokes, and a second model determines the number
of cigarettes (or packs of cigarettes) smoked given that at least one is
smoked.


As an illustration, we obtain draws from a hurdle model as follows. The
positives are generated by Poisson(2) truncated at 0. One way to obtain
these truncated draws is to draw from Poisson(2) and then replace any zero
draw for any observation by a nonzero draw, until all draws are nonzero.
This can be shown to be equivalent to the accept-reject method for drawing
random variates that is defined in, for example, Cameron and Trivedi (2005,
414). This method is simple but is computationally inefficient if a high
fraction of draws are truncated at zero. To then obtain draws from the hurdle
model, we randomly replace some of the truncated Poisson draws with
zeros. A draw is replaced with a probability of $\pi$ and kept with a probability
$1 - \pi$. We set $\pi = 1 - (1 - e^{-2})/2 \simeq 0.568$ because this can be shown to
yield a mean of 1 for the hurdle model draws. The proportion of positives is
then 0.432. We have


. * Hurdle: Pr(y=0)=pi and Pr(y=k)=(1-pi) x Poisson(2) truncated at 0
. clear

. qui set obs 10000

. set seed 10101                        // Set the seed !

. scalar pi=1-(1-exp(-2))/2             // Probability y=0

. generate xhurdle = 0

. scalar minx = 0

. while minx == 0 {
      generate xph = rpoisson(2)
      qui replace xhurdle = xph if xhurdle==0
      drop xph
      qui summarize xhurdle
      scalar minx = r(min)
  }

. replace xhurdle = 0 if runiform() < pi
(5,729 real changes made)

. qui histogram xhurdle, discrete xtitle("Hurdle Poisson")
>     saving(histphurdle, replace)

. summarize xhurdle

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     xhurdle |     10,000       .9891    1.413146          0          8


The setup is such that the random variable has a mean of 1. From the 
summary statistics, this is approximately the case: the sample mean is 
0.989. The model has induced overdispersion because the variance is 
$1.4131^2 = 1.997 > 1$. 
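
The claim that this choice of the zero probability gives a mean of 1 can be 
checked from the moments of the zero-truncated Poisson. The following is a 
sketch that uses only the setup above; the symbol $\lambda = 2$ for the 
Poisson parameter is notation introduced here for convenience. 

```latex
\begin{align*}
E(y)   &= (1-\pi)\,\frac{\lambda}{1-e^{-\lambda}}
        = \frac{1-e^{-2}}{2}\times\frac{2}{1-e^{-2}} = 1, \\
E(y^2) &= (1-\pi)\,\frac{\lambda+\lambda^2}{1-e^{-\lambda}}
        = \frac{1-e^{-2}}{2}\times\frac{6}{1-e^{-2}} = 3, \\
\operatorname{Var}(y) &= E(y^2) - \{E(y)\}^2 = 3 - 1 = 2 .
\end{align*}
```

The population variance of 2 is in line with the sample variance of 1.997. 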


The hurdle model changes the conditional mean specification. Under the 
hurdle model, the conditional mean is 


$$E(y|\mathbf{x}) = \Pr(y_1 > 0|\mathbf{x}_1) \times E_{y_2>0}(y_2|y_2 > 0, \mathbf{x}_2) \tag{20.9}$$ 


and the two terms on the right are determined by the two respective parts of 
the model. Because of the form of the conditional mean specification, the 
calculation of MEs, $\partial E(y|\mathbf{x})/\partial x_j$, is more complicated. 
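
The source of the complication can be seen by applying the product rule to 
(20.9). As a sketch, for a continuous regressor $x_j$ that is assumed here to 
appear in both $\mathbf{x}_1$ and $\mathbf{x}_2$: 

```latex
\frac{\partial E(y|\mathbf{x})}{\partial x_j}
  = \frac{\partial \Pr(y_1>0|\mathbf{x}_1)}{\partial x_j}\,
    E_{y_2>0}(y_2|y_2>0,\mathbf{x}_2)
  + \Pr(y_1>0|\mathbf{x}_1)\,
    \frac{\partial E_{y_2>0}(y_2|y_2>0,\mathbf{x}_2)}{\partial x_j}
```

The first term is the effect operating through the zero-versus-positive 
decision, and the second is the effect operating through the positive counts. 
If $x_j$ appears in only one part of the model, the other term is zero. 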


20.4.2 Variants of the hurdle model 


Any binary outcome model can be used for modeling the zero-versus- 
positive outcome. Logit is a popular choice. The second part can use any 
truncated parametric count density, for example, Poisson or NB. In 
application, the covariates in the hurdle part that models the zero/one 
outcome need not be the same as those that appear in the truncated part, 
although in practice they are often the same. 


The hurdle model is widely used, and the hurdle NB model is quite 
flexible. The main drawback is that the model is not very parsimonious. A 
competitor to the hurdle model is the zero-inflated class of models, presented 
in section 20.6.2. 


Two variants of the hurdle count model are provided by the community- 
contributed hplogit and hnblogit commands (Hilbe and Hardin 2005a,b). 
They use the logit model for the first part and either the zero-truncated 
Poisson (ZTP) or the zero-truncated negative binomial (ZTNB) model for the 
second part. The partial syntax is 


hplogit depvar [varlist] [if] [in] [, options] 


where options include robust and nolog, as well as many of those for the 
regression command. Note that these commands restrict the regressors to be 
the same in each part and do not support factor notation. The churdle 
command fits linear and exponential hurdle models. 


20.4.3 Application of the hurdle model 


We implement ML estimation of the hurdle model with two-step estimation 
using official Stata commands, rather than the community-contributed 
commands, to demonstrate how to proceed if regressors differ in the two 
parts. 


The first step involves estimating the parameters of a binary outcome 
model, popular choices being binary logit or probit estimated by using logit 
or probit. The second step estimates the parameters of a ZTP or ZTNB model, 
using the tpoisson command or the tnbreg command. 


We first use logit. We do not need to transform docvis to a binary 
variable before running the logit because Stata does this automatically. This 
is easy to verify by doing the transformation and then running the logit. 
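
A minimal sketch of that check follows. The variable name anyvisit and the 
stored-estimate names are hypothetical, and the definition of the global 
xlist is an assumption inferred from the regressors listed in the output 
below. 

```stata
* Sketch: logit treats any nonzero, nonmissing docvis as a positive outcome
use mus220mepsdocvis, clear
global xlist private medicaid age age2 educyr actlim totchr  // assumed definition

* Explicit 0/1 indicator (anyvisit is a hypothetical name)
generate byte anyvisit = (docvis > 0) if !missing(docvis)

quietly logit docvis $xlist, vce(robust)
estimates store logit_raw
quietly logit anyvisit $xlist, vce(robust)
estimates store logit_bin

* The two columns of coefficients and standard errors should coincide
estimates table logit_raw logit_bin, b(%10.7f) se(%10.7f)
```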


. * Hurdle logit-nb2 model manually: (1) Logit for zeros 
. qui use mus220mepsdocvis, clear 

. logit docvis $xlist, nolog vce(robust) 

Logistic regression                                     Number of obs =  3,677 
                                                        Wald chi2(7)  = 281.39 
                                                        Prob > chi2   = 0.0000 
Log pseudolikelihood = -1040.3258                       Pseudo R2     = 0.1788 

------------------------------------------------------------------------------ 
             |               Robust 
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .6586978   .1292037     5.10   0.000     .4054631    .9119324 
    medicaid |   .0554225   .1738421     0.32   0.750    -.2853017    .3961468 
         age |   .5428779   .2290737     2.37   0.018     .0939016    .9918542 
        age2 |  -.0034989   .0015365    -2.28   0.023    -.0065104   -.0004873 
      educyr |    .047035   .0153971     3.05   0.002     .0168571    .0772128 
      actlim |   .1623927   .1540586     1.05   0.292    -.1395567    .4643421 
      totchr |   1.050562   .0703947    14.92   0.000     .9125907    1.188533 
       _cons |  -20.94163    8.49238    -2.47   0.014    -37.58638   -4.296868 
------------------------------------------------------------------------------ 


The second-step regression is based only on the sample with positive 
observations for docvis. 


. * Hurdle logit-nb2 model manually: (2a) restrict to positives only 
. summarize docvis if docvis > 0 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
      docvis |      3,276    7.657814    7.415095          1        144 


Dropping zeros from the sample has raised the mean and lowered the 
standard deviation of docvis. 


The parameters of a ZTNB model are then estimated by using tnbreg. 


. * Hurdle logit-nb2 model manually: (2b) ZTNB for positives 
. tnbreg docvis $xlist if docvis>0, nolog vce(robust) 

Truncated negative binomial regression                  Number of obs =  3,276 
Truncation point = 0                                    Wald chi2(7)  = 474.34 
Dispersion: mean                                        Prob > chi2   = 0.0000 
Log pseudolikelihood = -9452.899                        Pseudo R2     = 0.0262 

------------------------------------------------------------------------------ 
             |               Robust 
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
     private |   .1095567   .0382086     2.87   0.004     .0346692    .1844442 
    medicaid |   .0972309   .0589629     1.65   0.099    -.0183342     .212796 
         age |   .2719032   .0671328     4.05   0.000     .1403254     .403481 
        age2 |  -.0017959    .000445    -4.04   0.000     -.002668   -.0009238 
      educyr |   .0265974   .0050938     5.22   0.000     .0166137    .0365812 
      actlim |   .1955384    .040658     4.81   0.000     .1158503    .2752266 
      totchr |   .2226969   .0135761    16.40   0.000     .1960882    .2493056 
       _cons |   -9.19017   2.517163    -3.65   0.000    -14.12372   -4.256621 
-------------+---------------------------------------------------------------- 
    /lnalpha |  -.5259629   .0544868                      -.632755   -.4191708 
-------------+---------------------------------------------------------------- 
       alpha |    .590986   .0322009                      .5311265    .6575919 
------------------------------------------------------------------------------ 


A positively signed coefficient in the logit model means that the 
corresponding regressor increases the probability of a positive observation. 
In the second part, a positive coefficient means that, conditional on a 
positive count, the corresponding variable increases the value of the count. 
The results show that all the variables except medicaid and actlim have 
statistically significant coefficients and that they affect both the outcomes in 
the same direction. 
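
Because the two parts are estimated separately, the log likelihood of the 
hurdle model is the sum of the two log pseudolikelihoods reported above, 
-1040.3258 + (-9452.899) = -10493.2. The following sketch retrieves this sum 
from the stored results; the scalar names are hypothetical, and the 
definition of xlist is again an assumption inferred from the output. 

```stata
* Sketch: hurdle log likelihood as the sum of the two parts
use mus220mepsdocvis, clear
global xlist private medicaid age age2 educyr actlim totchr  // assumed definition

quietly logit docvis $xlist, vce(robust)
scalar ll_logit = e(ll)              // log pseudolikelihood, binary part

quietly tnbreg docvis $xlist if docvis > 0, vce(robust)
scalar ll_ztnb = e(ll)               // log pseudolikelihood, ZTNB part

scalar ll_hurdle = ll_logit + ll_ztnb
display "Hurdle NB2 log likelihood = " %10.3f ll_hurdle
```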


For this example with a common set of regressors in both parts of the 
model, the community-contributed hnblogit command can instead be used. 
Then, 


. * Same hurdle model fit using the community-contributed hnblogit command 
. hnblogit docvis $xlist, vce(robust) 
  (output omitted) 


yields the same parameter estimates as the separate estimation of the two 
components of the model. 


It is straightforward to calculate MEs for the two components separately. 
To ensure that MEs for binary regressors are calculated using the 
finite-difference method, we need to use the factor-variable i. operator in 
defining the model to be fit. And when we replace age and age2 in the 
regressor list with c.age and c.age#c.age, the margins command will give the 
correct ME with respect to age; see section 13.7.11. 


The sample AMEs for whether a doctor visit occurs are obtained by using 
margins, dydx(*) after logit. 


. * Hurdle logit-nb2: AMEs for first part 
. global xlistfv i.private i.medicaid c.age##c.age educyr i.actlim totchr 

. qui logit docvis $xlistfv, vce(robust) 

. margins, dydx(*) 

Average marginal effects                                 Number of obs = 3,677 
Model VCE: Robust 

Expression: Pr(docvis), predict() 
dy/dx wrt:  1.private 1.medicaid age educyr 1.actlim totchr 

------------------------------------------------------------------------------ 
             |            Delta-method 
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
   1.private |   .0552042   .0106987     5.16   0.000     .0342351    .0761733 
  1.medicaid |   .0046029   .0142787     0.32   0.747    -.0233829    .0325886 
         age |   .0024871   .0009149     2.72   0.007     .0006939    .0042803 
      educyr |   .0039512   .0012887     3.07   0.002     .0014255    .0064769 
    1.actlim |   .0133166   .0122861     1.08   0.278    -.0107637     .037397 
      totchr |    .088253   .0055746    15.83   0.000     .0773269    .0991791 
------------------------------------------------------------------------------ 
Note: dy/dx for factor levels is the discrete change from the base level. 


Having supplemental private insurance and an additional chronic condition 
are big determinants of whether a doctor visit occurs. 


The sample AMEs for the second truncated part are obtained by using 


. * Hurdle logit-nb2: AMEs for second part 
. qui tnbreg docvis $xlistfv if docvis>0, vce(robust) 

. margins, dydx(*) 

Average marginal effects                                 Number of obs = 3,276 
Model VCE: Robust 

Expression: Predicted number of events, predict() 
dy/dx wrt:  1.private 1.medicaid age educyr 1.actlim totchr 

------------------------------------------------------------------------------ 
             |            Delta-method 
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
   1.private |   .7878003   .2745538     2.87   0.004     .2496848    1.325916 
  1.medicaid |   .7220008    .453203     1.59   0.111    -.1662607    1.610262 
         age |   .0280808   .0186977     1.50   0.133     -.008566    .0647277 
      educyr |    .191417   .0371488     5.15   0.000     .1186067    .2642273 
    1.actlim |   1.432509   .3066782     4.67   0.000      .831431    2.033587 
      totchr |    1.60271   .1039497    15.42   0.000     1.398972    1.806448 
------------------------------------------------------------------------------ 
Note: dy/dx for factor levels is the discrete change from the base level. 


Medicaid and having an activity limitation are now bigger determinants than 
in the first part of the model. 


Ideally, we additionally compute the ME with respect to the conditional 
mean given in (20.9) because change in a regressor may change both the 
logit and the truncated count components of the model. This is possible for 
the Poisson-logit model using the gsem command presented in section 23.6, 
followed by a more complicated application of the margins, dydx(*) 
command that uses the expression() option. 


. * Hurdle logit-poisson: AMEs for combined model 
. generate visit = (docvis > 0) 

. generate novisit = 1 - visit 

. qui gsem (docvis <- $xlistfv, family(poisson, ltruncated(novisit))) 
>     (visit <- $xlistfv, logit), vce(robust) 

. margins, dydx(*) expression( (exp(predict(eta))/(1-exp(-exp(predict(eta))))) 
>     * predict(equation(visit)) ) 

Average marginal effects                                 Number of obs = 3,677 
Model VCE: Robust 

Expression: (exp(predict(eta))/(1-exp(-exp(predict(eta))))) * 
            predict(equation(visit)) 
dy/dx wrt:  1.private 1.medicaid age educyr 1.actlim totchr 

------------------------------------------------------------------------------ 
             |            Delta-method 
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
   1.private |   .9540639   .2448103     3.90   0.000     .4742445    1.433883 
  1.medicaid |   .5987787   .4099716     1.46   0.144    -.2047509    1.402308 
         age |   .0352164   .0170652     2.06   0.039     .0017693    .0686635 
      educyr |   .1988132   .0333316     5.96   0.000     .1334844     .264142 
    1.actlim |   1.341224   .2841913     4.72   0.000     .7842192    1.898229 
      totchr |     1.8226   .0889377    20.49   0.000     1.648285    1.996915 
------------------------------------------------------------------------------ 
Note: dy/dx for factor levels is the discrete change from the base level. 


The gsem command fits two equations: the truncated Poisson with exponential 
conditional mean for docvis, as a special case of a left-truncated 
generalized linear model, and the logit model for visit. The expression() 
option of the margins command defines the quantity of interest, here the 
conditional mean given in (20.9). The term 
$\Pr(y_1 > 0|\mathbf{x}_1) = \Lambda(\mathbf{x}_1'\boldsymbol{\beta}_1)$ is 
computed by predict(equation(visit)) because the default prediction for 
the logit model for visit is the fitted probability that visit = 1. For the 
truncated Poisson, $E_{y_2>0}(y_2|y_2 > 0,\mathbf{x}_2) = 
e^{\mathbf{x}_2'\boldsymbol{\beta}_2}/\{1 - \exp(-e^{\mathbf{x}_2'\boldsymbol{\beta}_2})\}$, 
and for a generalized linear model, the predicted value of 
$\mathbf{x}_2'\boldsymbol{\beta}_2$ is stored in eta. 
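
The same conditional mean can be assembled by hand from the two separately 
estimated parts, which makes the formula concrete. The following is a sketch 
under stated assumptions: tpoisson is used in place of the gsem 
truncated-Poisson equation, and the variable names phat, xb2, and mu_hurdle 
are hypothetical. It gives fitted means only, not the AMEs or their standard 
errors. 

```stata
* Sketch: fitted hurdle conditional mean (20.9) from the two-step estimates
use mus220mepsdocvis, clear
global xlistfv i.private i.medicaid c.age##c.age educyr i.actlim totchr

* Part 1: Pr(y > 0 | x) from the logit
quietly logit docvis $xlistfv, vce(robust)
predict double phat, pr

* Part 2: linear index x'b from the zero-truncated Poisson
quietly tpoisson docvis $xlistfv if docvis > 0, vce(robust)
predict double xb2, xb

* E(y|x) = Pr(y>0|x) * exp(xb)/{1 - exp(-exp(xb))}
generate double mu_hurdle = phat * exp(xb2) / (1 - exp(-exp(xb2)))
summarize mu_hurdle docvis
```

Comparing the mean of mu_hurdle with the sample mean of docvis gives a quick 
check on the fitted model. 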


The AMEs for the Poisson-logit hurdle model are within 20% of those for 
the simpler Poisson model, a result that might be expected because only 11% 
of the sample has zero doctor visits. 


The gsem command cannot fit an NB-logit model. In this particular 
example, with relatively few observations with zero doctor visits, the ZTP 
regression estimates are similar to the ZTNB estimates, so the ME for the 
conditional mean (20.9) will be quite similar. If interest lies in predicting 
probabilities, however, then the NB-logit model should be used. 


The discussion of model selection is postponed to later in this chapter. 


20.5 Finite-mixture models 


The Poisson distribution is the natural starting point for modeling count data 
because from stochastic process theory, a pure Poisson point process 
generates an exponential distribution for the interval between occurrences of 
an event and a Poisson distribution for the number of events in a given 
interval. Poisson regression models introduce regressors to allow different 
individuals to come from Poisson processes with different means, but in 
practice there is still unobserved heterogeneity that leads to 
microeconometrics data typically not being equidispersed, even after 
inclusion of regressors. 


Richer parametric models for counts therefore explicitly introduce 
unobserved heterogeneity. The NB model introduces a heterogeneity variable, 
v, that is assumed to have a continuous distribution (gamma). This is an 
example of a continuous mixture model. 


An alternative approach instead uses a discrete representation of 
unobserved heterogeneity. This generates a class of models called 
finite-mixture models (FMMs), a particular subclass of latent-class models; 
see Deb (2007) and Cameron and Trivedi (2005, 678-679). 


20.5.1 FMM specification 

An FMM specifies that the density of $y$ is a linear combination of $m$ 
different densities, where the $j$th density is $f_j(y|\boldsymbol{\beta}_j)$, 
$j = 1, 2, \ldots, m$. Thus, an $m$-component finite mixture is 


$$f(y|\boldsymbol{\beta}, \boldsymbol{\pi}) = \sum_{j=1}^{m} \pi_j f_j(y|\boldsymbol{\beta}_j), \qquad 0 \le \pi_j \le 1, \qquad \sum_{j=1}^{m} \pi_j = 1$$ 


A simple example is a two-component ($m = 2$) Poisson mixture of 
Poisson($\mu_1$) and Poisson($\mu_2$). This may reflect the possibility that 
the sampled population contains two “types” of cases, whose $y$ outcomes are 
characterized by the distributions $f_1(y|\boldsymbol{\beta}_1)$ and 
$f_2(y|\boldsymbol{\beta}_2)$, which are assumed to have different moments. 
The mixing fraction, $\pi_1$, is in general an unknown parameter. In a more 
general formulation, it, too, can be parameterized as a function of the 
observed variables $\mathbf{z}$. 


20.5.2 Simulated finite-mixture sample with comparisons 


As an illustration, we generate a mixture of Poisson(0.5) and Poisson(5.5) 
in proportions 0.9 and 0.1, respectively. 


. * Finite Mixture: Poisson(.5) with prob .9 and Poisson(5.5) with prob .1 
. set seed 10101 // Set the seed ! 

. generate xp1= rpoisson(.5) 

. generate xp2= rpoisson(5.5) 

. summarize xp1 xp2 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
         xp1 |     10,000       .5007    .7041655          0          5 
         xp2 |     10,000      5.4776    2.332433          0         17 

. rename xp1 xpmix 

. qui replace xpmix = xp2 if runiform() > 0.9 

. qui histogram xpmix, discrete xtitle("Finite-mixture Poisson") 
>     saving(histp2mix, replace) 

. summarize xpmix 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
       xpmix |     10,000       .9696    1.730831          0         14 


The setup yields a random variable with a mean of 
$0.9 \times 0.5 + 0.1 \times 5.5 = 1$. But the data are overdispersed, with a 
variance in this sample of $1.731^2 = 2.996$. This dispersion is greater than 
those for the preceding generated data samples from Poisson, NB2, and hurdle 
models. 
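
The population moments of this mixture follow from the law of total 
variance. As a sketch, write $\pi_1 = 0.9$, $\mu_1 = 0.5$, and $\mu_2 = 5.5$, 
using the notation of section 20.5.1: 

```latex
\begin{align*}
E(y) &= \pi_1\mu_1 + (1-\pi_1)\mu_2 = 0.9\times 0.5 + 0.1\times 5.5 = 1, \\
\operatorname{Var}(y)
     &= \underbrace{\pi_1\mu_1 + (1-\pi_1)\mu_2}_{\text{within components}}
      + \underbrace{\pi_1(1-\pi_1)(\mu_1-\mu_2)^2}_{\text{between components}}
      = 1 + 0.09\times 25 = 3.25 .
\end{align*}
```

The between-component term is the source of the overdispersion. The sample 
variance of 2.996 is somewhat below the population value of 3.25, just as 
the sample mean of 0.970 is slightly below 1. 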


. * Finite Mixture: Relative frequencies 
. tabulate xpmix 

      xpmix |      Freq.     Percent        Cum. 
------------+----------------------------------- 
          0 |      5,459       54.59       54.59 
          1 |      2,794       27.94       82.53 
          2 |        725        7.25       89.78 
          3 |        253        2.53       92.31 
          4 |        166        1.66       93.97 
          5 |        188        1.88       95.85 
          6 |        157        1.57       97.42 
          7 |         89        0.89       98.31 
          8 |         79        0.79       99.10 
          9 |         42        0.42       99.52 
         10 |         27        0.27       99.79 
         11 |         14        0.14       99.93 
         12 |          4        0.04       99.97 
         14 |          3        0.03      100.00 
------------+----------------------------------- 
      Total |     10,000      100.00 


As for the NB2, the distribution has a long right tail. Although the component 
means are far apart, the mixture distribution is not bimodal; see the 
histogram in figure 20.1. This is because only 10% of the observations come 
from the high-mean distribution. 


It is helpful to view graphically the four distributions generated in this 
chapter: Poisson, NB2, hurdle, and finite mixture. All have the same mean of 
1, but they have different dispersion properties. The generated data were 
used to produce four histograms that we now combine into a single graph. 


. * Histograms of four different distributions, all with mean 1 
. graph combine histpois.gph histnb2.gph histphurdle.gph histp2mix.gph, 
>     iscale(1.1) ycommon xcommon 
>     title("Four different distributions with mean = 1") 


Figure 20.1. Four count distributions with the same mean 


It is helpful for interpretation to supplement this graph with summary 
statistics for the distributions: 


. * Means and standard deviations of four different distributions, all with mean 1 
. summarize xpois xnegbin xhurdle xpmix 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
       xpois |     10,000       .9989    1.004689          0          7 
     xnegbin |     10,000       .9982    1.410814          0         12 
     xhurdle |     10,000       .9891    1.413146          0          8 
       xpmix |     10,000       .9696    1.730831          0         14 


20.5.3 ML estimation of the FMM 


The components of the mixture may be assumed, for generality, to differ in 
all their parameters. This is a more flexible specification because all 
moments of the distribution depend upon $(\pi_j, \boldsymbol{\beta}_j, 
j = 1, \ldots, m)$. But such flexibility comes at the expense of parsimonious 
parameterization. More parsimonious formulations assume that only some 
parameters differ across the components, for example, the intercepts, and the 
remaining parameters are common to the mixture components. 


ML estimation of an FMM is computationally challenging because the log- 
likelihood function may be multimodal and not logconcave and because 
individual components may be hard to identify empirically. If numerical 
problems are encountered for a three-component model, for example, then a 
two-component model might be used to provide initial values for the 
estimation of the three-component model. The presence of outliers in the 
sample may cause further identification problems. 


The fmm prefix 


The fmm prefix enables ML estimation of finite-mixture count models. The 
command can be used to estimate mixtures of several continuous and count 
models. Section 14.2 covered the application of the fmm prefix to the normal 
linear regression model. In this chapter, we cover regression models for 
counts. 


The syntax for the standard version of this command is as follows, 

fmm # [if] [in] [weight] [, fmmopts]: model depvar indepvars [, options] 


where # refers to the number of components in the specification and model 
refers to the specification of the distribution. 


The syntax for the hybrid version of the model, which we explain and 
illustrate in section 20.5.8, is as follows: 


fmm [if] [in] [weight] [, fmmopts]: 
    (model depvar indepvars [, lcprob(varlist) options]) (component_2) ... 


The default setup assumes class probabilities do not vary over 
individuals. The command supports factor-variable operators and the vce() 
option with all the usual types of VCE. 


An important option is lcprob(varlist), which allows the $\pi_j$ to be 
parameterized as a function of the variables in varlist. A multinomial logit 
model is specified for the latent class probabilities, with normalization on the 
first component, so 


$$\pi_j = \frac{\exp(\mathbf{z}'\boldsymbol{\delta}_j)}{1 + \sum_{l=2}^{m} \exp(\mathbf{z}'\boldsymbol{\delta}_l)}, \qquad \boldsymbol{\delta}_1 = \mathbf{0}$$ 
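
As a sketch of how lcprob() can be used with the standard version of the 
prefix, the following lets the class probabilities depend on totchr. The 
choice of totchr as the class-probability regressor is purely illustrative 
(an assumption, not a specification used in this chapter), and convergence 
of such a model is not guaranteed. 

```stata
* Sketch: two-component Poisson FMM with class probabilities varying with totchr
use mus220mepsdocvis, clear
global xlist i.private i.medicaid c.age##c.age c.educyr i.actlim c.totchr

fmm 2, lcprob(totchr) vce(robust): poisson docvis $xlist

* Sample averages of the now observation-specific class probabilities
estat lcprob
```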


20.5.4 Application: Poisson FMM 


We apply the fmm prefix to the doctor-visit data, using both Poisson and NB2 
variants. 


In a two-component Poisson mixture, denoted by FMM2-P, each 
component is a Poisson distribution with a different mean, that is, 
Poisson$\{\exp(\mathbf{x}'\boldsymbol{\beta}_j)\}$, $j = 1, 2$, and the 
proportion $\pi_j$ of the sample comes from each subpopulation. This model 
will have $2K + 1$ unknown parameters, where $K$ is the number of exogenous 
variables in the model. For the two-component NB mixture, denoted by FMM2-NB, 
a similar interpretation applies, but now the overdispersion parameters also 
vary between subpopulations. This model has $2(K + 1) + 1$ unknown parameters. 


We first fit the FMM2-P model. 


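. * FMM two-component Poisson with constant probabilities 
. qui use mus220mepsdocvis, clear 

. global xlist i.private i.medicaid c.age##c.age c.educyr i.actlim c.totchr 

. fmm 2, nolog vce(robust): poisson docvis $xlist 

Finite mixture model                                     Number of obs = 3,677 
Log pseudolikelihood = -12100.185 

------------------------------------------------------------------------------ 
             |               Robust 
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
1.Class      |  (base outcome) 
-------------+---------------------------------------------------------------- 
2.Class      | 
       _cons |  -.5980831   .1171272    -5.11   0.000    -.8276481    -.368518 
------------------------------------------------------------------------------ 

Class:    1 
Response: docvis 
Model:    poisson 

------------------------------------------------------------------------------ 
             |               Robust 
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
docvis       | 
   1.private |   .2393558   .0695013     3.44   0.001     .1031357    .3755759 
  1.medicaid |   .0463821   .0884509     0.52   0.600    -.1269785    .2197427 
         age |  -.6233526   .1367728    -4.56   0.000    -.8914223   -.3552829 
 c.age#c.age |   .0045366   .0009492     4.78   0.000     .0026762    .0063971 
      educyr |   .0284599   .0078417     3.63   0.000     .0130905    .0438294 
    1.actlim |   .1723268   .0733314     2.35   0.019     .0285999    .3160537 
      totchr |   .3286694   .0215553    15.25   0.000     .2864218    .3709169 
       _cons |   21.35464   4.881451     4.37   0.000     11.78717    30.92211 
------------------------------------------------------------------------------ 

Class:    2 
Response: docvis 
Model:    poisson 

------------------------------------------------------------------------------ 
             |               Robust 
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
docvis       | 
   1.private |   .1566873   .0656687     2.39   0.017     .0279789    .2853957 
  1.medicaid |   .1924436   .1383488     1.39   0.164     -.078715    .4636022 
         age |   1.232368    .112787    10.93   0.000     1.011309    1.453426 
 c.age#c.age |  -.0085471   .0007425   -11.51   0.000    -.0100024   -.0070917 
      educyr |   .0219929   .0084774     2.59   0.009     .0053775    .0386082 
    1.actlim |   .1486859   .0825088     1.80   0.072    -.0130284    .3104003 
      totchr |   .1898829     .03185     5.96   0.000      .127458    .2523078 
       _cons |  -42.46506   4.251345    -9.99   0.000    -50.79755   -34.13258 
------------------------------------------------------------------------------ 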


The computer output begins with estimates of the mixture probabilities, fit as 
a logit model with normalization on the first class. In this example, the 
probabilities do not depend on regressors, so for all observations we obtain 
that the probability of being in the second latent class or component is 
$(1 - \pi_1) = \exp(-0.5981)/\{1 + \exp(-0.5981)\} = 0.3548$. 
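
This transformation can be reproduced directly from the stored coefficient. 
The following is a minimal sketch; it assumes the coefficient is referenced 
as _b[2.Class:_cons], the name reported by the coeflegend option, and the 
label pi2 is hypothetical. 

```stata
* Sketch: recover the class-2 probability from the logit-scale intercept
use mus220mepsdocvis, clear
global xlist i.private i.medicaid c.age##c.age c.educyr i.actlim c.totchr
quietly fmm 2, vce(robust): poisson docvis $xlist

display "Pr(class 2) = " invlogit(_b[2.Class:_cons])

* Same quantity with a delta-method standard error
nlcom (pi2: invlogit(_b[2.Class:_cons]))
```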


The computer output separates the parameter estimates for the two 
components. If the two latent classes differ a lot in their responses to the 
changes in the regressors, we would expect the parameters to also differ. In 
this example, the differentiation does not appear to be very sharp at the level 
of individual slope coefficients. But as we see below, this is misleading 
because the two components have substantially different mean numbers of 
doctor visits, due perhaps to different intercept estimates or the different 
quadratics in age. This leads to quite different MEs even though the slope 
parameters do not seem to be all that different. 


Distribution of fitted means for each component 


These classes are latent, so it is helpful to give them some interpretation. 


One natural interpretation is that classes differ in terms of the mean of 
their respective distributions; that is, 
$\exp(\mathbf{x}'\boldsymbol{\beta}_1) \neq \exp(\mathbf{x}'\boldsymbol{\beta}_2)$. 
To make this comparison, we generate fitted values by using the predict 
command. For the Poisson model, the predictions are 
$\widehat{y}_i^{\,j} = \exp(\mathbf{x}_i'\widehat{\boldsymbol{\beta}}_j)$, $j = 1, 2$. 


The predict postestimation command generates predictions for each 
observation of the two components that are stored as yhatp1 and yhatp2. 


. * FMM two-component Poisson: Predicted y's and histogram for each component 
. qui fmm 2, vce(robust): poisson docvis $xlist 

. predict yhatp* 
(option mu assumed) 

. summarize yhatp1 yhatp2 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
      yhatp1 |      3,677    5.050474    4.269029   .9507732   50.52482 
      yhatp2 |      3,677    11.65096    5.670928   .6697477   58.63944 

. qui histogram yhatp1, name(class_1, replace) 

. qui histogram yhatp2, name(class_2, replace) 

. qui graph combine class_1 class_2, iscale(1.2) rows(1) ycommon xcommon 
>     ysize(2.5) xsize(6) 


The summary statistics make explicit the implication of the mixture model. 
The first component has a relatively low mean number of doctor visits, 
around 5.05. The second component has a relatively high mean number of 
doctor visits, around 11.65. 


Figure 20.2 provides histograms for each component. Clearly, the second 
component experiences more doctor visits. 


Figure 20.2. Fitted values distribution, FMM2-P 


Marginal predicted means for each component 

The estat lcmean postestimation command gives the marginal predicted 
means for each component or latent class. 


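. * FMM two-component Poisson: Compute the component marginal predicted means 
. estat lcmean 

Latent class marginal means                              Number of obs = 3,677 

------------------------------------------------------------------------------ 
             |            Delta-method 
             |     Margin   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
1            | 
      docvis |   5.050474   .2713084    18.62   0.000     4.518719    5.582229 
-------------+---------------------------------------------------------------- 
2            | 
      docvis |   11.65096   .6255662    18.62   0.000     10.42488    12.87705 
------------------------------------------------------------------------------ 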


The marginal predicted means are simply the average of the fitted means 
$\exp(\mathbf{x}_i'\widehat{\boldsymbol{\beta}}_j)$ in each component. The 
estat lcmean output controls for estimator imprecision and includes standard 
errors and confidence intervals. 


Marginal predicted probabilities for each component 


The estat lcprob postestimation command gives the marginal predicted 
probabilities for each component or latent class. In this example, the 
probabilities do not vary over individuals, and from the initial output, we 
computed that the probability for the second component was 0.3548. 


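. * FMM two-component Poisson: Compute the component mean predicted probabilities 
. estat lcprob 

Latent class marginal probabilities                      Number of obs = 3,677 

-------------------------------------------------------------- 
             |            Delta-method 
             |     Margin   std. err.     [95% conf. interval] 
-------------+------------------------------------------------ 
       Class | 
          1  |   .6452176   .0268118      .5911008    .6958574 
          2  |   .3547824   .0268118      .3041426    .4088992 
-------------------------------------------------------------- 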


The mixing proportions $\pi_1$ and $1 - \pi_1$ equal 0.65 and 0.35. The 
probability-weighted average number of doctor visits over the two classes is 
$0.6452 \times 5.0505 + 0.3548 \times 11.6510 = 7.39$, which is close to the 
overall sample average of 6.82. 
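
The same weighted average can be formed observation by observation from the 
class-specific predictions. This is a sketch; the variable names mu_c1, 
mu_c2, pr_c1, pr_c2, and mu_mix are hypothetical. 

```stata
* Sketch: probability-weighted fitted mean for each observation
use mus220mepsdocvis, clear
global xlist i.private i.medicaid c.age##c.age c.educyr i.actlim c.totchr
quietly fmm 2, vce(robust): poisson docvis $xlist

predict double mu_c*, mu          // class-specific fitted means: mu_c1, mu_c2
predict double pr_c*, classpr     // class probabilities: pr_c1, pr_c2

generate double mu_mix = pr_c1*mu_c1 + pr_c2*mu_c2
summarize mu_mix docvis
```

Because the class probabilities are constant here, the mean of mu_mix equals 
the weighted average of the two latent class marginal means. 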


In summary, the FMM has the interpretation that the data are generated by 
two classes of individuals, the first of which accounts for about 65% of the 
population who are relatively low users of doctor visits and the second of 
which accounts for about 35% of the population who are high users of 
doctor visits. 


ME 


The margins, dydx(*) command gives the AME for each regressor, averaged 
across the two components. 


We have 


. * FMM two-component Poisson: MEs 
. margins, dydx(*) 

Average marginal effects                                 Number of obs = 3,677 
Model VCE: Robust 

Expression: Predicted mean (# doctor visits), using class probabilities, 
            predict(mu outcome(docvis)) 
dy/dx wrt:  1.private 1.medicaid age educyr 1.actlim totchr 

------------------------------------------------------------------------------ 
             |            Delta-method 
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
   1.private |   1.428504   .3638522     3.93   0.000     .7153672    2.141642 
  1.medicaid |   1.001208   .8365001     1.20   0.231    -.6383022    2.640718 
         age |   .1974592   .0472302     4.18   0.000     .1048897    .2900288 
      educyr |   .1836498   .0502039     3.66   0.000      .085252    .2820477 
    1.actlim |   1.192519   .4101055     2.91   0.004     .3887271    1.996311 
      totchr |   1.855912   .1470048    12.62   0.000     1.567788    2.144036 
------------------------------------------------------------------------------ 
Note: dy/dx for factor levels is the discrete change from the base level. 


MEs for each component can be obtained using the class() option. For 
example, the command margins, dydx(*) predict(mu class(1)) 
predict(mu class(2)) yields AMEs that differ considerably across the two 
classes. 


20.5.5 Application: NB2 FMM 


The fmm prefix with the nbreg model can be used to estimate a mixture 
distribution with NB2 components. 


This model involves additional overdispersion parameters that can 
potentially create problems for convergence of the numerical algorithm. This 
may happen if an overdispersion parameter is too close to zero. Further, the 
number of parameters increases linearly with the number of components, 
and the likelihood function quickly becomes high dimensional when the 
specification includes many regressors. 


The two-component NB2 FMM estimation yields 


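. * FMM two-component negative binomial with constant probabilities 
. fmm 2, nolog vce(robust): nbreg docvis $xlist 

Finite mixture model                                     Number of obs = 3,677 
Log pseudolikelihood = -10534.237 

------------------------------------------------------------------------------ 
             |               Robust 
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
1.Class      |  (base outcome) 
-------------+---------------------------------------------------------------- 
2.Class      | 
       _cons |   .2286522   .6670507     0.34   0.732    -1.078743    1.536048 
------------------------------------------------------------------------------ 

Class:    1 
Response: docvis 
Model:    nbreg, dispersion(mean) 

------------------------------------------------------------------------------ 
             |               Robust 
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
docvis       | 
   1.private |   .3292254   .1225044     2.69   0.007     .0891212    .5693296 
  1.medicaid |   .1319335     .14372     0.92   0.359    -.1497525    .4136196 
         age |   .3818359   .1483224     2.57   0.010     .0911293    .6725425 
 c.age#c.age |  -.0024401   .0009896    -2.47   0.014    -.0043796   -.0005005 
      educyr |   .0383352   .0119704     3.20   0.001     .0148738    .0617967 
    1.actlim |    .066328   .1344501     0.49   0.622    -.1971894    .3298454 
      totchr |   .5013939   .0755416     6.64   0.000      .353335    .6494528 
       _cons |  -15.15966   5.686649    -2.67   0.008    -26.30529   -4.014036 
-------------+---------------------------------------------------------------- 
/docvis      | 
     lnalpha |  -.9786601   .4212083                     -1.804213    -.153107 
------------------------------------------------------------------------------ 

Class:    2 
Response: docvis 
Model:    nbreg, dispersion(mean) 

-------------------------------------------------------------- 
             |               Robust 
             | Coefficient  std. err.     [95% conf. interval] 
-------------+------------------------------------------------ 
docvis       | 
   1.private |   .0957578   .0516769     -.0055271    .1970427 
  1.medicaid |   .0901739   .0891725     -.0846011    .2649488 
         age |   .2691574   .0890089      .0947032    .4436116 
 c.age#c.age |  -.0017901   .0005887      -.002944   -.0006363 
      educyr |   .0241906   .0078689      .0087679    .0396133 
    1.actlim |    .219569   .0550544      .1116643    .3274738 
      totchr |   .1758912   .0560492      .0660368    .2857456 
       _cons |  -8.645314    3.41952     -15.34745   -1.943178 
-------------+------------------------------------------------ 
/docvis      | 
     lnalpha |  -.7697485    .093542     -.9530873   -.5864096 
-------------------------------------------------------------- 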


The maximized value of the log likelihood (-10,534) is much higher than 
that for the two-component Poisson (-12,100) (section 20.5.4). It is lower, 
however, than that for the hurdle NB2 (-10,493) (section 20.4.3), even 
though there are three more parameters in the mixture model. As expected, 
there is evidence of overdispersion in both components; see lnalpha. 
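
When fitted models differ in their number of parameters, it can help to 
tabulate the log likelihoods alongside the parameter counts. The following 
sketch does this for the two mixture models with estimates stats, which also 
reports AIC and BIC; the stored-estimate names fmm2p and fmm2nb are 
hypothetical, and formal model selection is left to the later discussion in 
this chapter. 

```stata
* Sketch: log likelihoods and parameter counts for the two FMMs
use mus220mepsdocvis, clear
global xlist i.private i.medicaid c.age##c.age c.educyr i.actlim c.totchr

quietly fmm 2, vce(robust): poisson docvis $xlist
estimates store fmm2p
quietly fmm 2, vce(robust): nbreg docvis $xlist
estimates store fmm2nb

* ll(model), df, AIC, and BIC for each stored model
estimates stats fmm2p fmm2nb
```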


A comparison of the means of the probabilities and predicted means of 
the two components is possible by using the estat lcprob and estat 
lcmean commands. 


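. * FMM two-component NB2: Component marginal predicted probabilities and means 
. estat lcprob 

Latent class marginal probabilities                      Number of obs = 3,677 

-------------------------------------------------------------- 
             |            Delta-method 
             |     Margin   std. err.     [95% conf. interval] 
-------------+------------------------------------------------ 
       Class | 
          1  |   .4430847   .1646019      .1771106    .7462561 
          2  |   .5569153   .1646019      .2537439    .8228894 
-------------------------------------------------------------- 

. estat lcmean 

Latent class marginal means                              Number of obs = 3,677 

------------------------------------------------------------------------------ 
             |            Delta-method 
             |     Margin   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
1            | 
      docvis |   4.477236   1.053625     4.25   0.000     2.412169    6.542303 
-------------+---------------------------------------------------------------- 
2            | 
      docvis |   8.828413   .6283404    14.05   0.000     7.596889    10.05994 
------------------------------------------------------------------------------ 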


The two classes are somewhat different in probability of occurrence. The 
first class accounts for about 44% of the population who are relatively low 
users of doctor visits (average visits 4.5), and the second class accounts for 
about 56% of the population who are high users (average visits 8.8). 


After the fmm prefix is executed, we can run the predict postestimation 
command to generate component class-specific predicted conditional means 
for each observation. Histograms then provide useful graphical summaries of 
the spread, overlap, and separation between component distributions. This 
analysis was presented for the FMM Poisson example, so it is not repeated 
here. 


20.5.6 Latent class posterior probabilities 


The finite mixture model can be interpreted as a latent class model. In this 
interpretation, an individual comes from exactly one of m classes. 


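The probability of individual $i$ coming from class $j$ is then not simply 
that individual's mixing probability $\pi_{ij}$. Instead, the mixing 
probability is weighted by the density $f_j(y_i|\boldsymbol{\beta}_j)$. The 
posterior probability for each observation, the estimated probability that an 
observation belongs to a particular latent class, is 


$$\widehat{\Pr}(i \in \text{class } j \mid y_i) = \frac{\widehat{\pi}_{ij}\, f_j(y_i|\widehat{\boldsymbol{\beta}}_j)}{\sum_{l=1}^{m} \widehat{\pi}_{il}\, f_l(y_i|\widehat{\boldsymbol{\beta}}_l)}$$ 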


Note that the posterior probabilities vary with individual characteristics, 
even if the mixing probabilities do not. 


This calculation is done using the predict, classposteriorpr 
postestimation command. 
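
To make the calculation concrete, the posterior probability of the first 
class in the two-component Poisson model can be built by hand from the class 
probabilities and the class-specific Poisson densities. This is a sketch 
under stated assumptions: the variable names mu, prior, f1, f2, post1, and 
check are hypothetical, and for extreme counts the densities could underflow, 
in which case post1 would be missing. 

```stata
* Sketch: posterior class probabilities by hand for the two-component Poisson FMM
use mus220mepsdocvis, clear
global xlist i.private i.medicaid c.age##c.age c.educyr i.actlim c.totchr
quietly fmm 2, vce(robust): poisson docvis $xlist

predict double mu*, mu             // class-specific means: mu1, mu2
predict double prior*, classpr     // mixing probabilities: prior1, prior2

* Poisson density of the observed count under each class
generate double f1 = poissonp(mu1, docvis)
generate double f2 = poissonp(mu2, docvis)

* Mixing probability weighted by the density, then normalized
generate double post1 = prior1*f1 / (prior1*f1 + prior2*f2)

* Compare with the built-in prediction
predict double check*, classposteriorpr
summarize post1 check1
```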


The following example computes the posterior probabilities of being in 
each class for the two-component Poisson and NB2 models. 


. * Posterior probabilities of each latent class for Poisson and NB2 
. qui fmm 2, vce(robust): poisson docvis $xlist 

. predict postpois*, classposteriorpr 

. qui fmm 2, vce(robust): nbreg docvis $xlist 

. predict postnb*, classposteriorpr 

. sum postpois* postnb* 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
   postpois1 |      3,677    .6452176    .4104135          0          1 
   postpois2 |      3,677    .3547824    .4104135   8.23e-19          1 
     postnb1 |      3,677    .4430855    .2403444   1.03e-25   .8714064 
     postnb2 |      3,677    .5569145    .2403444   .1285937          1 


For the Poisson FMM, the posterior probabilities of being in the first latent 
class range from 0.0 to 1.0, whereas the constant mixing probability 
$\pi_1 = 0.6452$ for all observations. The average of the 3,677 posterior 
probabilities does equal $\pi_1 = 0.6452$. For the NB2 FMM, the posterior 
probabilities of being in the first class range from 0 to 0.8714. 


Kernel density graphs of the latent class posterior probabilities provide 
useful visual information about the separation between the two classes. 


. * Kernel densities for posterior probabilities of each class for Poisson and NB2 
. kdensity postpois1, lwidth(medthick) title(" ") name(postpois1, replace) 


. kdensity postpois2, lwidth(medthick) title(" ") name(postpois2, replace) 
. kdensity postnb1, lwidth(medthick) title(" ") name(postnb1, replace) 
. kdensity postnb2, lwidth(medthick) title(" ") name(postnb2, replace) 


. graph combine postpois1 postpois2 postnb1 postnb2, rows(2) ycommon
> xcommon ysize(5) xsize(6) iscale(0.8) 


Figure 20.3 provides kernel density estimates of the posterior probabilities
of being in each latent class for, respectively, the two-component Poisson
and NB2 models. The smaller the overlap between the densities, the better the
separation of the distributions.

Figure 20.3. Densities of posterior probabilities for FMM Poisson and NB2


20.5.7 Testing the equality of coefficients of mixture components 


We can formally test the equality of coefficients in the two mixture
components. This test is potentially helpful in interpreting the results in
terms of regressors that contribute significantly to the separation of latent
classes.


To run the test, we first replay the results of the fmm command with the 
coeflegend option, which provides the list of Stata’s internal reference 
names of the coefficients. 


. * Use coeflegend to find complete names of model coefficients
. qui fmm 2, vce(robust): poisson docvis $xlist 


. fmm, coeflegend 


Finite mixture model Number of obs = 3,677 
Log pseudolikelihood = -12100.185 


Coefficient Legend 


1.Class (base outcome) 
2.Class 
_cons -.5980831 _b[2.Class:_cons] 
Class: 1 
Response: docvis 
Model: poisson 


Coefficient Legend 


docvis 
1.private .2393558 _b[docvis:1.private#1.Class] 
1.medicaid .0463821 _b[docvis:1.medicaid#1.Class] 
age -.6233526 _b[docvis:1.Class#c.age] 
c.age#c.age .0045366 _b[docvis:1.Class#c.age#c.age]
educyr .0284599 _b[docvis:1.Class#c.educyr] 
1.actlim .1723268 _b[docvis:1.actlim#1.Class] 
totchr .3286694 _b[docvis:1.Class#c.totchr] 
_cons 21.35464 _b[docvis:1.Class] 
Class: 2 
Response: docvis 
Model: poisson 


Coefficient Legend 


docvis
1.private .1566873 _b[docvis:1.private#2.Class]
1.medicaid .1924436 _b[docvis:1.medicaid#2.Class]
age 1.232368 _b[docvis:2.Class#c.age]
c.age#c.age -.0085471 _b[docvis:2.Class#c.age#c.age]
educyr .0219929 _b[docvis:2.Class#c.educyr]
1.actlim .1486859 _b[docvis:1.actlim#2.Class]
totchr .1898829 _b[docvis:2.Class#c.totchr]
_cons -42.46506 _b[docvis:2.Class]


We consider a test of whether the coefficient of totchr is the same
across the two components, using the test command and using the
contrast command.


. test (_b[docvis:1.Class#c.totchr] = _b[docvis:2.Class#c.totchr])

 ( 1)  [docvis]1bn.Class#c.totchr - [docvis]2.Class#c.totchr = 0

           chi2(  1) =   12.10
         Prob > chi2 =    0.0005

. contrast c.totchr#a.Class, equation(docvis)

Contrasts of marginal linear predictions

Margins: asbalanced

                          df        chi2      P>chi2
docvis
  Class#c.totchr           1       12.10      0.0005

                    Contrast   Std. err.     [95% conf. interval]
docvis
  Class#c.totchr
        (1 vs 2)    .1387865     .039905     .0605742    .2169988


The two different commands lead to identical outcomes. The null 
hypothesis of equality of coefficients is easily rejected. 
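
The agreement between the two commands is easy to check by hand because, for
a single restriction, the Wald chi-squared statistic is the square of the
estimated difference divided by its standard error. The sketch below uses the
contrast and standard error reported in the output above; the scalar names
are arbitrary.

```stata
* Wald test of equal totchr coefficients, computed from the contrast output
* Estimated difference (class 1 minus class 2) and its robust standard error
scalar cdiff = .1387865
scalar cse   = .039905

* Wald statistic and p-value from the chi-squared(1) distribution
scalar wald = (cdiff/cse)^2
scalar pval = chi2tail(1, wald)
display "chi2(1) = " %6.2f wald
display "p-value = " %6.4f pval
```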


20.5.8 Mixtures with varying mixing probability 


We next illustrate how additional flexibility of the mixture specification can 
be obtained by specifying the mixing proportion as a logit model that 
necessarily restricts the component probabilities to the [0, 1] interval. In 
general, however, it is difficult to separate the effect of a regressor on the 
latent class mean from that on the latent class probability, and it is common 
to specify constant probabilities as in the examples to date. 


Here we use the lcprob() option to allow the component probabilities to
vary with actlim and totchr. In our specification, these variables are then
excluded from the conditional mean function; this restriction is not essential.


. * FMM two-component Poisson: lcprob() option to allow varying mixture
> proportions
. fmm 2, lcprob(actlim totchr) nolog vce(robust): poisson docvis private
> medicaid age age2 educyr


Finite mixture model                              Number of obs = 3,677
Log pseudolikelihood = -12037.83

                             Robust
             Coefficient   std. err.       z    P>|z|    [95% conf. interval]
1.Class      (base outcome)
2.Class
      actlim    .3811733    .1052316    3.62    0.000    .1749231    .5874235
      totchr    .6328327    .0418741   15.11    0.000     .550761    .7149043
       _cons    -2.23104    .1079409  -20.67    0.000     -2.4426   -2.019479

Class: 1
Response: docvis
Model: poisson

                             Robust
             Coefficient   std. err.       z    P>|z|    [95% conf. interval]
docvis
     private    .2326691     .057033    4.08    0.000    .1208865    .3444517
    medicaid     .323233    .0993596    3.25    0.001    .1284916    .5179743
         age    .3376277    .0953861    3.54    0.000    .1506745    .5245809
        age2    -.002163    .0006333   -3.42    0.001   -.0034042   -.0009219
      educyr    .0412185     .007761    5.31    0.000    .0260073    .0564297
       _cons   -12.49424    3.566101   -3.50    0.000   -19.48367   -5.504816

Class: 2
Response: docvis
Model: poisson

                             Robust
             Coefficient   std. err.       z    P>|z|    [95% conf. interval]
docvis
     private    .1214211    .0611611    1.99    0.047    .0015476    .2412946
    medicaid    .2310901    .1269448    1.82    0.069    -.017717    .4798973
         age    .1773114    .1115603    1.59    0.112   -.0413426    .3959655
        age2   -.0011763    .0007359   -1.60    0.110   -.0026186    .0002661
      educyr    .0291441    .0077511    3.76    0.000    .0139523     .044336
       _cons   -4.383622    4.194363   -1.05    0.296   -12.60442    3.837178


The first set of output gives estimates for a logit model of the probability of
the second component. Given the positive coefficients, having an activity
limitation and more chronic conditions will push an individual into the
second latent class, which on average has a higher number of doctor visits.
The maximized log likelihood for this specification (-12,038) is higher than
for the model with constant latent class probability (-12,100).
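
To see what these logit coefficients imply, the following sketch converts
them into class 2 probabilities for two individuals. The coefficient values
are those reported in the output above; the two covariate profiles are
hypothetical and chosen only for illustration.

```stata
* Implied probability of class 2 from the logit estimates reported above
scalar b_actlim = .3811733
scalar b_totchr = .6328327
scalar b_cons   = -2.23104

* Hypothetical individual 1: no activity limitation, no chronic conditions
scalar p_low  = invlogit(b_cons + b_actlim*0 + b_totchr*0)
* Hypothetical individual 2: activity limitation, four chronic conditions
scalar p_high = invlogit(b_cons + b_actlim*1 + b_totchr*4)

display "Pr(class 2), no limitation, 0 conditions: " %5.3f p_low
display "Pr(class 2), limitation, 4 conditions:    " %5.3f p_high
```

The logit transformation keeps both probabilities inside the unit interval
while letting the less healthy profile have a much larger chance of belonging
to the high-use class.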


The postestimation summary statistics are as follows: 


. * FMM two-component Poisson: Postestimation summary statistics 
. estat lcmean 


Latent class marginal means                       Number of obs = 3,677

                          Delta-method
                 Margin    std. err.       z    P>|z|    [95% conf. interval]
1
      docvis   3.374006    .1267247   26.62    0.000     3.12563    3.622381
2
      docvis    14.6311    .5826224   25.11    0.000    13.48918    15.77301

. estat lcprob

Latent class marginal probabilities               Number of obs = 3,677

                          Delta-method
                 Margin    std. err.     [95% conf. interval]
Class
           1   .6918441    .0177926     .6559241     .725583
           2   .3081559    .0177926      .274417    .3440759

. estimates stats

Akaike’s information criterion and Bayesian information criterion

       Model        N   ll(null)   ll(model)    df        AIC        BIC
           .    3,677          .   -12037.83    15   24105.66   24198.81

Note: BIC uses N = number of observations. See [R] BIC note.


The latent class marginal means for class 1 and class 2 are, respectively, 
3.37 and 14.63 visits. The latent class probabilities are probabilities from the 
logit model averaged over the 3,677 individuals. 


20.5.9 Mixtures with different sets of regressors in each component 


The specification of the finite mixture model may incorporate a priori 
restrictions if they are available. An exclusion restriction in which one or 
more regressors affect one marginal mean but not the other is an example. 


In the illustrative example of an FMM two-component Poisson model that 
follows, educyr (years of education) affects one component but not the 
other. Additionally, the mixing proportion is parameterized as a function of 
actlim and totchr, as before. The imposed restriction can be tested using 
the LR test. 


. * FMM two-component Poisson: Different regressors in the two components
. fmm, lcprob(actlim totchr) nolog vce(robust):
> (poisson docvis private age age2 educyr)
> (poisson docvis private age age2)

Finite mixture model                              Number of obs = 3,677
Log pseudolikelihood = -12105.773

                             Robust
             Coefficient   std. err.       z    P>|z|    [95% conf. interval]
1.Class      (base outcome)
2.Class
      actlim    .3620486    .1010836    3.58    0.000    .1639284    .5601688
      totchr    .6582347    .0384961   17.10    0.000    .5827838    .7336857
       _cons   -2.276095    .1145746  -19.87    0.000   -2.500657   -2.051533

Class: 1
Response: docvis
Model: poisson

                             Robust
             Coefficient   std. err.       z    P>|z|    [95% conf. interval]
docvis
     private    .2109806    .0608503    3.47    0.001    .0917161    .3302451
         age    .3539595    .1041269    3.40    0.001    .1498745    .5580445
        age2   -.0022691    .0006901   -3.29    0.001   -.0036218   -.0009165
      educyr    .0179365    .0055702    3.22    0.001     .007019    .0288539
       _cons   -12.79023    3.909087   -3.27    0.001    -20.4519   -5.128563

Class: 2
Response: docvis
Model: poisson

                             Robust
             Coefficient   std. err.       z    P>|z|    [95% conf. interval]
docvis
     private    .1151981     .065195    1.77    0.077   -.0125817     .242978
         age    .1919411    .1190989    1.61    0.107   -.0414884    .4253706
        age2   -.0012686    .0007857   -1.61    0.106   -.0028086    .0002713
       _cons   -4.583763    4.498159   -1.02    0.308   -13.39999    4.232467


We do not discuss the results because the example just illustrates the 
method of imposing restrictions. 


20.5.10 Model selection 


Choosing the best model involves tradeoffs between fit, parsimony, and ease 
of interpretation. Which of the six models that have been estimated best fits 
the data? 


The LR test can be used to test the Poisson against the NB2 model for the 
basic model, the hurdle model, and the two-component FMM models. In each 
case, this leads to strong rejection of the Poisson model. 


The remaining models are nonnested, so LR tests are not appropriate.
Instead, we use various information criteria. Table 20.1 summarizes three
commonly used model-comparison statistics (the log likelihood and the Akaike
and Bayes information criteria, AIC and BIC), explained in section 13.8.2.
The log likelihood for the hurdle model is simply the sum of log likelihoods
for the two parts of the model, whereas for the other models, it is directly
given as command output.


Table 20.1. Goodness-of-fit criteria for six models 


Model                        Parameters   Log likelihood      AIC       BIC
Poisson                           8         -15,019.64       30,055    30,105
NB2                               9         -10,589.384      21,197    21,253
Poisson hurdle                   16         -14,037.91       28,108    28,207
NB2 hurdle                       17         -10,493.23       21,020    21,126
FMM two-component Poisson        17         -12,100.19       24,234    24,340
FMM two-component NB2            19         -10,534.26       21,106    21,224


All three criteria suggest that the NB2 hurdle model provides the best fitting 
and the most parsimonious specification. Such an unambiguous outcome is 
not always realized. 
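
The AIC and BIC entries can be reproduced from the log likelihood, the number
of parameters, and the sample size. The sketch below does so for the Poisson
row of table 20.1, assuming the standard definitions AIC = -2 ln L + 2k and
BIC = -2 ln L + k ln N; the scalar names are arbitrary.

```stata
* AIC and BIC from a log likelihood, number of parameters, and sample size
* Poisson log likelihood and parameter count from table 20.1
scalar ll = -15019.64
scalar k  = 8
* Number of observations
scalar N  = 3677

scalar aic = -2*ll + 2*k
scalar bic = -2*ll + k*ln(N)
display "AIC = " %9.0fc aic
display "BIC = " %9.0fc bic
```

Because ln(3,677) is a little over 8, BIC charges roughly four times the AIC
penalty for each additional parameter in this sample, which is why the two
criteria can disagree when models differ in size.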


20.5.11 A cautionary note 


It is easy to overparameterize mixture models. When the number of 
components is small, say, 2, and the means of the component distribution are 
far apart, clear discrimination between the components will emerge. 
However, if this is not the case, and a larger value of m is specified, 
unambiguous identification of all components may be difficult because of 
the increasing overlap in the distributions. 


In particular, the presence of outliers may give rise to components that
account for a small proportion of the observations. For example, if $m = 3$,
$\pi_1 = 0.6$, $\pi_2 = 0.38$, and $\pi_3 = (1 - 0.6 - 0.38) = 0.02$, then
this means that the third component accounts for just 2% of the data. If 2%
of the sample is a small number, one might regard the result as indicating
the presence of extreme observations.


There are a number of indications of failure of identification or fragile 
identification of mixture components. We list several examples. First, the log 
likelihood may increase only slightly when additional components are 
added. Second, the log likelihood may “fall” when additional components 
are added, which could be indicative of a multimodal objective function. 
Third, one or more mixture components may be small in the sense of 
accounting for few observations. 


Finally, slow convergence is to be expected when the expectation 
maximization algorithm is used and identification of individual components 
is weak as suggested by a flat log likelihood. Therefore, it is advisable to use 
contextual knowledge and information when specifying, evaluating, and 
interpreting FMM output. 


20.6 Zero-inflated models 


The hurdle model treats zero counts as coming from a model that differs 
from that for positive counts. The motivation is that a hurdle needs to be 
crossed before the event of interest occurs. When the event occurs, it is 
modeled by a zero-truncated model for one or more event occurrences. 


The related zero-inflated model also has a separate model for zeros but 
additionally allows that zeros come from a regular count model. The 
motivation is that a recording mechanism may occasionally fail to record the 
count, with zero substituting for the missing observation, so the regular 
count model is supplemented by a model for these additional zero counts. 


The zero-inflated count model is a special case of a two-component FMM, 
a mixture of a binary outcome model and a count model. This special case is 
called a point-mass distribution, here with point mass at zero. 


20.6.1 Data summary 


The dataset used in this section overlaps heavily with that used in preceding 
sections of this chapter. It covers the same 3,677 individuals from a cross- 
sectional sample of persons aged 65 and higher from the U.S. Medical 
Expenditure Panel Survey for 2003. 


The main change is that the dependent count variable is now the number 
of emergency room visits (er) by the survey respondent because this has 
many more zeros. An emergency room visit is a rare event for the Medicare 
elderly population who have access to care through their public insurance 
program and hence do not need to use emergency room facilities as the only 
available means of getting care. 


The full set of explanatory variables in the model was initially the same
as that used in the docvis example. However, after some preliminary
analysis, this list was reduced to just three health-status variables: age,
presence of activity limitation (actlim), and number of chronic conditions
(totchr). The summary statistics follow, along with a tabulation of the
frequency distribution for er.


. * Summary statistics for emergency room visits
. qui use mus220mepsemergroom, clear

. global xlist1 age i.actlim totchr

. summarize er $xlist1


    Variable        Obs        Mean    Std. dev.       Min        Max
          er      3,677    .2774001    .6929326          0         10
         age      3,677    74.24476    6.376638         65         90
      actlim
           0      3,677     .666848    .4714045          0          1
           1      3,677     .333152    .4714045          0          1
      totchr      3,677    1.843351    1.350026          0          8


. tabulate er 


# emergency
room visits Freq. Percent Cum.
0 2,967 80.69 80.69 
1 515 14.01 94.70 
2 128 3.48 98.18 
3 40 1.09 99.27 
4 15 0.41 99.67 
5 8 0.22 99.89 
6 2 0.05 99.95 
7 1 0.03 99.97 
10 1 0.03 100.00 
Total 3,677 100.00 


Compared with docvis, the er variable has a much higher proportion 
(80.7%) of zeros. The first four values (0, 1, 2, 3) account for over 99% of 
the probability mass of er. 


In itself, this does not imply that we have the “excess zero” problem.
Given the mean value of 0.2774, a Poisson distribution predicts that
$\Pr(Y = 0) = e^{-0.2774} = 0.758$. The observed proportion of 0.807 is
higher than this, but the difference could potentially be explained by the
regressors in the model. So there is no need to jump to the conclusion that a
zero-inflated variant is essential.
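
This back-of-the-envelope comparison is easy to reproduce. The sketch below
uses the sample mean of er and the frequency of zeros reported in the output
above; the scalar names are arbitrary.

```stata
* Zero probability implied by a Poisson with the sample mean of er
scalar mean_er  = .2774001
scalar pr0_pois = exp(-mean_er)

* Observed share of zeros: 2,967 of the 3,677 observations
scalar pr0_obs  = 2967/3677

display "Poisson Pr(Y = 0)       = " %5.3f pr0_pois
display "Observed share of zeros = " %5.3f pr0_obs
```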


20.6.2 Models for zero-inflated data 


The zero-inflated model was originally proposed to handle data with excess
zeros relative to the Poisson model. Like the hurdle model, it supplements a
count density, $f_2(\cdot)$, with a binary process with a density of
$f_1(\cdot)$. If the binary process takes on a value of 0, with a probability
of $f_1(0)$, then $y = 0$. If the binary process takes on a value of 1, with a
probability of $f_1(1)$, then $y$ takes on the count values $0, 1, 2, \ldots$
from the count density $f_2(\cdot)$. This lets zero counts occur in two ways:
as a realization of the binary process and as a realization of the count
process when the binary random variable takes on a value of 1.


Suppressing regressors for notational simplicity, the zero-inflated model 
has a density of 


$$f(y) = \begin{cases} f_1(0) + \{1 - f_1(0)\} f_2(0) & \text{if } y = 0 \\ \{1 - f_1(0)\} f_2(y) & \text{if } y \geq 1 \end{cases}$$


As in the case of the hurdle model, the probability $f_1(0)$ may be a constant
or may be parameterized through a binomial model like the logit or probit.
Once again, the set of variables in the $f_1(\cdot)$ density need not be the
same as those in the $f_2(\cdot)$ density.
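
As a small numerical check on this density, the sketch below evaluates a
zero-inflated Poisson at assumed parameter values. Both the probability of an
extra zero and the Poisson mean are hypothetical values chosen for
illustration; poissonp() and poissontail() are Stata's Poisson probability
mass and upper-tail functions.

```stata
* Zero-inflated Poisson probabilities (hypothetical parameter values)
* Assumed f1(0), the probability that the binary process produces a zero
scalar p_zero = 0.25
* Assumed mean of the Poisson count density f2
scalar mu     = 0.35

* Zeros arise from the binary process or from the count process
scalar pr0      = p_zero + (1 - p_zero)*poissonp(mu, 0)
* Positive counts arise only from the count process
scalar pr1      = (1 - p_zero)*poissonp(mu, 1)
scalar prge2    = (1 - p_zero)*poissontail(mu, 2)
scalar prsum    = pr0 + pr1 + prge2
scalar pr0_pois = poissonp(mu, 0)

display "ZIP Pr(y = 0)      = " %6.4f pr0
display "Poisson Pr(y = 0)  = " %6.4f pr0_pois
display "ZIP Pr(y = 1)      = " %6.4f pr1
display "ZIP Pr(y >= 2)     = " %6.4f prge2
display "Sum of ZIP probs   = " %6.4f prsum
```

The probabilities sum to one, and the zero probability exceeds that of a
Poisson with the same count-process mean, which is the sense in which the
model is zero inflated.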


To estimate the parameters of the zero-inflated Poisson (ZIP) and zero- 
inflated NB (ZINB) models, we use the estimation commands zip and zinb, 
respectively. The partial syntax for zip is 


zip depvar [indepvars] [if] [in] [weight],
    inflate(varlist[, offset(varname)] | _cons) [options]


where inflate(varlist) specifies the variables, if any, that determine the
probability of an excess zero; this binary process is modeled as logit (the
default) or probit (the probit option).
Other options are essentially the same as for poisson.


The partial syntax for zinb is essentially the same as that for zip. Other 
options are the same as for nbreg. The only NB model fit is an NB2 model. 


For the Poisson and NB models, the count process has conditional mean
$\exp(\mathbf{x}_2'\boldsymbol{\beta}_2)$, and the corresponding with-zeros
model can be shown to have conditional mean

$$E(y|\mathbf{x}) = \{1 - f_1(0|\mathbf{x}_1)\} \times \exp(\mathbf{x}_2'\boldsymbol{\beta}_2) \qquad (20.10)$$

where $1 - f_1(0|\mathbf{x}_1)$ is the probability that the binary process
variable equals 1. The MEs are complicated by the presence of regressors in
both parts of the model, as for the hurdle model. But if the binary process
does not depend on regressors, so $f_1(0|\mathbf{x}_1) = f_1(0)$, then the
parameters, $\boldsymbol{\beta}_2$, can be directly interpreted as
semielasticities, as for the regular Poisson and NB models.
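
The result in (20.10) follows in a few lines from the zero-inflated density,
because the zeros contribute nothing to the mean. A short derivation,
suppressing regressors and writing $\mu_2$ for the mean of the count density
$f_2(\cdot)$ (a symbol introduced here only for illustration):

```latex
\begin{aligned}
E(y) &= 0 \times \left[ f_1(0) + \{1 - f_1(0)\} f_2(0) \right]
        + \sum_{y=1}^{\infty} y \, \{1 - f_1(0)\} f_2(y) \\
     &= \{1 - f_1(0)\} \sum_{y=0}^{\infty} y \, f_2(y) \\
     &= \{1 - f_1(0)\} \, \mu_2
\end{aligned}
```

Setting $\mu_2 = \exp(\mathbf{x}_2'\boldsymbol{\beta}_2)$ and letting
$f_1(0)$ depend on $\mathbf{x}_1$ gives the conditional mean stated in the
text.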


After the zip and zinb commands, the predicted mean in (20.10) can be
obtained by using the predict postestimation command, and the margins
command can be used to obtain the AME, MEM, or MER.


20.6.3 Results for the NB2 model 


Before fitting a zero-inflated model, we apply the NB2 model to emergency
room visits.


. * NB2 for emergency room visits 
. nbreg er $xlisti, nolog vce(robust) 


Negative binomial regression                      Number of obs =  3,677
                                                  Wald chi2(3)  = 210.17
Dispersion: mean                                  Prob > chi2   = 0.0000
Log pseudolikelihood = -2314.4927                 Pseudo R2     = 0.0464

                             Robust
          er Coefficient   std. err.       z    P>|z|    [95% conf. interval]
         age    .0088528     .006115    1.45    0.148   -.0031324    .0208381
    1.actlim    .6859572    .0865077    7.93    0.000    .5164052    .8555091
      totchr    .2514885    .0297963    8.44    0.000    .1930889    .3098882
       _cons   -2.799848    .4636337   -6.04    0.000   -3.708554   -1.891143

    /lnalpha    .4464685    .1299578                     .1917558    .7011812

       alpha    1.562783    .2030959                     1.211375    2.016133


There is statistically significant overdispersion with $\widehat{\alpha} = 1.56$.
The coefficient estimates are similar to those for the Poisson model (not given).


The regression equation has low but statistically significant explanatory 
power. For an event that is expected to have a high degree of inherent 
randomness, low overall explanatory power is to be expected. Having an 
activity limitation and a high number of chronic conditions is positively 
associated with er visits. 


20.6.4 Results for the ZINB model 


The parameters of the ZINB model are estimated by using the zinb command.
We use the same set of regressors in the two parts of the model.


. * Zero-inflated NB2 for emergency room visits
. zinb er $xlist1, inflate($xlist1) nolog vce(robust)

Zero-inflated negative binomial regression        Number of obs =  3,677
Inflation model: logit                            Nonzero obs   =    710
                                                  Zero obs      =  2,967
                                                  Wald chi2(3)  =  22.81
Log pseudolikelihood = -2304.868                  Prob > chi2   = 0.0000

                             Robust
          er Coefficient   std. err.       z    P>|z|    [95% conf. interval]
er
         age    .0035485    .0075037    0.47    0.636   -.0111585    .0182555
    1.actlim    .2743106    .2278723    1.20    0.229   -.1723109    .7209321
      totchr    .1963408    .0727287    2.70    0.007    .0537951    .3388866
       _cons   -1.822978    .6681715   -2.73    0.006    -3.13257   -.5133861
inflate
         age   -.0236763    .0281379   -0.84    0.400   -.0788255    .0314729
    1.actlim    -4.22705    22.47385   -0.19    0.851   -48.27499    39.82089
      totchr   -.3471091    .2899582   -1.20    0.231   -.9154167    .2211986
       _cons    1.846526    2.039352    0.91    0.365   -2.150531    5.843584

    /lnalpha    .1602371     .231154    0.69    0.488   -.2928164    .6132905

       alpha    1.173789     .271326                     .7461591    1.846497


The estimated coefficients differ from those from the NB2 model. The two
models have different conditional means, see (20.10), so the coefficients
are not directly comparable.


The first set of output for the NB2 component implies that er increases as
the regressors increase. The second set of output for the logit model
component shows that the probability of a zero decreases as the regressors
increase, which again implies an increase in er as the regressors increase.
More generally, there need not be such consistency of coefficient signs
across the two components.


ME for ZINB model 


The ZINB model estimates are most easily interpreted by obtaining the AMEs 
using the margins command. We obtain 


. * ZINB: AMEs 
. qui zinb er $xlist1, inflate($xlist1) vce(robust)


. margins, dydx(*) 


Average marginal effects                          Number of obs = 3,677
Model VCE: Robust

Expression: Predicted number of events, predict()
dy/dx wrt:  age 1.actlim totchr

                          Delta-method
                  dy/dx    std. err.       z    P>|z|    [95% conf. interval]
         age    .0020337    .0016783    1.21    0.226   -.0012557     .005323
    1.actlim    .2032745    .0267769    7.59    0.000    .1507928    .2557563
      totchr    .0698286    .0086039    8.12    0.000    .0529651     .086692

Note: dy/dx for factor levels is the discrete change from the base level.


By comparison, the AMEs for the NB2 models are 


. * NB2: AMEs 
. qui nbreg er $xlist1, vce(robust) 


. margins, dydx(*) 


Average marginal effects                          Number of obs = 3,677
Model VCE: Robust

Expression: Predicted number of events, predict()
dy/dx wrt:  age 1.actlim totchr

                          Delta-method
                  dy/dx    std. err.       z    P>|z|    [95% conf. interval]
         age    .0024632    .0017054    1.44    0.149   -.0008793    .0058057
    1.actlim    .1961981    .0259474    7.56    0.000    .1453421    .2470541
      totchr    .0699732    .0088665    7.89    0.000    .0525951    .0873513

Note: dy/dx for factor levels is the discrete change from the base level.


The AMEs for the statistically insignificant regressor age differ across the two 
models, while the AMEs for actlim and totchr are very similar. 


20.6.5 Point-mass version of zip and zinb commands 


Exploiting the insight that a point-mass distribution is also a finite mixture
model, Stata’s fmm prefix now provides an alternative to the zip and zinb
commands.


. * Zero-inflated NB2 estimated using fmm command with point-mass zero 
. fmm, nolog vce(robust): (pointmass er, lcprob($xlist1)) (nbreg er $xlist1) 


Finite mixture model                              Number of obs = 3,677
Log pseudolikelihood = -2304.8677

                             Robust
             Coefficient   std. err.       z    P>|z|    [95% conf. interval]
1.Class
         age    -.023672     .028149   -0.84    0.400   -.0788431    .0314991
    1.actlim   -4.219059    23.91722   -0.18    0.860   -51.09596    42.65784
      totchr   -.3470913    .2964756   -1.17    0.242   -.9281729    .2339903
       _cons     1.84657    2.041192    0.90    0.366   -2.154094    5.847233
2.Class      (base outcome)

Class: 2
Response: er
Model: nbreg, dispersion(mean)

                             Robust
             Coefficient   std. err.       z    P>|z|    [95% conf. interval]
er
         age     .003547    .0075502    0.47    0.639   -.0112511     .018345
    1.actlim    .2742822    .2332858    1.18    0.240   -.1829496     .731514
      totchr    .1963124    .0760157    2.58    0.010    .0473244    .3453003
       _cons   -1.822645    .6810172   -2.68    0.007   -3.157414   -.4878756
/er
     lnalpha    .1599948    .2435035                    -.3172633    .6372529


The results are the same as those from the zinb command, aside from a
computational difference that disappears by setting the ltolerance() option
to a smaller value than the program default of 1e-7.


The postestimation commands estat lcmean and estat lcprob provide
summary statistics.


. * Zero-inflated NB2 estimated using fmm command: Summary statistics
. predict mu*
(option mu assumed)

. summarize mu*

    Variable        Obs        Mean    Std. dev.        Min        Max
         mu1      3,677    .3516795    .1333575    .2034994   1.315176

. estat lcmean

Latent class marginal means                       Number of obs = 3,677

Expression: Predicted mean (# emergency room visits in class 2.Class),
            predict(outcome(er) class(2))

                          Delta-method
                 Margin    std. err.       z    P>|z|    [95% conf. interval]
2
          er   .3516795    .0468653    7.50    0.000    .2598253    .4435337

. estat lcprob

Latent class marginal probabilities               Number of obs = 3,677

                          Delta-method
                 Margin    std. err.     [95% conf. interval]
Class
           1   .2722706    .1017516     .1202945    .5058444
           2   .7277294    .1017516     .4941556    .8797055


The excess zeros arise with average probability 0.272, the regular NB2 count
process arises with average probability 0.728, and the regular process counts
have average 0.352. So the average prediction is
$0.272 \times 0 + 0.728 \times 0.352 = 0.256$, compared with the sample
average 0.277.


20.6.6 Model comparison 


Testing the ZINB model against the simpler NB model is a nonstandard test.
Versions up to Stata 14 provided an option to implement the LR test of
Vuong (1989) to discriminate between the NB and ZINB models. However,
this test is not appropriate because the ZINB model reduces to the NB model
only at the boundary of the parameter space for the logit model [so that
$f_1(0) = 0$]; see Wilson (2015). Similar issues arise for testing the ZIP
model against the simpler Poisson model.


Instead, we compare the models on the basis of average predicted 
probabilities and information criteria. 


In principle, model comparison is made simple by applying the
community-contributed countfit command. Ideally, the command


. * Comparison of NB and ZINB using countfit
. countfit er $xlist1, nbreg zinb nograph noestimates


fits both NB2 and ZINB models and gives average fitted frequencies and 
information criteria for the two models. At the time of writing, however, this 
command does not work for the zinb option, because it uses the Vuong test, 
which Stata no longer supports. 


Instead, we use two separate and distinct commands. For the NB2 model,
we use countfit for average fitted and actual frequencies, along with estat
ic. We obtain


. * NB2 model: Average fitted frequencies and information criteria
. countfit er age actlim totchr, nbreg nograph noestimates nofit

Comparison of Mean Observed and Predicted Count

             Maximum        At       Mean
Model     Difference     Value     |Diff|
NBRM           0.001         1      0.000

NBRM: Predicted and actual probabilities

Count     Actual   Predicted    |Diff|   Pearson
0          0.807       0.807     0.000     0.001
1          0.140       0.139     0.001     0.047
2          0.035       0.036     0.001     0.053
3          0.011       0.011     0.000     0.040
4          0.004       0.004     0.000     0.001
5          0.002       0.002     0.001     0.558
6          0.001       0.001     0.000     0.181
7          0.000       0.000     0.000     0.052
8          0.000       0.000     0.000     0.610
9          0.000       0.000     0.000     0.308
Sum        1.000       1.000     0.004     1.850


. qui nbreg er $xlist1 
. estat ic 


Akaike’s information criterion and Bayesian information criterion 


       Model        N    ll(null)   ll(model)    df        AIC        BIC
           .    3,677   -2427.068   -2314.493     5   4638.985   4670.035


Note: BIC uses N = number of observations. See [R] BIC note. 


The average predicted probabilities are close to actual frequencies. 


For the ZINB model, we use the zinb command, followed by the
prcounts command for average fitted probabilities, along with estat ic.
We obtain


. * ZINB model: Average fitted frequencies and information criteria 
. qui zinb er $xlist1, inflate($xlist1) vce(robust) 


. prcounts erpr, max(9) 


. summarize erprpr* 


    Variable        Obs        Mean    Std. dev.         Min        Max
    erprpr0       3,677    .8083271    .0967287      .451979   .9293385
    erprpr1       3,677    .1345371    .0549748     .0579817   .2367817
    erprpr2       3,677     .038742    .0246506     .0103499   .1310115
    erprpr3       3,677    .0120868    .0108068     .0018967   .0755851
    erprpr4       3,677    .0040225    .0047706     .0003521   .0441736
    erprpr5       3,677     .001416    .0021523     .0000659   .0260145
    erprpr6       3,677    .0005238     .000999     .0000124   .0153982
    erprpr7       3,677    .0002026    .0004783     2.34e-06   .0091473
    erprpr8       3,677    .0000815    .0002362     4.42e-07   .0054486
    erprpr9       3,677     .000034    .0001201     8.38e-08   .0032523
    erprprgt      3,677    .0000267    .0001371    -1.19e-07   .0048559

. estat ic 


Akaike’s information criterion and Bayesian information criterion 


       Model        N    ll(null)   ll(model)    df        AIC        BIC
           .    3,677   -2322.013   -2304.868     9   4627.735   4683.624


Note: BIC uses N = number of observations. See [R] BIC note. 


The average predicted probabilities from the ZINB model are not as close to
the actual frequencies as those from the NB2 model.


The AIC statistic favors the ZINB model, whereas BIC, which has a larger
penalty for model size, favors the NB2 model (as 4670 < 4684).


This example indicates that having many zeros in the dataset does not 
automatically mean that a zero-inflated model is necessary. For these data, 
the ZINB model is only a slight improvement on the NB2 model and is actually 
no improvement at all if BIC is used as the model-selection criterion. 
Furthermore, the NB2 model has the advantage of easier interpretation of 
parameter estimates. 


20.7 Endogenous regressors 


So far, the regressors in the count regression are assumed to be exogenous. 
We now consider a more general model in which a regressor is endogenous.
Specifically, the empirical example used in this chapter has assumed that the 
regressor private is exogenous. But individuals can and do choose whether 
they want supplementary private insurance, and hence potentially this 
variable is endogenous, that is, jointly determined with docvis. If 
endogeneity is ignored, the standard single-equation estimator will be 
inconsistent. 


The general issues are similar to those already presented in section 17.9 
for endogeneity in the probit model. Cameron and Trivedi (2005, chap. 20.6) 
and especially Wooldridge (2010, chap. 18.5) provide details. 


There are two distinct general approaches. One approach specifies a
structural equation for the count outcome ($y_1$) (usually with an exponential
conditional mean function) and first-stage equations for endogenous
regressors, say, $y_2$. The model is assumed to be recursive in the sense that
there is no direct feedback from $y_1$ to $y_2$; see section 7.2.3. If the
joint distribution of $y_1$ and $y_2$ can be parametrically specified, then
estimation is by ML. Alternatively, a control function approach enables
two-step estimation under weaker assumptions.


An alternative, less parametric approach is nonlinear instrumental 
variables (IV) or generalized method of moments (GMM) based on
orthogonality of instrumental variables with the “error” term in the structural 
model. This approach is applicable to any dependent variable for which an 
exponential conditional mean model is appropriate and for endogenous 
regressors that need not be continuous. 


20.7.1 ML estimator of structural model 


The structural-model approach defines explicit models for both the
dependent variable of interest ($y_1$) and the endogenous regressor ($y_2$).


Model and assumptions 


The structural equation for the count outcome is a Poisson model for $y_1$
with a mean that depends on an endogenous regressor $y_2$. Specifically,

$$y_{1i} \sim \text{Poisson}(\mu_i)$$
$$\mu_i = E(y_{1i}|y_{2i}, \mathbf{x}_{1i}, u_{1i}) = \exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2 + u_{1i}) \qquad (20.11)$$


This model introduces an error term $u_1$ that is uncorrelated with a vector
of exogenous variables $\mathbf{x}_1$ but is potentially correlated with
$y_2$, in which case $y_2$ is endogenous and the usual Poisson quasi-ML
estimator is inconsistent.


A linear first-stage equation is specified for $y_2$, with

$$y_{2i} = \mathbf{x}_{1i}'\boldsymbol{\gamma}_1 + \mathbf{x}_{2i}'\boldsymbol{\gamma}_2 + \varepsilon_i \qquad (20.12)$$


where $\mathbf{x}_2$ is a vector of exogenous variables that affects $y_2$
nontrivially but does not directly affect $y_1$ and hence is an independent
source of variation in $y_2$. It is standard to refer to this as an exclusion
restriction and to refer to $\mathbf{x}_2$ as excluded exogenous variables or
instrumental variables. By convention, a condition for robust identification
of (20.11), as in the case of the linear model, is that there is available at
least one valid excluded variable (instrument). When only one such variable is
present in (20.12), the model is said to be just identified, and it is said to
be overidentified if there are additional excluded variables.


Endogeneity arises because the errors $u_{1i}$ and $\varepsilon_i$ are correlated. 
Specifically, assume that 


$$u_{1i} = \rho\varepsilon_i + \eta_i \tag{20.13}$$


where $\eta_i \sim [0, \sigma_\eta^2]$ is independent of $\varepsilon_i \sim [0, \sigma_\varepsilon^2]$. Then $\varepsilon_i$ is a common latent 
factor that affects both $y_1$ and $y_2$ and is the only source of dependence 
between them, after controlling for the influence of the observable variables 
$\mathbf{x}_1$ and $\mathbf{x}_2$. The variable $y_2$ is endogenous in the model for $y_1$, unless $\rho = 0$, 
because both $y_2$ and $u_1$ appear in the model (20.11) and both depend on $\varepsilon$, 
via (20.12) and (20.13). 


The error term $u_{1i}$ can be interpreted as unobserved heterogeneity that 
has a multiplicative effect on the conditional mean because 
$\mu_i = \exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2) \times e^{u_{1i}}$. Note that the error term not only induces 
endogeneity in (20.13) but also induces overdispersion, so that the Poisson 
model has been generalized to control for overdispersion as would be the 
case if an NB model were used. 


ML estimation 


ML estimation of this model, under the additional assumption of normally 
distributed errors $\varepsilon_i$ and $\eta_i$, can be obtained using the gsem command 
detailed in section 23.6. 


For the specific application outlined below, complete ML estimates are 
presented in section 23.6.5 and are not reproduced here. We merely note that 
ML estimation leads to the endogenous regressor private having coefficient 
0.633 with standard error 0.219. 


20.7.2 Control function estimator of structural equation 


A control function estimator, or residual-augmentation estimator, obtains 
consistent estimates of the structural equation by adding an appropriate 
residual term as an additional regressor. For the current count application, it 
relies only on appropriate conditional mean assumptions and does not 
require the Poisson and normal distributional assumptions of the preceding 
ML approach. 


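Substituting (20.13) for $u_{1i}$ in (20.11) yields 
$\mu = \exp(\beta_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_2 + \rho\varepsilon)e^{\eta}$. Taking the expectation with respect to $\eta$ 
yields 

$$E_\eta(\mu) = \exp(\beta_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_2 + \rho\varepsilon) \times E(e^{\eta}) = \exp\{\beta_1 y_2 + \ln E(e^{\eta}) + \mathbf{x}_1'\boldsymbol{\beta}_2 + \rho\varepsilon\}$$

The constant term $\ln E(e^{\eta})$ can be absorbed in the coefficient of the 
intercept, a component of $\mathbf{x}_1$. It follows that 


$$\mu_i|\mathbf{x}_{1i}, y_{2i}, \varepsilon_i = \exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2 + \rho\varepsilon_i) \tag{20.14}$$


where $\varepsilon_i$ is a new additional variable. 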


If $\varepsilon_i$ were observable, including it as a regressor would control for the 
endogeneity of $y_2$. Given that it is unobservable, the estimation strategy is to 
replace it by a consistent estimate. 


The following two-step estimation procedure is used: First, fit the 
first-stage model (20.12) by OLS, and generate the residuals $\hat{\varepsilon}_i$. This requires the 
assumption that an instrument exists and that (20.12) is correctly specified so 
that $E(\varepsilon_i|\mathbf{x}_{1i},\mathbf{x}_{2i}) = 0$. Second, estimate parameters of the Poisson model 
given in (20.14) after replacing $\varepsilon_i$ by $\hat{\varepsilon}_i$. As discussed below, if $\rho = 0$, then 
we can use the vce(robust) option, but if $\rho \neq 0$, then the VCE needs to be 
estimated with the bootstrap method detailed in section 12.4.5 that controls 
for the estimation of $\varepsilon_i$ by $\hat{\varepsilon}_i$ or by a method similar to that in 
section 13.3.11. 
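The two-step logic can be checked on simulated data where the true parameters are known. The following is a minimal sketch, meant to be run as a do-file: all variable names (x1, x2, eps, eta, u1, y1, y2, ehat) and all parameter values are hypothetical choices made for illustration, with data generated according to (20.11)-(20.13) with $\beta_1 = 0.2$ and $\rho = 0.5$.

```stata
* Sketch: control function estimator on simulated data (hypothetical names/values)
clear
set seed 10101
set obs 5000
generate x1  = rnormal()                 // included exogenous regressor
generate x2  = rnormal()                 // excluded instrument
generate eps = rnormal()                 // first-stage error
generate eta = rnormal(0, 0.3)           // independent component of u1
generate u1  = 0.5*eps + eta             // (20.13) with rho = 0.5
generate y2  = 0.5*x1 + 1*x2 + eps       // (20.12)
generate y1  = rpoisson(exp(0.2*y2 + 0.3*x1 + u1))   // (20.11)

* Poisson that ignores endogeneity of y2
poisson y1 y2 x1, vce(robust)

* Step 1: first-stage OLS and residuals
regress y2 x1 x2
predict double ehat, residuals

* Step 2: Poisson augmented with the first-stage residual
poisson y1 y2 x1 ehat, vce(robust)
```

In this design, the coefficient of y2 in the first Poisson regression should be pushed away from 0.2, while the augmented regression should recover it more closely, with the coefficient of ehat estimating $\rho$. The robust standard errors in the last step ignore the first-stage estimation, so they are appropriate only under $\rho = 0$.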


Application 


We apply this two-step procedure to the Poisson model for the doctor-visits 
data analyzed in section 20.3, with the important change that private is now 
treated as endogenous. 


Two excluded variables used as instruments are income and ssiratio. 
The first is a measure of total household income, and the second is the ratio 
of social security income to total income. Jointly, the two variables reflect 
the affordability of private insurance. A high value of income makes private 
insurance more accessible, whereas a high value of ssiratio indicates an 
income constraint and is expected to be negatively associated with private. 
For these to be valid instruments, we need to assume that for people aged 
65-90 years, doctor visits are not determined by income or ssiratio, after 
controlling for other regressors that include a quadratic in age, education, 
health-status measures, and access to Medicaid. 


The first step generates residuals from a linear probability regression of 
private on regressors and instruments. 


. * Poisson control function estimator: First-stage linear regression 
. qui use mus220mepsdocvis, clear 


. global xlist2 medicaid age age2 educyr actlim totchr 


. regress private $xlist2 income ssiratio, vce(robust) 


Linear regression                               Number of obs     =      3,677
                                                F(8, 3668)        =     249.61
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2108
                                                Root MSE          =     .44472

------------------------------------------------------------------------------
             |               Robust
     private | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    medicaid |  -.3934477   .0173623   -22.66   0.000    -.4274884   -.3594071
         age |  -.0831201   .0293734    -2.83   0.005    -.1407098   -.0255303
        age2 |   .0005257   .0001959     2.68   0.007     .0001417    .0009098
      educyr |   .0212523   .0020492    10.37   0.000     .0172345      .02527
      actlim |  -.0300936   .0176874    -1.70   0.089    -.0647718    .0045845
      totchr |   .0185063    .005743     3.22   0.001     .0072465    .0297662
      income |   .0027416   .0004736     5.79   0.000     .0018131    .0036702
    ssiratio |  -.0647637   .0211178    -3.07   0.002    -.1061675   -.0233599
       _cons |   3.531058    1.09581     3.22   0.001       1.3826    5.679516
------------------------------------------------------------------------------


. predict lpuhat, residual 


The two instruments, income and ssiratio, are highly statistically 
significant with expected signs. 


The second step fits a Poisson model on regressors that include the first- 
step residual. 


. * Poisson control function estimator: Second-stage poisson regression 
. poisson docvis private $xlist2 lpuhat, vce(robust) nolog 


Poisson regression                                      Number of obs =  3,677
                                                        Wald chi2(8)  = 718.87
                                                        Prob > chi2   = 0.0000
Log pseudolikelihood = -15010.614                       Pseudo R2     = 0.1303

------------------------------------------------------------------------------
             |               Robust
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .5505541   .2453175     2.24   0.025     .0697407    1.031368
    medicaid |   .2628822   .1197162     2.20   0.028     .0282428    .4975217
         age |   .3350604   .0696064     4.81   0.000     .1986344    .4714865
        age2 |  -.0021923   .0004576    -4.79   0.000    -.0030893   -.0012954
      educyr |    .018606   .0080461     2.31   0.021      .002836     .034376
      actlim |   .2053417   .0414248     4.96   0.000     .1241505     .286533
      totchr |     .24147   .0129175    18.69   0.000     .2161523    .2667878
      lpuhat |  -.4166838    .249347    -1.67   0.095    -.9053949    .0720272
       _cons |  -11.90647   2.661445    -4.47   0.000     -17.1228    -6.69013
------------------------------------------------------------------------------


The $z$ statistic for the coefficient of lpuhat provides the basis for a 
robust Wald test of the null hypothesis of exogeneity, $H_0: \rho = 0$. The $z$ 
statistic has a $p$-value of 0.095 against $H_1: \rho \neq 0$, leading to nonrejection of 
$H_0$ at the 0.05 level. But a one-sided test against $H_1: \rho < 0$ may be 
appropriate because this was proposed on a priori grounds. Then the $p$-value 
is 0.048, leading to rejection of $H_0$ at the 0.05 level. 
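The two-sided and one-sided $p$-values can be recomputed from the reported coefficient and robust standard error of lpuhat. This is a small sketch that types in the values from the output above; the scalar name z is arbitrary, and the one-sided value may differ in the third decimal from halving the rounded two-sided $p$-value.

```stata
* Sketch: p-values for the exogeneity test from reported estimates
scalar z = -.4166838/.249347
display "z statistic             = " %6.3f z
display "two-sided p (H1: rho!=0) = " %6.4f 2*normal(-abs(z))
display "one-sided p (H1: rho<0)  = " %6.4f normal(z)
```

After fitting the second-step model interactively, the same statistic is available as _b[lpuhat]/_se[lpuhat].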


Bootstrapped standard errors 


If $\rho \neq 0$, then the VCE of the second-step estimator needs to be adjusted for 
the replacement of $\varepsilon_i$ with $\hat{\varepsilon}_i$. One way to do so is to adapt the method given 
in section 13.3.11 for the linear model. Here we instead bootstrap as in 
section 12.4.5. We have 


. * Poisson control function estimator: Obtain bootstrap standard errors
> for estimator
. program endogtwostep, eclass
    version 17
    tempname b
    capture drop lpuhat2
    regress private $xlist2 income ssiratio
    predict lpuhat2, residual
    poisson docvis private $xlist2 lpuhat2
    matrix `b' = e(b)
    ereturn post `b'
  end

. bootstrap _b, reps(400) seed(10101) nodots nowarn: endogtwostep

Bootstrap results                                       Number of obs =  3,677
                                                        Replications  =    400

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .5505541   .2799587     1.97   0.049     .0018452    1.099263
    medicaid |   .2628822   .1284541     2.05   0.041     .0111169    .5146476
         age |   .3350604   .0744832     4.50   0.000      .189076    .4810449
        age2 |  -.0021923   .0004888    -4.48   0.000    -.0031504   -.0012342
      educyr |    .018606   .0091275     2.04   0.042     .0007164    .0364957
      actlim |   .2053417   .0435134     4.72   0.000     .1200571    .2906264
      totchr |     .24147   .0134558    17.95   0.000     .2150971     .267843
     lpuhat2 |  -.4166838   .2827737    -1.47   0.141    -.9709101    .1375424
       _cons |  -11.90647   2.851588    -4.18   0.000    -17.49548   -6.317456
------------------------------------------------------------------------------


The standard errors differ little from the previous standard errors obtained by 
using the option vce(robust). From section 20.3.2, the Poisson ML estimate 
of the coefficient on private was 0.142 with a robust standard error of 
0.036. The two-step estimate of the coefficient on private is 0.551 with a 
standard error of 0.245. (The ML estimate given in section 23.6.5 is 0.633 
with standard error 0.219.) 


The precision of estimation is much lower than in the exogenous case, with a 
standard error that is seven times larger. This large increase is very common 
for cross-sectional data, where instruments are not very highly correlated 
with the regressor being instrumented. At the same time, the coefficient is 
four times larger, and so the regressor retains statistical significance. The 
effect is now very large, with private insurance leading to a 
$100(e^{0.551} - 1) = 73\%$ increase in doctor visits. 
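The arithmetic behind these comparisons is easy to reproduce. The sketch below simply types in the estimates reported above (0.142 and 0.036 for the exogenous Poisson, 0.5505541 and 0.2453175 for the two-step estimator); nothing is re-estimated.

```stata
* Sketch: effect size and precision comparisons from reported estimates
display "Percent change in visits : " %5.1f 100*(exp(.5505541) - 1)
display "Ratio of standard errors : " %5.1f .2453175/.036
display "Ratio of coefficients    : " %5.1f .5505541/.142
```

The percentage uses the exact transformation $100\{\exp(\beta_1) - 1\}$ for a binary regressor in an exponential mean model rather than the approximation $100\beta_1$.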


The negative coefficient of lpuhat2 can be interpreted to mean that the 
latent factor, which increases the probability of purchasing private insurance, 
lowers the number of doctor visits, an effect consistent with favorable 
selection, according to which the relatively healthy individuals self-select 
into insurance. Controlling for endogeneity has a substantial effect on the ME 
of an exogenous change in private insurance because the coefficient of 
private and the associated MEs are now much higher. 


20.7.3 Nonlinear IV (GMM) estimators 


An alternative method for controlling for endogeneity is nonlinear IV or GMM 
based on a moment condition that does not require specification of model or 
error distributions. 


The literature on count-data models makes a case for both additive and 
multiplicative forms of heterogeneity; in turn, these imply different moment 
functions and hence potentially different GMM estimates. See, for example, 
Cameron and Trivedi (2005, chap. 20.6) for explanation. 


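The nonlinear IV presented in section 16.8 can be motivated as follows. 
Suppose $y_{1i} = \exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2) + u_{1i}$, so the error term $u_{1i}$ has an 
additive effect. Then $u_{1i} = y_{1i} - \exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2)$, and we assume 
existence of instruments $\mathbf{z}_i$ satisfying $E(u_{1i}|\mathbf{z}_i) = 0$ and hence 
$E(\mathbf{z}_i u_{1i}) = \mathbf{0}$. This leads to moment condition 


$$E[\mathbf{z}_i\{y_{1i} - \exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2)\}] = \mathbf{0} \tag{20.15}$$


where the instruments $\mathbf{z}_i = (\mathbf{x}_{1i}', \mathbf{x}_{2i}')'$, with $\mathbf{x}_{2i}$ being additional variables 
necessary for identification. 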


Mullahy (1997) noted that for count data and, more generally, for an 
exponential conditional mean, it is more natural to work with a 
multiplicative error, as in (20.11). Let $y_{1i} = \exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2)\,w_{1i}$. 
Assume there are instruments $\mathbf{z}_i$ satisfying $E(w_{1i}|\mathbf{z}_i) = 1$, where there is no 
loss in generality in setting the expectation to 1 if $\mathbf{x}_{1i}$ includes a constant. 
Now $E(w_{1i} - 1|\mathbf{z}_i) = 0$ by assumption and $w_{1i} = y_{1i}/\exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2)$, 
so $E\{y_{1i}/\exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2) - 1|\mathbf{z}_i\} = 0$. This leads to moment 
condition 


$$E[\mathbf{z}_i\{y_{1i}/\exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2) - 1\}] = \mathbf{0} \tag{20.16}$$


where again the instruments $\mathbf{z}_i = (\mathbf{x}_{1i}', \mathbf{x}_{2i}')'$, with $\mathbf{x}_{2i}$ being additional 
variables necessary for identification. 
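To make the step from a population moment condition to an estimator explicit, the following sketch writes the GMM objective in standard notation. The symbols $u_i(\boldsymbol{\beta})$ and $\mathbf{W}_N$ are introduced here for illustration and are not the book's notation; $\boldsymbol{\beta} = (\beta_1, \boldsymbol{\beta}_2')'$.

```latex
% Residual function implied by each moment condition
u_i(\boldsymbol{\beta}) =
\begin{cases}
  y_{1i} - \exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2)
    & \text{additive error, (20.15)} \\[4pt]
  y_{1i}/\exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2) - 1
    & \text{multiplicative error, (20.16)}
\end{cases}

% GMM estimator: minimize a quadratic form in the sample moments
\widehat{\boldsymbol{\beta}}_{\mathrm{GMM}}
  = \arg\min_{\boldsymbol{\beta}}
    \left\{ \frac{1}{N}\sum_{i=1}^{N} \mathbf{z}_i u_i(\boldsymbol{\beta}) \right\}'
    \mathbf{W}_N
    \left\{ \frac{1}{N}\sum_{i=1}^{N} \mathbf{z}_i u_i(\boldsymbol{\beta}) \right\}
```

When the number of instruments equals the number of parameters, the sample moments can be set exactly to zero and the choice of $\mathbf{W}_N$ does not affect the estimate; with more instruments than parameters, one-step and two-step GMM differ in how $\mathbf{W}_N$ is chosen.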


Equations (20.11)-(20.13) do not imply either (20.15) or (20.16), so this 
less parametric approach will lead to an estimator that differs from that using 
the structural approach even in the limit as $N \to \infty$. This approach seems 
more appropriate especially when $y_{2i}$ is binary and therefore less likely to 
satisfy assumption (20.12). 


ivpoisson gmm command 


These nonlinear IV estimators can be implemented using the ivpoisson gmm 
command, which has basic syntax similar to the ivregress command, and 
the weight matrix options of the gmm command. The default is an additive 
error, while the multiplicative option is used for a multiplicative error. 
The default is for two-step estimation, while the onestep option is used for 
one-step GMM estimation. Two-step estimation is relevant for an 
overidentified model, and in that case the estat overid postestimation 
command performs an overidentifying restrictions test. The default is for 
ivpoisson to report heteroskedasticity-robust standard errors. 


The following Stata commands compute one-step and two-step 
estimators for the additive and multiplicative error models. For all 
estimators, we use the same instruments as before. 


. * ivpoisson gmm: (1)-(2) additive one and two step; (3)-(4) multiplicative
> one and two step
. qui ivpoisson gmm docvis $xlist2 (private=income ssiratio), onestep

. estimates store GMMadd1S

. qui ivpoisson gmm docvis $xlist2 (private=income ssiratio)

. estimates store GMMadd2S

. qui ivpoisson gmm docvis $xlist2 (private=income ssiratio), onestep multiplic

. estimates store GMMmult1S

. qui ivpoisson gmm docvis $xlist2 (private=income ssiratio), multiplicative

. estimates store GMMmult2S


The estimates are as follows: 


. * Table for GMM additive one step and two step and multiplicative
> one step and two step
. estimates table GMMadd1S GMMadd2S GMMmult1S GMMmult2S,
> b(%7.4f) se(%7.3f) stats(N) stfmt(%9.1f) modelwidth(9)

--------------------------------------------------------------
    Variable | GMMadd1S    GMMadd2S    GMMmult1S   GMMmult2S
-------------+------------------------------------------------
     private |    0.5921      0.5864      0.7912      0.7876
             |     0.340       0.341       0.280       0.279
    medicaid |    0.3187      0.2876      0.3290      0.3010
             |     0.191       0.191       0.110       0.109
         age |    0.3323      0.3496      0.3534      0.3698
             |     0.071       0.071       0.076       0.075
        age2 |   -0.0022     -0.0023     -0.0023     -0.0024
             |     0.000       0.000       0.001       0.000
      educyr |    0.0191      0.0174      0.0122      0.0113
             |     0.009       0.009       0.009       0.008
      actlim |    0.2085      0.1843      0.2042      0.1908
             |     0.043       0.042       0.043       0.042
      totchr |    0.2418      0.2461      0.2855      0.2893
             |     0.013       0.013       0.015       0.015
       _cons |  -11.8635    -12.5396    -12.7527    -13.3824
             |     2.733       2.738       2.876       2.836
-------------+------------------------------------------------
           N |      3677        3677        3677        3677
--------------------------------------------------------------
                                                  Legend: b/se


The results are qualitatively similar to preceding results. The one-step and 
two-step estimates are very similar with similar precision. The coefficient of 
the endogenous regressor private is larger, and is more precisely estimated, 
with a multiplicative error. 


The overidentifying restrictions test presented for the linear IV model in 
section 7.4.8 extends to nonlinear GMM models. For this example, following 
two-step GMM estimation of the multiplicative error model, we have 


. * Test of overidentifying restriction following two-step GMM 


. qui ivpoisson gmm docvis $XLISTEXOG (private=income ssiratio), multiplicative 


. estat overid 
Test of overidentifying restriction: 
Hansen’s J chi2(1) = .390236 (p = 0.5322) 


The test statistic is $\chi^2(1)$ distributed because there is one overidentifying 
restriction (income and ssiratio are instruments for private, and all other 
regressors are instruments for themselves). Because $p > 0.05$, we do not 
reject the null hypothesis and conclude that the overidentifying restriction is 
valid. 
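The reported $p$-value follows directly from the upper tail of the $\chi^2(1)$ distribution evaluated at Hansen's $J$ statistic. The one-line sketch below types in the statistic from the output above.

```stata
* Sketch: p-value of Hansen's J with 1 degree of freedom
display "p-value = " %6.4f chi2tail(1, .390236)
```

The degrees of freedom equal the number of instruments minus the number of parameters in the structural equation, here one, because two excluded instruments are available for a single endogenous regressor.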


The same models could be fit using the gmm command. For one-step 
estimation, for example, the commands are 


. * gmm command for additive and multiplicative one step 

. gmm (docvis - exp({xb:private $xlist2 _cons})), 

> instruments(income ssiratio $xlist2) onestep vce(robust) 
(output omitted ) 

. gmm ( (docvis / exp({xb:private $xlist2 _cons})) - 1), 

> instruments(income ssiratio $xlist2) onestep vce(robust) 
(output omitted ) 


The estimates and standard errors, not reported here, are identical to those 
obtained using the ivpoisson gmm command. 


ivpoisson cfunction command 


A variant of the preceding GMM approach with a multiplicative error 
introduces a relationship between the structural equation and first-stage 
model errors, leading to the addition of the predicted residual from the first- 
stage regression as an additional variable in moment condition (20.16). 


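Assume that the errors in (20.11) and (20.12) are related by 
$u_{1i} = \rho\varepsilon_i + \eta_i$, as in (20.13), where $\eta_i$ is independent. Then the moment 
condition becomes 


$$E[\mathbf{z}_i\{y_{1i}/\exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol{\beta}_2 + \rho\varepsilon_i) - 1\}] = \mathbf{0} \tag{20.17}$$


where $\varepsilon_i$ is estimated by the residual from first-stage regression based on 
(20.12). This is implemented by replacing $\varepsilon_i$ in (20.17) with $\hat{\varepsilon}_i = y_{2i} - \mathbf{z}_i'\hat{\boldsymbol{\gamma}}$, 
where $\mathbf{z}_i = (\mathbf{x}_{1i}', \mathbf{x}_{2i}')'$, and adding a second moment condition 
$E\{\mathbf{z}_i(y_{2i} - \mathbf{z}_i'\boldsymbol{\gamma})\} = \mathbf{0}$ for the first-stage estimation. 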


ivpoisson cfunction provides one-step GMM estimates of this model. 
The term cfunction is used because it is a control function approach, but 
note that the control function approach is being applied to a different model 
than that in section 20.7.2. We obtain 


. * ivpoisson cfunction estimation for endogenous Poisson
. ivpoisson cfunction docvis medicaid age age2 educyr actlim totchr
> (private=income ssiratio), vce(robust)

Step 1
Iteration 0:  GMM criterion Q(b) =  .00243925
Iteration 1:  GMM criterion Q(b) =  1.982e-06
Iteration 2:  GMM criterion Q(b) =  1.137e-12
Iteration 3:  GMM criterion Q(b) =  5.235e-25
note: model is exactly identified.

Exponential mean model with endogenous regressors

Number of parameters =  18                      Number of obs     =      3,677
Number of moments    =  18
Initial weight matrix: Unadjusted
GMM weight matrix:     Robust

------------------------------------------------------------------------------
             |               Robust
      docvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
docvis       |
    medicaid |   .3158814   .1232552     2.56   0.010     .0743057    .5574571
         age |   .3507938   .0724955     4.84   0.000     .2087052    .4928823
        age2 |  -.0022849   .0004785    -4.78   0.000    -.0032227   -.0013471
      educyr |   .0145113   .0084008     1.73   0.084    -.0019538    .0309765
      actlim |    .215784   .0423114     5.10   0.000     .1328551    .2987128
      totchr |   .2770506   .0145983    18.98   0.000     .2484385    .3056627
     private |   .7078763   .2641085     2.68   0.007     .1902332    1.225519
       _cons |   -12.6812   2.756927    -4.60   0.000    -18.08468   -7.277722
-------------+----------------------------------------------------------------
private      |
    medicaid |  -.3934477    .017341   -22.69   0.000    -.4274355     -.35946
         age |  -.0831201   .0293374    -2.83   0.005    -.1406203   -.0256198
        age2 |   .0005257   .0001956     2.69   0.007     .0001423    .0009092
      educyr |   .0212523   .0020467    10.38   0.000     .0172408    .0252637
      actlim |  -.0300936   .0176658    -1.70   0.088    -.0647179    .0045306
      totchr |   .0185063    .005736     3.23   0.001      .007264    .0297487
      income |   .0027416    .000473     5.80   0.000     .0018145    .0036687
    ssiratio |  -.0647637   .0210919    -3.07   0.002    -.1061032   -.0234243
       _cons |   3.531058   1.094469     3.23   0.001     1.385939    5.676177
-------------+----------------------------------------------------------------
  /c_private |  -.5478738   .2655137    -2.06   0.039    -1.068271   -.0274764
------------------------------------------------------------------------------
Instrumented: private
Instruments:  medicaid age age2 educyr actlim totchr income ssiratio


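The coefficient of the endogenous regressor private is 0.708, closest to the 
multiplicative GMM estimates, and has standard error 0.264. This is smaller than 
the standard errors of the ivpoisson gmm estimates (0.279 to 0.341) and the 
bootstrap standard error of the two-step estimate (0.280), though larger than 
the standard error of 0.219 for the ML estimate. 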


20.7.4 Weak instruments 


If instruments are weakly correlated with the endogenous regressor, then the 
usual asymptotic theory may perform poorly. Then alternative methods may 
be used. Inference for weak instruments for the linear model is presented in 
section 7.7. 


The discussion there includes methods based on minimum distance 
estimation due to Magnusson (2010) that can be applied to a wide range of 
linear and nonlinear structural models. Magnusson’s article provides 
simulations and an application to a Poisson model with endogenous 
regressor and weak instruments using both his proposed methods for a 
structural model and weak instrument methods due to Stock and 
Wright (2000) for nonlinear GMM. 


20.8 Clustered data 


Section 13.9 presented in some detail various models and methods that can 
be applied when observations in the same cluster are correlated and 
observations in different clusters are uncorrelated. That discussion was 
illustrated using the Poisson model. Here we provide a brief summary that 
covers the broad range of count models, and not just the Poisson model. 


The simplest approach is to continue to use the poisson and nbreg 
commands but to use the vce(cluster clustvar) option. This maintains 
the assumption that the conditional mean for individual $i$ in cluster $g$ is 
$E(y_{ig}|\mathbf{x}_{ig}) = \exp(\mathbf{x}_{ig}'\boldsymbol{\beta})$, the essential condition for consistency of these 
estimators. But it provides corrected standard errors that adjust for the loss 
in precision that arises because of observations no longer being independent 
within cluster. As with any cluster-robust estimate of standard errors, it is 
assumed that there are many clusters. Similarly, the nonlinear IV methods 
for models with endogenous regressors can continue to be used, along with 
the vce(cluster clustvar) option. 


Potentially, more efficient estimates can be obtained by estimating using 
nonlinear feasible generalized least squares, assuming equicorrelation 
within cluster. For the Poisson model, this population-averaged approach 
uses the xtgee command with options family(poisson), link(log), and 
corr(exchangeable). It is good practice to add the vce(robust) option 
because this provides cluster-robust standard errors that guard against 
within-cluster correlation not being exactly one of equicorrelation. An 
equivalent command is xtpoisson, pa corr(exchangeable) 
vce(robust). One must first use the xtset command to define the cluster 
variable. Similarly, one can use the command xtnbreg, pa 
corr(exchangeable) vce(robust). 
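The pooled and population-averaged commands just described can be tried on simulated clustered counts. This is a minimal do-file sketch: the variable names (clid, a, x, y), the number of clusters, and the parameter values are all hypothetical and chosen only to make the example self-contained.

```stata
* Sketch: clustered count data (hypothetical names and parameter values)
clear
set seed 10101
set obs 200                          // 200 clusters
generate clid = _n
generate a = rnormal(0, 0.5)         // cluster-specific effect
expand 5                             // 5 observations per cluster
generate x = rnormal()
generate y = rpoisson(exp(0.5 + 0.3*x + a))

* Pooled Poisson with cluster-robust standard errors
poisson y x, vce(cluster clid)

* Population-averaged Poisson assuming equicorrelation within cluster
xtset clid
xtgee y x, family(poisson) link(log) corr(exchangeable) vce(robust)
xtpoisson y x, pa corr(exchangeable) vce(robust)
```

The last two commands should give the same estimates because they fit the same model; after xtset, vce(robust) is cluster-robust with clustering on clid.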


Another way to obtain potentially more efficient estimates is to 
introduce a random cluster-specific intercept, so 
$E(y_{ig}|\mathbf{x}_{ig}, \alpha_g) = \exp(\alpha_g + \mathbf{x}_{ig}'\boldsymbol{\beta})$, where $\alpha_g$ is an independent and 
identically distributed error. For the Poisson, a closed-form solution arises if 
$\exp(\alpha_g)$ is gamma distributed. Then one can use the xtpoisson, re 
command. Alternatively, $\alpha_g$ may be normally distributed, in which case one 
uses the xtpoisson, re normal command. For the NB model, the only 
option is for $\alpha_g$ to be normally distributed, in which case one uses the 
xtnbreg, re command. Note that for models with an exponential 
conditional mean, conditioning out the random effect again yields an 
exponential conditional mean because 

$$E(y_{ig}|\mathbf{x}_{ig}) = E\{\exp(\alpha_g)\} \times \exp(\mathbf{x}_{ig}'\boldsymbol{\beta}) = \exp[\ln E\{\exp(\alpha_g)\} + \mathbf{x}_{ig}'\boldsymbol{\beta}]$$

so only the intercept is changed. 


In general, estimator consistency in a random-effects (RE) model 
requires that the RE distribution be correctly specified. This requirement is 
not necessary for the Poisson with gamma-distributed random effects, 
though we need to use the vce(robust) option. 


The Poisson RE model with normally distributed random effects can be 
extended to cluster-specific random-slope models using the mepoisson 
command. 


FE estimation leads to consistent estimation when there are many 
observations within each cluster. The simplest approach in a regular count 
regression is to simply add i.clid as regressors, where clid denotes the 
cluster identifier variable. This brute-force approach becomes impractical 
when there are many clusters. Instead, more creative methods are used. For 
example, the community-contributed command poi2hdfe 
(Guimaraes 2014) can fit a Poisson model with many fixed effects in two 
dimensions. 


When there are few observations per cluster, consistent estimation of an 
FE model is generally no longer possible, because of the incidental-parameters 
problem; see section 22.2.1. It is possible for the Poisson model, using 
the xtpoisson, fe command. It is not possible for the NB model: recent 
literature has led to the view that a model initially proposed as an FE 
NB model is not really an FE model; see Guimaraes (2008). One approach 
due to Mundlak (1978) is to include the cluster means of regressors as 
additional regressors; see section 6.6.5 in the linear case. 
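The random-effects, fixed-effects, and Mundlak variants discussed above can be sketched on simulated clustered counts. As before, this is a do-file sketch under stated assumptions: the names clid, a, x, y, and xbar and all parameter values are hypothetical, and the random effect is generated independently of the regressor, so all estimators target the same slope of 0.3.

```stata
* Sketch: RE, FE, and Mundlak estimators (hypothetical names and values)
clear
set seed 10101
set obs 200                          // 200 clusters
generate clid = _n
generate a = rnormal(0, 0.5)         // cluster-specific intercept
expand 5                             // 5 observations per cluster
generate x = rnormal()
generate y = rpoisson(exp(0.5 + 0.3*x + a))
xtset clid

* Random effects: gamma, then normal
xtpoisson y x, re vce(robust)
xtpoisson y x, re normal

* Normal random intercept via mepoisson; listing x after the colon
* would add a cluster-specific random slope
mepoisson y x || clid:

* Conditional fixed effects
xtpoisson y x, fe vce(robust)

* Mundlak approach: add the cluster mean of x as a regressor
bysort clid: egen xbar = mean(x)
poisson y x xbar, vce(cluster clid)
```

With few observations per cluster, xtpoisson, fe remains usable because the fixed effects are conditioned out rather than estimated; clusters whose counts are all zero do not contribute to that estimator.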


The same issues arise with panel data. In that case, each individual is a 
cluster, and the observations for a given cluster are data for each time 
period that the individual is observed. A longer discussion for panel data on 
count outcomes is given in section 22.6. 


20.9 Quantile regression for count data 


Conditional quantile regression (QR) is usually applied to continuous-response 
data because the quantiles of discrete variables are not unique: 
the cumulative distribution function is discontinuous, with discrete 
jumps between flat sections. By convention, the lower boundary of the 
interval defines the quantile in such a case. 


Conditional QR for a continuous dependent variable has been extended to a 
special case of a discrete variable model, the count regression. The method, 
proposed by Machado and Santos Silva (2005), enables QR methods to be 
applied by suitably smoothing the count data. 


20.9.1 Quantile count regression 


The key step in the quantile count regression (QCR) model of Machado and 
Santos Silva is to replace the discrete count outcome $y$ with a continuous 
variable, $z = h(y)$, where $h(\cdot)$ is a smooth continuous transformation. The 
standard linear QR methods are then applied to $z$. Point and interval estimates 
are then retransformed to the original $y$ scale by using functions that 
preserve the quantile properties. 


The particular transformation used is 

$$z = y + u$$


where $u \sim U(0, 1)$ is a pseudorandom draw from the uniform distribution on 
$(0, 1)$. This step is called "jittering" the count. 


Because counts are nonnegative, conventional count models are based on 
an exponential model for the conditional mean, $\exp(\mathbf{x}'\boldsymbol{\beta})$, rather than a 
linear function $\mathbf{x}'\boldsymbol{\beta}$. Let $Q_q(y|\mathbf{x})$ and $Q_q(z|\mathbf{x})$ denote the $q$th quantiles of the 
conditional distributions of $y$ and $z$, respectively. Then, to allow for the 
exponentiation, we specify the conditional quantile for $Q_q(z|\mathbf{x})$ as 


$$Q_q(z|\mathbf{x}) = q + \exp(\mathbf{x}'\boldsymbol{\beta}_q) \tag{20.18}$$


The additional term $q$ appears in the equation because $Q_q(z|\mathbf{x})$ is bounded 
from below by $q$ because of the jittering operation. 


To estimate the parameters of a quantile model in the usual linear form 
$\mathbf{x}'\boldsymbol{\beta}$, we apply a log transformation so that $\ln(z - q)$ is modeled, with the 
adjustment that if $z - q < 0$, then we use $\ln(\varepsilon)$, where $\varepsilon$ is a small positive 
number. The transformation is justified by the property that, in a correctly 
specified model, quantiles are equivariant to monotonic transformation (see 
section 15.2.1) and by the property that quantiles above the censoring point 
are not affected by censoring from below. Postestimation transformation of 
the $z$ quantiles back to $y$ quantiles uses the ceiling function, with 


$$Q_q(y|\mathbf{x}) = \lceil Q_q(z|\mathbf{x}) - 1 \rceil \tag{20.19}$$


where the symbol $\lceil r \rceil$ in the right-hand side of (20.19) denotes the smallest 
integer greater than or equal to $r$. For example, $\lceil 2.4 \rceil = 3$. 


To reduce the effect of noise due to jittering, we estimate the parameters 
of the model multiple times using independent draws from the $U(0, 1)$ 
distribution and average the multiple estimated coefficients and confidence 
interval endpoints. Hence, the estimates of the quantiles of $y$ counts are 
based on $\hat{Q}_q(y|\mathbf{x}) = \lceil \hat{Q}_q(z|\mathbf{x}) - 1 \rceil = \lceil q + \exp(\mathbf{x}'\bar{\boldsymbol{\beta}}_q) - 1 \rceil$, where $\bar{\boldsymbol{\beta}}_q$ 
denotes the average over the jittered replications. 
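The jitter, transform, estimate, and average steps can be written out by hand with official Stata commands. The do-file sketch below is an illustration of the idea only and is not the implementation used by any community-contributed command: the data are simulated, the names x, y, z, lnz, bsum, and bavg are hypothetical, the small positive number is set to 1e-5 as an assumption, and only coefficients (not standard errors) are averaged.

```stata
* Sketch: manual jittered quantile count regression (hypothetical setup)
clear
set seed 10101
set obs 1000
generate x = rnormal()
generate y = rpoisson(exp(1 + 0.5*x))

local q = 0.5                        // quantile of interest
local reps = 20                      // number of jittered samples
matrix bsum = J(1, 2, 0)             // running sum of (slope, intercept)

forvalues r = 1/`reps' {
    quietly {
        generate double z = y + runiform()          // jitter the count
        * ln(z - q), replaced by ln(1e-5) when z - q is not above 1e-5
        generate double lnz = cond(z - `q' > 1e-5, ln(z - `q'), ln(1e-5))
        qreg lnz x, quantile(`q')                   // linear QR on transformed z
        matrix bsum = bsum + e(b)
        drop z lnz
    }
}
matrix bavg = bsum/`reps'            // average coefficients over replications
matrix list bavg

* Fitted quantile of the count at x = 0, using (20.18) and (20.19)
display "Q_q(y|x=0) = " ceil(`q' + exp(bavg[1,2]) - 1)
```

Averaging over more jittered samples reduces the simulation noise in the coefficients at the cost of proportionally more computation.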


20.9.2 The qcount command 


The QCR method of Machado and Santos Silva (2005) can be performed by 
using the community-contributed qcount command (Miranda 2006). The 
command syntax is 


qcount depvar [varlist] [if] [in], quantile(number) [repetition(integer)] 


where quantile(number) specifies the quantile to be estimated and 
repetition(integer) specifies the number of jittered samples to be used to 
calculate the parameters of the model with the default value being 1,000. 
The postestimation command qcount_mfx computes MEs for the model, 
evaluated at the means of the regressors. 


For example, qcount y x1 x2, q(0.5) rep(500) estimates a median 
regression of the count y on x1 and x2 with 500 repetitions. The subsequent 
command qcount_mfx gives the associated MEs. 


20.9.3 Doctor-visits data 


We illustrate these commands using a dataset on the annual number of 
doctor visits (docvis) by the Medicare elderly in the year 2003. The dataset 
covers the same individuals as in the doctor-visits dataset used earlier in the 
chapter, though with nonoverlapping sets of variables. 


The regressors used here are an indicator for having private insurance 
that supplements Medicare (private), number of chronic conditions 
(totchr), age in years (age), and indicators for female and white. 


We have 


. * Summary statistics for doctor-visits data 
. qui use mus220meps2003qr, clear 


. summarize docvis private totchr age female white, separator(0)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      docvis |      3,677    6.822682    7.394937          0        144
     private |      3,677    .4966005    .5000564          0          1
      totchr |      3,677    1.843351    1.350026          0          8
         age |      3,677    74.24476    6.376638         65         90
      female |      3,677    .6010335    .4897525          0          1
       white |      3,677    .9709002    .1681092          0          1


The dependent variable, annual number of doctor visits (docvis), is a count. 
The frequency distribution, not presented, shows that the median number of 
visits is only 5, but there is a long right tail. Around 0.5% of individuals 
have over 40 visits, and the maximum value is 144. 


To demonstrate the smoothing effect of jittering on a count variable, we 
create the variable docvisu, which is obtained for each individual by adding 
a random uniform variate to the docvis variable. We then compare the 
quantile plot of the smoothed docvisu with that for the discrete count 
docvis. For graph readability, values in excess of 40 were dropped. We have 


. * QR: Generate jittered values and compare quantile plots 
. set seed 10101 


. generate docvisu = docvis + runiform() 


. qui qplot docvis if docvis < 40, recast(line) lwidth(medthick) 
> ytitle("Quantiles of doctor visits") saving(graph1, replace) 


. qui qplot docvisu if docvis < 40, recast(line) lwidth(medthick) 
> ytitle("Quantiles of jittered doctor visits") saving(graph2, replace) 


. graph combine graph1.gph graph2.gph, iscale(1.3) rows(1) 
> ycommon xcommon ysize(2.5) xsize(6) 


Figure 20.4 shows the step function for the quantile plot of the original 
count in the first panel and a much smoother quantile plot for the jittered 
data. 

Figure 20.4. Quantile plots of count docvis (left) and its jittered 
transform (right) 


20.9.4 Results for NB 


The common starting point for regression analysis of counts is Poisson or NB 
regression. We use the latter and simply print out the MEs on the 
conditional mean of a change in each regressor, evaluated at sample means 
of the regressors. 


. * QCR: MEs from conventional negative binomial model 


. qui nbreg docvis private totchr age female white, vce(robust) 


. margins, dydx(*) atmeans noatlegend

Conditional marginal effects                            Number of obs =  3,677
Model VCE: Robust

Expression: Predicted number of events, predict()
dy/dx wrt:  private totchr age female white

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   1.080582   .2137656     5.05   0.000     .6616087    1.499555
      totchr |   1.885011   .0771011    24.45   0.000     1.733896    2.036127
         age |   .0340016   .0176656     1.92   0.054    -.0006224    .0686255
      female |  -.1398282   .2169893    -0.64   0.519    -.5651194    .2854631
       white |   .5095404   .6164653     0.83   0.408    -.6987094     1.71779
------------------------------------------------------------------------------


The MEs are computed at the mean, and factor-variable notation is not used 
to allow direct comparison with the qcount command, which does not 
support factor-variable notation and reports only the MEM. 


20.9.5 Results for QCR 


We estimate the parameters of the QCR model at the conditional median. We 
obtain 


. * QCR for the median 
. set seed 10101 


. qcount docvis private totchr age female white, q(0.50) rep(500) 


Count Data Quantile Regression
( Quantile 0.50 )
                                                Number of obs        =   3677
                                                No. jittered samples =    500

------------------------------------------------------------------------------
      docvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     private |   .2025494   .0410296     4.94   0.000     .1221328     .282966
      totchr |   .3463796   .0184165    18.81   0.000     .3102838    .3824753
         age |   .0083959   .0033845     2.48   0.013     .0017625    .0150294
      female |   .0022749   .0412377     0.06   0.956    -.0785495    .0830993
       white |   .1195614   .0997216     1.20   0.231    -.0758895    .3150122
       _cons |   .0372709   .2526735     0.15   0.883    -.4579601    .5325019
------------------------------------------------------------------------------


The statistically significant regressors have the expected signs. The 
estimated model uses an exponential functional form for the 
conditional quantile, so, for example, one more year of aging is associated 
with a 0.84% increase in $\exp(\mathbf{x}'\boldsymbol{\beta}_q)$, the component of the conditional median in (20.18). 


To interpret results, it is easier to use the MEs. The qcount_mfx 
command gives two sets of MEMs after conditional QR. The first is for the 
jittered variable, $Q_q(z|\mathbf{x})$, and the second is for the original 
count, $Q_q(y|\mathbf{x})$. We have 


. * QCR: MEs after QCR for the median 
. qcount_mfx 

Marginal effects after qcount 
      y = Qz(0.50|X) 
        = 5.05925 (0.0975) 

                 ME   Std. Err.       z    P>|z|    [  95% C.I  ]       X 
 private  .92569098   .18616845    4.97   0.0000   0.5608  1.2906    0.50 
  totchr  1.5792327   .07969824    19.8   0.0000   1.4230  1.7354    1.84 
     age  .03827918   .01532564     2.5   0.0125   0.0082  0.0683   74.24 
  female  .01036951    .1879358   .0552   0.9560  -0.3580  0.3787    0.60 
   white  .51557508   .40793735    1.26   0.2063  -0.2840  1.3151    0.97 

Marginal effects after qcount 
      y = Qy(0.50|X) 
        = 5 

          ME   [95% C. Set]       X 
 private   0     0     1       0.50 
  totchr   1     1     1       1.84 
     age   0     0     0      74.24 
  female   0    -1     0       0.60 
   white   0    -1     1       0.97 


The first set of output gives the estimated MEMs for the conditional median 
$Q_{0.5}(z|\mathbf{x})$ of the jittered variable defined in (20.18). These differ 
by around 20% from those for the conditional mean aside from a much greater 
change for the quite statistically insignificant regressor female. The 
standard errors of the MEMs are of similar magnitude. The second set of 
output gives the estimated MEMs for the conditional quantile of the original 
discrete count variable $Q_{0.5}(y|\mathbf{x})$ defined in (20.19). These are 
discretized because the count variable is discrete, and only that for totchr 
differs from zero. We note in passing that if we fit the model using qreg 
rather than qcount, then the estimated coefficients are 1 for private, 2 for 
totchr, and 0 for the other three regressors, and all standard errors are 0. 
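
As a rough check on the first panel, the MEMs of the continuous regressors can be recovered from the coefficient table. The sketch below assumes that the jittered-quantile specification takes the form $Q_q(z|\mathbf{x}) = q + \exp(\mathbf{x}'\boldsymbol{\beta})$, so that the derivative with respect to $x_j$ is $\beta_j\{Q_q(z|\mathbf{x}) - q\}$; this functional form is our assumption about (20.18), and the scalar names are hypothetical. All numbers are typed in from the printed output.

```stata
* Sketch: MEM of a continuous regressor = b_j * (Qz - q),
* assuming Qz(q|x) = q + exp(x'b); inputs copied from the output above
scalar qzfit = 5.05925        // fitted Qz(0.50|X) at the sample means
scalar qtl   = 0.50           // quantile being modeled
display "totchr MEM: " .3463796*(qzfit - qtl)
display "age MEM:    " .0083959*(qzfit - qtl)
```

Under this assumption, the two products agree with the reported MEMs for totchr and age up to rounding. Applying the same product to the binary regressors does not reproduce the reported values exactly (for private it gives about 0.923 rather than 0.926).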


QCR at the 75th quantile 


The qcount command allows one to study the impact of a regressor at 
different points in the distribution. To explore this point, we reestimate with 
q = 0.75. We have 


. * QCR: MEs after QCR for q = 0.75 
. set seed 10101 

. qui qcount docvis private totchr age female white, q(0.75) rep(500) 

. qcount_mfx 

Marginal effects after qcount 
      y = Qz(0.75|X) 
        = 9.06615 (0.1603) 

                 ME   Std. Err.       z    P>|z|    [  95% C.I  ]       X 
 private  1.2309624   .33176595    3.71   0.0002   0.5807  1.8812    0.50 
  totchr  2.3250783   .13360258    17.4   0.0000   2.0632  2.5869    1.84 
     age  .02662164   .02552073    1.04   0.2969  -0.0234  0.0766   74.24 
  female -.0089293    .32846865  -.0272   0.9783  -0.6527  0.6349    0.60 
   white  1.1806242   .80458387    1.47   0.1423  -0.3964  2.7576    0.97 

Marginal effects after qcount 
      y = Qy(0.75|X) 
        = 9 

          ME   [95% C. Set]       X 
 private   1     0     1       0.50 
  totchr   2     2     2       1.84 
     age   0     0     0      74.24 
  female   0    -1     0       0.60 
   white   1    -1     2       0.97 


For the highly statistically significant regressors, private and totchr, the 
MEMs are 30-50% higher than those for the conditional median. As expected, 
there is less precision at the 75th quantile than at the median, and the 
standard errors of the MEMs are 60-100% higher. 
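
The percentage comparisons can be checked directly from the two qcount_mfx listings. A minimal sketch, with all numbers typed in from the output above:

```stata
* Sketch: ratios of q = 0.75 results to q = 0.50 results
display "private MEM ratio: " 1.2309624/.92569098
display "totchr  MEM ratio: " 2.3250783/1.5792327
display "private SE ratio:  " .33176595/.18616845
display "totchr  SE ratio:  " .13360258/.07969824
```

A ratio of 1.3, for example, corresponds to a value that is 30% higher at the 75th quantile than at the median.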


20.10 Additional resources 


The single-equation Stata commands [R] poisson and [R] nbreg (for nbreg 
and gnbreg) cover the basic count regression. See also [R] poisson 
postestimation and [R] nbreg postestimation for guidance on testing 
hypotheses and calculating MEs. For zero-inflated and truncated models, see 
[R] zip, [R] zinb, [R] tpoisson, and [R] tnbreg. For Poisson regression with 
endogenous treatment effects, see [TE] etpoisson. 


For fitting hurdle models, the community-contributed hplogit and 
hnblogit commands are relevant. The community-contributed prvalue, 
prcounts, and countfit commands are useful for model evaluation and 
comparison. Stata 15 introduced the fmm prefix, which supports finite 
mixture models of count data as well as variants like zero-inflated models. 
Estimation of some additional parametric count models using the gsem 
command is presented in section 23.6. The community-contributed qcount 
command enables QCR. For clustered and panel count-data analysis, the 
basic commands [XT] xtpoisson and [XT] xtnbreg are covered in 
chapter 22. Deb and Trivedi (2006) provide the mtreatnb command for 
estimating the parameters of a treatment-effects model that can be used to 
analyze the effects of an endogenous multinomial treatment (when one 
treatment is chosen from a set of more than two choices) on a nonnegative 
integer-valued outcome modeled using the NB regression. 


Cameron and Trivedi (2013) provide a detailed analysis of count 
regression, and that book’s website 
(http://cameron.econ.ucdavis.edu/racd2/) provides additional examples of 
Stata code for count regression. 


20.11 Exercises 


1. Consider the Poisson distribution with $\mu = 2$ and a multiplicative 
mean-preserving lognormal heterogeneity with a variance of 0.25. 
Using the pseudorandom generators for Poisson and lognormal 
distributions and following the approach used for generating a 
simulated sample from the NB2 distribution, generate a draw from the 
Poisson-lognormal mixture distribution. Following the approach of 
section 20.2.2, generate another sample with a mean-preserving 
gamma distribution with a variance of 0.25. Using the summarize, 
detail command, compare the quantiles of the two samples. Which 
distribution has a thicker right tail? Repeat this exercise for a count- 
data regression with the conditional mean function 
$\mu(x) = \exp(1 + 1x)$, where $x$ is an exogenous variable generated as a 
draw from the uniform(0, 1) distribution. 

2. For each regression sample generated in the previous exercise, 
estimate the parameters of the NB2 model. Compare the goodness of fit 
of the NB2 model in the two cases. Which of the two datasets is better 
explained by the NB2 model? Can you explain the outcome? 

3. Suppose it is suggested that the use of the tpoisson command to 
estimate the parameters of the ZTP model is unnecessary. Instead, 
simply subtract 1 from all the counts $y$, replacing them with 
$y^* = y - 1$, and then apply the regular Poisson model using the new 
dependent variable $y^*$; $E(y^*) = E(y) - 1$. Using generated data from 
Poisson$\{\mu(x) = 1 + x\}$, $x$ = uniform(0, 1), verify whether this 
method is equivalent to the tpoisson command. 

4. Using the finite-mixture prefix (fmm), fit two- and three-component NB2 
mixture models for the univariate (intercept only) version of the 
docvis model. Use the BIC to select the “better” model. For the 
selected model, use the estat lcmean command to compute the means 
of the m components. Explain and interpret the estimates of the 
component means and the estimates of the mixing fractions. Is the 
identification of the two and three components robust? Explain your 
answer. 

5. Continuing with the previous example of a two-component mixture, 
use the command estat lcprob to generate the latent class posterior 
probability distributions. Use histograms to compare the distributions. 
Verify that the means of these two component distributions equal 
the respective mixture proportions. 

6. For this exercise, use the data from section 20.6. Estimate the 
parameters of the Poisson and ZIP models using the same covariates as 
in section 20.6. Test whether there is statistically significant 
improvement in the log likelihood. Which model has a better BIC? 
Contrast this outcome with that for the NB2/ZINB pair and rationalize the 
outcome. 

7. Consider the data application in section 20.7.2. Drop all observations 
for which the medicaid variable equals one, and therefore also drop 
medicaid as a covariate in the regression. For this reduced sample, 
estimate the parameters of the Poisson model, treating the private 
variable first as exogenous and then as endogenous. Obtain and 
compare the two estimates of the ME of private on docvis. Implement 
the test for endogeneity given in section 20.7.2. 

8. Use the data from section 20.9, except let the dependent count variable 
be totchr, and drop totchr from the regressors. Using the 
community-contributed qcount command, estimate qcount regressions 
for q = 0.25, 0.50, and 0.75, and use qcount_mfx to calculate the MEs. 
Store and print the results in tabular form. Explain how this command 
works in the case of qcount regressions and whether there are any 
differences in the interpretation compared with the standard Poisson 
regression. 


Chapter 21 
Survival analysis for duration data 


21.1 Introduction 


Duration data or survival data or failure data are data on a variable that 
measures the length of time spent in a state before transition to another 
state. An economics example is the length of time spent unemployed before 
transition to a job or withdrawal from the workforce. A biomedical example 
is the number of years a person survives after joining a medical study. A 
quality control example is the number of hours a light bulb functions before 
failure. 


If spells are completely observed, then the regression methods of the 
previous chapters are directly applicable. A conditional mean approach 
specifies an exponential form for the conditional mean because durations 
are strictly positive. From stochastic process theory, a natural starting point 
is the exponential distribution. There is no explicit exponential estimation 
command in Stata. But exponential regression falls in the generalized linear 
models class, and the command glm y x, family(gamma) scale(1) 
link(log) vce(robust) for duration data on completed spells is analogous 
to the command poisson y x, vce(robust) for count data. A fully 
parametric approach considers more flexible distributions than the 
exponential, such as the Weibull and Gompertz distributions. This is 
analogous to using the negative binomial distribution rather than the 
Poisson for count data. 
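
To illustrate the glm command just described, the following minimal sketch fits the exponential regression to simulated completed spells. The data-generating process, with conditional mean exp(1 + x), and the variable names are assumptions made for illustration only; they are not taken from this chapter's dataset.

```stata
* Sketch on simulated data: exponential regression for completed spells,
* assuming E(y|x) = exp(1 + x)
clear
set seed 10101
set obs 2000
generate x = runiform()
* Inverse-c.d.f. draw: -mu*ln(U) is exponentially distributed with mean mu
generate y = -exp(1 + x)*ln(runiform())
glm y x, family(gamma) link(log) scale(1) vce(robust) nolog
```

With this design, both the intercept and slope estimates should be close to 1. The sketch applies only to completed spells; it makes no allowance for censoring.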


A major complication for duration data is that a range of commonly 
used sampling schemes leads to the spell length being incompletely 
observed, so the dependent variable is censored. For example, in 
biostatistics analysis of survival following treatment or onset of a medical 
condition, if individuals are followed for five years, then not all spells are 
complete if some individuals survive more than five years or if some 
individuals leave the study. A fully parametric approach, analogous to a 
tobit model with right-censoring, has the limitation of heavy reliance on 
distributional assumptions. 


Instead, the biostatistics literature emphasizes a semiparametric 
approach, the Cox proportional hazards (PH) model, that is uniquely 
applicable to duration data. This approach brings in terms and concepts not 
used in the analysis of other types of data, such as the survivor function, 
hazard function, and cumulative hazard function. The methods are 
sufficiently specialized that Stata provides a separate set of commands that 
begin with the letters st (for survival time). 


Additionally, an important issue that often arises in microeconometrics 
applications with duration data is distinguishing between the roles of 
individual heterogeneity and duration dependence in determining spell 
length. For example, we may observe that individuals with longer 
unemployment spells are less likely to exit unemployment in the next 
period. One reason for this phenomenon is individual heterogeneity in 
ability to be reemployed because of individual variation in, for example, job 
skills. An alternative reason is that being unemployed for a long time by 
itself harms the probability of reemployment. These two explanations have 
quite different policy implications but are difficult to disentangle. 


This chapter provides an introductory treatment of these special features 
of duration data analysis for the simplest case of randomly censored 
duration data where only one spell is observed per individual and, in most 
examples, individual regressors are constant throughout the spell. We begin 
with the popular Cox PH model, which is a semiparametric model that 
focuses on the role of individual observed heterogeneity. We then present 
maximum likelihood (ML) estimation of a range of parametric regression 
models that differ in their specification of the roles of observed individual 
heterogeneity and duration dependence. The chapter concludes with the 
discrete-time hazards binary outcome model that, like the Cox model, relies 
on weaker assumptions than ML estimation. 


21.2 Data and data summary 


The example for this chapter models the length of time without a job, where 
individuals are observed at the beginning of the spell but are not necessarily 
observed for the full length of the spell. 


21.2.1 Summary statistics 


The data on the length of jobless spells are those analyzed in Cameron and 
Trivedi (2005, 603-608) and originally analyzed by McCall (1996), who 
provides complete details. 


The dependent variable spell is the number of periods jobless 
(measured in two-week intervals). 


The spells are censored from above. A spell is considered to be complete 
when a person moves to a full-time job; otherwise, the spell is censored. The 
binary variable censor1 equals 1 if a person becomes reemployed at a full- 
time job (so is uncensored) and equals 0 (so is censored) otherwise. 


The dataset includes many regressors. We focus on two regressors, an 
indicator variable for whether the person filed an unemployment insurance 
claim (variable ui) and log weekly earnings before the jobless spell (variable 
logwage). These regressors do not vary through the spell. 


We begin with variable description and summary statistics. 


. * Unemployment spell length: Describe and summarize key variables 
. qui use mus221mccall, clear 

. describe spell censor1 ui logwage 

Variable      Storage   Display    Value 
    name         type    format    label      Variable label 
------------------------------------------------------------------------------ 
spell           int     %8.0g                 Periods jobless: two-week intervals 
censor1         int     %8.0g                 Reemployed at full-time job 
ui              int     %8.0g                 Filed UI claim 
logwage         float   %8.0g                 log weekly earnings 

. summarize spell censor1 ui logwage 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
       spell |      3,343    6.247981    5.611271          1         28 
     censor1 |      3,343    .3209692    .4669188          0          1 
          ui |      3,343    .5527969    .4972791          0          1 
     logwage |      3,343    5.692994    .5356591    2.70805   7.600402 


There are observations on 3,343 individuals. Only 32% of spells are 
observed to completion; the remaining spells are censored. The mean spell 
length, averaged over individuals with either complete or incomplete spells, 
is 6.2 periods. This most likely understates the mean length of completed 
spells because so many spells are not observed to completion. A bit more 
than one-half of the individuals filed an unemployment insurance claim. 


It is helpful to list the first few observations. 


. * List the first six observations of key variables 
. list spell censor1 ui logwage in 1/6, clean 
  (output omitted) 


The first four spells were complete and of lengths, respectively, 5, 13, 21, 
and 3 periods. The 5th and 6th spells were incomplete, so they lasted at least 
9 and 11 periods, respectively. 


21.2.2 The stset command 


Stata commands for duration analysis begin with st and are called st 
(survival-time) commands. The help st command provides a complete list 
of the various commands. 


To use these commands, one must first declare the data to be survival 
data using the stset command. At a minimum, the dependent variable needs 
to be identified. Additionally, if data are censored, the censoring variable 
needs to be identified. 


Data may be single-record data, with one observation per individual, or 
multiple-record data, with more than one observation per individual. And 
data may be single-failure data, with at most one failure per record, or 
multiple-failure data. The current example is one of single-record, single- 
failure data. 


For single-spell data, the stset command has syntax 
stset timevar [if] [weight] [, options] 


The main option is the failure() option, which declares the censoring 
variable, if there is censoring. The st commands call the completion of a 
spell a failure. This uses the language of biostatistics, where a spell may end 
with death, or of quality control, where a spell may end with a light bulb 
blowing. In many economics contexts, such as the current one, the ending of 
a spell is actually a success. 


In the simplest case, the spell begins, and the subject is at risk at time 0. 
The origin() option defines when a subject becomes at risk, the enter() 
option specifies when a subject first enters the study, and the exit() option 
specifies when a subject exits the study. 


Once the dependent variable and, if applicable, the censoring variable are 
declared, there is no need to include them in subsequent commands. For 
example, for parametric regression of y on x with censored data, the 
command is streg x rather than reg y x. 


For the current data, we have 


. * Command stset defines the dependent and censoring variables 
. stset spell, fail(censor1=1) 

Survival-time data settings 

         Failure event: censor1==1 
Observed time interval: (0, spell] 
     Exit on or before: failure 

-------------------------------------------------------------------------- 
      3,343  total observations 
          0  exclusions 
-------------------------------------------------------------------------- 
      3,343  observations remaining, representing 
      1,073  failures in single-record/single-failure data 
     20,887  total analysis time at risk and under observation 
                                                At risk from t =         0 
                                     Earliest observed entry t =         0 
                                          Last observed exit t =        28 


The option fail(censor1=1) means that observations on spell will be 
viewed as uncensored (a failure) if censor1 equals one and otherwise will 
be viewed as censored. Note that missing values of censor1 will then be 
considered censored. 


There are 1,073 complete spells (a “failure” in Stata terminology) out of 
3,343, so a proportion 1073/3343 = 0.321 of spells are complete, as was 
given in the original summary statistics. The total analysis time at risk is the 
sum over individuals of the spell length for each individual. This equals the 
number of individuals times the average spell length, here 
3,343 × 6.248 = 20,887. 
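
The internal variables that stset creates can be inspected on a small example. The sketch below uses a six-observation toy dataset patterned on the first six spells described above (four complete, two censored); it is meant only to show the mechanics, not to reproduce the chapter's results.

```stata
* Sketch on toy data: variables created by stset and total time at risk
clear
input spell censor1
 5 1
13 1
21 1
 3 1
 9 0
11 0
end
stset spell, failure(censor1==1)
* _t0 = entry time, _t = exit time, _d = failure indicator, _st = in sample
list spell censor1 _t0 _t _d _st, clean
* Total time at risk is the sum of _t, or equivalently N times the mean
quietly summarize _t
display "time at risk = " r(sum) "   N x mean = " r(N)*r(mean)
```

Here the total analysis time is 5 + 13 + 21 + 3 + 9 + 11 = 62 periods, and four of the six observations are recorded as failures.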


21.2.3 Survival data organization 


The stdescribe command can be used to obtain additional information. 


. * Survival description of dataset 
. stdescribe 

        Failure _d: censor1==1 
  Analysis time _t: spell 

                                   |-------------- Per subject --------------| 
Category                   Total        Mean         Min     Median        Max 
------------------------------------------------------------------------------ 
Number of subjects          3343 
Number of records           3343           1           1          1          1 

Entry time (first)                         0           0          0          0 
Exit time (final)                   6.247981           1          5         28 

Subjects with gap              0 
Time on gap                    0 
Time at risk               20887    6.247981           1          5         28 

Failures                    1073    .3209692           0          0          1 


There is exactly one spell for each individual. There are no gaps in time in 
the data. The spell lengths range from 1 period to 28 periods. 


The st commands allow multiple spells per individual. In this 
introductory treatment, we consider only single-spell data. 


The stvary command shows whether regressors vary during the spell 
and whether they are missing in some spells. This is applicable for use with 
multirecord data. We illustrate the command, even though here the data are 
single-spell data. 


. * Variation in regressors over time - relevant for multiple-record data 
. stvary ui logwage 

        Failure _d: censor1==1 
  Analysis time _t: spell 

               Subjects for whom the variable is 
                                             never     always   sometimes 
    Variable |  constant    varying        missing    missing     missing 
-------------+----------------------------------------------------------- 
          ui |      3343          0           3343          0           0 
     logwage |      3343          0           3343          0           0 


There are no missing values for the two regressors. 


The stsum command provides summary statistics for the dependent 
variable. Here we use the by() option to additionally provide these statistics 
by unemployment insurance status. 


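. * Summary of survival data by insurance status 
. stsum, by(ui) 

        Failure _d: censor1==1 
  Analysis time _t: spell 

         |               Incidence     Number of   |------ Survival time -----| 
      ui | Time at risk       rate      subjects        25%       50%       75% 
---------+--------------------------------------------------------------------- 
       0 |        6,135   .0938875          1495          2         9         . 
       1 |       14,752   .0336903          1848          9        20         . 
---------+--------------------------------------------------------------------- 
   Total |       20,887   .0513717          3343          5        15         . 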


Considering the final total row, we see there were 20,887 periods at risk. 
From the stdescribe command, there were 1,073 failures, so the incidence 
rate is 1073/20887 = 0.051. On average in each period, 5.1% of the 
ongoing spells resulted in failure. 
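
The same identity, incidence rate = failures / time at risk, can be turned around to recover the number of failures in each insurance group. A minimal sketch, with all numbers typed in from the stsum and stdescribe output:

```stata
* Sketch: incidence rate and implied failures by ui status
display "overall rate  = " 1073/20887
* Implied failures = incidence rate x time at risk
display "failures ui=0 = " .0938875*6135
display "failures ui=1 = " .0336903*14752
```

The two implied counts are approximately 576 and 497, which sum to the 1,073 failures reported for the full sample.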


The median survival time of 15 periods is computed using methods 
detailed in the next section. Because of the extent of censoring and 
consequent incomplete spells, the 75th percentile of survival time cannot be 
computed. 


Comparing these statistics by unemployment insurance status, we clearly 
see that individuals who filed an unemployment insurance claim (ui=1) are 
much less likely to exit to a full-time job. 


21.3 Survivor and hazard functions 


The graphs commonly used to display the distribution of a single variable 
are histograms and kernel density estimates. These are not as useful when 
the distribution is not completely observed because of censoring, here 
censoring from above. 


Instead, the standard graphs for censored duration data are graphs of the 
survivor function, cumulative hazard function, and hazard function. These 
are presented in this section. 


21.3.1 Densities for complete and incomplete spells 


For completeness, we begin with histograms and kernel density estimates 
for the dependent variable, partitioned by whether spells are complete or 
incomplete. 


. * Graph histogram and density of survival data by completion status 
. qui graph twoway (hist spell if censor1==1) 
>     (kdensity spell if censor1==1, lwidth(thick) lstyle(p1)), 
>     legend(pos(1) col(1) ring(0)) title("Completed spells") 

. qui graph twoway (hist spell if censor1==0) 
>     (kdensity spell if censor1==0, lwidth(thick) lstyle(p1)), 
>     legend(pos(1) col(1) ring(0)) title("Incomplete spells") 
[Figure panels: "Completed spells" and "Incomplete spells"; each shows a histogram (Density) with the kdensity of spell overlaid] 


Figure 21.1. Histograms for complete and incomplete spells 


Figure 21.1 presents the graph. Shorter spells are more common than 
longer spells and, compared with the incomplete spells, the completed 
spells show a greater proportion of spells that are short. 


21.3.2 Survivor function 


The standard notation in the survival literature is to denote the dependent 
random variable by T, for time, rather than the usual Y. The survivor 
function is 


S(t) = Pr(T >t) 


and is equal to one minus the cumulative distribution function (c.d.f.). 


The standard estimate of S(t) is the Kaplan-Meier nonparametric 
estimate of the survivor function. Command sts list provides and lists this 
estimate of the survivor function. 


. * Compute survivor function 
. sts list 

        Failure _d: censor1==1 
  Analysis time _t: spell 

Kaplan-Meier survivor function 

             At                  Survivor      Std. 
  Time     risk   Fail   Lost    function     error     [95% conf. int.] 
------------------------------------------------------------------------ 
     1     3343    294    246      0.9121    0.0049     0.9019    0.9212 
     2     2803    178    304      0.8541    0.0062     0.8415    0.8659 
     3     2321    119    305      0.8103    0.0071     0.7960    0.8238 
     4     1897     56    165      0.7864    0.0076     0.7712    0.8008 
     5     1676    104    233      0.7376    0.0085     0.7206    0.7538 
     6     1339     32    111      0.7200    0.0088     0.7023    0.7369 
     7     1196     85    178      0.6688    0.0098     0.6492    0.6876 
     8      933     15     70      0.6581    0.0100     0.6380    0.6773 
     9      848     33     98      0.6325    0.0106     0.6113    0.6528 
    10      717      3     55      0.6298    0.0106     0.6086    0.6503 
    11      659     26     77      0.6050    0.0113     0.5825    0.6267 
    12      556      7     40      0.5974    0.0115     0.5744    0.6195 
    13      509     25     69      0.5680    0.0123     0.5434    0.5918 
    14      415     30     74      0.5270    0.0135     0.5001    0.5531 
    15      311     19     40      0.4948    0.0146     0.4658    0.5230 
    16      252     10     41      0.4751    0.0153     0.4449    0.5047 
    17      201      8     24      0.4562    0.0161     0.4245    0.4874 
    18      169      7     13      0.4373    0.0169     0.4040    0.4702 
    19      149      4     15      0.4256    0.0174     0.3912    0.4595 
    20      130      3     18      0.4158    0.0179     0.3804    0.4507 
    21      109      4     23      0.4005    0.0188     0.3635    0.4372 
    22       82      4      9      0.3810    0.0203     0.3412    0.4206 
    23       69      0      9      0.3810    0.0203     0.3412    0.4206 
    24       60      0      2      0.3810    0.0203     0.3412    0.4206 
    25       58      0     10      0.3810    0.0203     0.3412    0.4206 
    26       48      2     13      0.3651    0.0223     0.3214    0.4088 
    27       33      5     24      0.3098    0.0296     0.2528    0.3684 
    28        4      0      4      0.3098    0.0296     0.2528    0.3684 


The survivor function estimate is computed as follows. At time 1, 294 
out of 3,343 observations failed, so a proportion 294/3343 = 0.0879 failed 
and 1 - 0.0879 = 0.9121 survived. At time 2, there were 2,803 observations 
at risk because 294 observations were lost at time 1 because of failure and 
246 were lost because of random censoring. Of these 2,803 at-risk 
observations, 178 then failed, so at time 2 a proportion 178/2803 = 0.0635 
failed and 1 - 0.0635 = 0.9365 survived. Cumulatively, 
0.9121 × 0.9365 = 0.8541 survived to the end of period 2, and so on. 


The precise formula for the Kaplan-Meier estimate of the survivor 
function is 


$$\widehat{S}(t) = \prod_{j|t_j \le t} \frac{n_j - d_j}{n_j}$$


where $t_j$, $j = 1, \ldots$, denotes the times at which failures occur, $n_j$ denotes 
the number of individuals at risk of failure just before time $t_j$, and $d_j$ 
denotes the number of failures at time $t_j$. In the current example, failures 
occur at consecutive time periods 1, 2, ..., 28, but the methods are 
applicable to the more general case where failures may not occur in all time 
periods. 
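
The product formula can be applied by hand to the first few rows of the sts list output. In the sketch below, the at-risk and failure counts are typed in from that output; the variable names are hypothetical.

```stata
* Sketch: Kaplan-Meier recursion on the first five rows of the sts list output
clear
input time atrisk fail
1 3343 294
2 2803 178
3 2321 119
4 1897  56
5 1676 104
end
* Share surviving each period, then the running product computed via logs
generate double share = (atrisk - fail)/atrisk
generate double km = exp(sum(ln(share)))
format share km %6.4f
list, clean noobs
```

The km column should match the survivor function column of the sts list output for times 1 through 5.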


Command sts graph provides a plot of these estimates. We produce this 
graph with 95% pointwise confidence bands, along with a similar graph 
segmented by unemployment insurance status. 


. * Graph survivor function over all and by ui 
. qui sts graph, survival ci legend(pos(8) col(1) ring(0)) 
. qui sts graph, by(ui) survival ci legend(pos(8) col(1) ring(0)) 
Figure 21.2. Kaplan-Meier estimate of survivor function 


Figure 21.2 presents the graph. The survivor function can be estimated 
to time t = 28, by which time the probability of remaining without a full- 
time job is estimated to be around 0.3. The survivor curve is higher for 
those who filed an unemployment insurance claim (ui=1). 


21.3.3 Hazard rate and cumulative hazard function 


The hazard rate is the instantaneous probability of the spell ending at time $t$, 
conditional on survival to time $t$. In the current example, the hazard rate at 
time $t = 8$, for example, is the probability of finding a full-time job in the 
8th period, given that a full-time job was not found in the first 7 periods. 


Let $f(t)$ denote the density of $T$, let $F(t) = \Pr(T \le t)$ denote the 
c.d.f., and, as before, let $S(t) = \Pr(T > t) = 1 - F(t)$ denote the survivor 
function. The hazard rate $h(t)$ is defined to equal $f(t)$, the instantaneous 
probability of the spell ending, divided by $S(t)$, the probability of the spell 
lasting to time $t$. The hazard function or hazard rate is then 


$$h(t) = \lim_{\Delta \to 0} \frac{\Pr(t \le T < t + \Delta \mid T \ge t)}{\Delta} = \frac{f(t)}{S(t)} = \frac{f(t)}{1 - F(t)}$$


An advantage of modeling $h(t)$ is that, for censored data, it can be 
nonparametrically estimated up to $t_{\max}$, the maximum observed length of a 
completed spell in the dataset. This does not allow estimation of the mean 
spell length if data are censored. But it is still very informative because the 
hazard rate is of interest per se. In the current example, a hazard rate that is 
decreasing in $t$ implies a serious policy problem because the longer a person 
remains without a full-time job, the less likely he or she is to obtain one. 


A simple estimate of the hazard function is 


$$\widehat{h}(t_j) = \frac{d_j}{n_j} \qquad (21.2)$$


the number of failures at time $t_j$ divided by the number at risk immediately 
before time $t_j$. For example, from the preceding output at $t = 5$, there were 
1,676 observations at risk and 104 failed, so $\widehat{h}(5) = 104/1676 = 0.0621$. 


This estimate is exceptionally noisy because it is based on only a small 
proportion of the nj observations. It is common to instead produce a kernel- 
smoothed version of the hazard function. This can be obtained using the 
hazard option of the sts graph command. 


A related measure is the cumulative hazard function $H(t) = -\ln S(t)$, 
with 


$$H(t) = \sum_{j|t_j \le t} h(t_j)$$


which is estimated by the Nelson-Aalen estimate of the cumulative hazard 
function 


$$\widehat{H}(t) = \sum_{j|t_j \le t} \widehat{h}(t_j) = \sum_{j|t_j \le t} \frac{d_j}{n_j}$$
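
The simple hazard estimate and its running sum can be computed by hand from the first few rows of the earlier sts list output. The sketch below types in the at-risk and failure counts from that output and compares the Nelson-Aalen sum with minus the log of the Kaplan-Meier estimate; the variable names are hypothetical.

```stata
* Sketch: hazard, Nelson-Aalen cumulative hazard, and -ln(Kaplan-Meier)
* for the first five rows of the sts list output
clear
input time atrisk fail
1 3343 294
2 2803 178
3 2321 119
4 1897  56
5 1676 104
end
generate double haz = fail/atrisk             // simple hazard estimate d_j/n_j
generate double na = sum(haz)                 // Nelson-Aalen cumulative hazard
generate double neglnkm = -sum(ln(1 - haz))   // -ln of Kaplan-Meier estimate
format haz na neglnkm %6.4f
list time haz na neglnkm, clean noobs
```

The two cumulative measures are close but not identical in discrete data: because $-\ln(1 - h) \ge h$ for $0 \le h < 1$, the Nelson-Aalen estimate never exceeds minus the log of the Kaplan-Meier estimate.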


The following code provides estimates of the cumulative hazard 
function and the hazard function, along with 95% confidence bands. 


. * Graph cumulative hazard function and smoothed hazard function 
. local endash = ustrunescape("\u2013") 

. qui sts graph, cumhaz ci legend(pos(11) col(1) ring(0)) 
>     title("Nelson`endash'Aalen cumulative hazard") 

. qui sts graph, hazard ci legend(pos(8) col(1) ring(0)) 


[Figure panels: Nelson-Aalen cumulative hazard with 95% CI (against analysis time); smoothed hazard estimate with 95% CI (against analysis time); raw hazard estimate, haz with fitted values (against time)] 


Figure 21.3. Kaplan-Meier estimate of survivor function 


From the second panel of figure 21.3, the hazard function is very noisily 
estimated, even after smoothing. The hazard appears to be declining, 
especially for t > 16. 


The third panel of figure 21.3 presents an estimate of the raw hazard 
function before smoothing. This was obtained using the following 
commands. 


. * Graph raw hazard rate 
. qui sts list, cumhaz saving(cumhazard, replace) 

. preserve 

. use cumhazard, clear 

. qui generate haz = cumhaz - cumhaz[_n-1] 

. qui graph twoway (scatter haz time) (lfit haz time), 
>     legend(pos(11) col(1) ring(0)) title("Raw hazard estimate") 

. restore 


21.4 Semiparametric regression model 


We now consider regression, beginning with the Cox PH regression model, 
which does not require complete specification of the conditional density of 
the data. This model focuses on estimating the role of individual observed 
heterogeneity while controlling for duration dependence in a flexible way. 


The model is especially useful in clinical trials that measure the effect on 
the hazard rate of taking a drug because coefficient estimates can be 
interpreted as, for example, taking the drug decreases the hazard of death by 
10%. 


The model is less useful if interest lies in the nature of duration 
dependence over time because its flexibility often leads to imprecise 
estimation of the time-varying portion of the hazard. 


21.4.1 The stcox command 


Estimates of the Cox PH regression model are obtained using the stcox 
command, which has syntax 


stcox [indepvars] [if] [in] [, options] 


An important option is the nohr option, which, as explained in the next 
subsection, reports raw coefficients rather than exponentiated coefficients. 
Note that indepvars is just a list of the regressors. It is not necessary to provide 
the names of the dependent variable and any censoring variable because these 
are already declared in the stset command. 


The postestimation commands predict, margins, stcurve, stphplot, 
stcoxkm, and estat phtest are presented in this section. 
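
The command sequence can be tried on simulated data before turning to the jobless-spell example. The following is a minimal sketch under assumed values: a constant baseline hazard of 0.1, a single regressor with true coefficient 0.5, and an independent uniform censoring time. None of these choices, nor the variable names, come from this chapter's dataset.

```stata
* Sketch on simulated data: stset once, then stcox with regressors only
clear
set seed 10101
set obs 2000
generate x = runiform()
* Exponential durations with hazard 0.1*exp(0.5*x), so PH holds by design
generate t = -ln(runiform())/(0.1*exp(0.5*x))
generate c = 15*runiform()          // hypothetical independent censoring time
generate byte d = t <= c            // d = 1 if the spell is completed
replace t = min(t, c)               // observed duration
stset t, failure(d==1)
stcox x, nohr nolog                 // raw coefficient
display "exponentiated coefficient = " exp(_b[x])
```

The nohr coefficient should be near the assumed value of 0.5. Omitting nohr would instead report the exponentiated coefficient that the final display line computes.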


21.4.2 Cox proportional hazards model 


The Cox PH model defines the hazard rate for dependent variable $t$, 
conditional on regressors $\mathbf{x}$, for the $i$th individual to be 


$$h(t_i|\mathbf{x}_i) = h_0(t_i) \times \exp(\mathbf{x}_i'\boldsymbol{\beta}) \qquad (21.3)$$


The term “proportional hazards” is used because the model restricts the 
hazard rate to be the same baseline hazard rate function $h_0(\cdot)$ for all 
individuals. This baseline hazard is then scaled up or down for each 
individual by $\exp(\mathbf{x}_i'\boldsymbol{\beta})$, called the relative hazard. In the simplest case, given 
here, the regressors $\mathbf{x}_i$ are constant through an individual’s spell. This 
assumption can be relaxed to allow time-varying regressors; see section 21.8. 


The maximum partial-likelihood estimation method for the Cox PH model 
is detailed in, for example, Cameron and Trivedi (2005, 594-596). A major 
feature is that the method is robust to independent censoring, meaning that 
after conditioning on any regressors, the censoring mechanism is independent 
of the duration process for the dependent variable t. For example, there is no 
problem if we randomly lost track of someone in the sample or if we simply 
have not observed them to the end of their spell. 


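Estimation yields estimates of the parameters $\boldsymbol{\beta}$. Using calculus methods,
we see a one-unit change in the $j$th regressor is associated with the following
change in the hazard rate:


$$\frac{\partial h(t_i|\mathbf{x}_i)}{\partial x_{ji}} = h_0(t_i) \times \exp(\mathbf{x}_i'\boldsymbol{\beta}) \times \beta_j = h(t_i|\mathbf{x}_i) \times \beta_j$$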


It follows that if $\beta_j > 0$, then an increase in the regressor is associated with
an increase in the hazard rate and hence in the likelihood of failure. For
example, it may be that the higher the wage in the previous job, the more
likely it is that we will exit from unemployment given the current length of
the unemployment spell. And if $\beta_j = 2\beta_k$, for example, then the effect on the
hazard rate of changing $x_j$ is twice that of changing $x_k$.


For censored data, the Cox PH model cannot produce estimates of the 
conditional mean. It provides only estimates of the hazard function and 
related functions such as the survivor function. For duration data, the hazard 
function is of intrinsic interest. In principle, a similar estimator can be applied 
to right-censored expenditure data, in lieu of a tobit model. But for data such
as expenditure data, the hazard function, the ratio of the conditional density
to one minus the conditional c.d.f., is of no interest.


21.4.3 Cox PH estimates 


Estimation using the stcox command with the nohr option yields 


. * Cox PH regression with raw coefficients 
. stcox ui logwage, nohr vce(robust) nolog 


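        Failure _d: censor1==1
  Analysis time _t: spell

Cox regression with Breslow method for ties

No. of subjects =  3,343                                Number of obs =  3,343
No. of failures =  1,073
Time at risk    = 20,887
                                                        Wald chi2(2)  = 287.10
Log pseudolikelihood = -7847.2989                       Prob > chi2   = 0.0000

------------------------------------------------------------------------------
             |               Robust
          _t | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
          ui |  -1.007489   .0609082   -16.54   0.000    -1.126866   -.8881108
     logwage |   .3991956   .0525997     7.59   0.000      .296102    .5022892
------------------------------------------------------------------------------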


The hazard rate increases with higher wage in the previous job (variable 
logwage) and decreases for those who have filed for unemployment insurance 
benefits (ui=1). Furthermore, these effects are highly statistically significant. 


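The magnitude of these effects is not immediately clear. Consider a one-unit
change in regressor $x_1$, where we have partitioned
$\exp(\mathbf{x}'\boldsymbol{\beta}) = \exp(\beta_1 x_1 + \mathbf{x}_2'\boldsymbol{\beta}_2)$. Then the hazard rate defined in (21.3) is $e^{\beta_1}$ times
larger because


$$\frac{h_0(t_i)\exp\{\beta_1(x_{1i}+1) + \mathbf{x}_{2i}'\boldsymbol{\beta}_2\}}{h_0(t_i)\exp(\beta_1 x_{1i} + \mathbf{x}_{2i}'\boldsymbol{\beta}_2)} = \frac{\exp(\beta_1 x_{1i} + \mathbf{x}_{2i}'\boldsymbol{\beta}_2) \times \exp(\beta_1)}{\exp(\beta_1 x_{1i} + \mathbf{x}_{2i}'\boldsymbol{\beta}_2)} = \exp(\beta_1)$$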


The default form of the stcox command reports these interpretable
exponentiated coefficients $e^{\widehat{\beta}}$, rather than $\widehat{\beta}$. Using the default version of the
stcox command, we obtain


. * Cox PH regression with exponentiated coefficients 
. stcox ui logwage, vce(robust) nolog 


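        Failure _d: censor1==1
  Analysis time _t: spell

Cox regression with Breslow method for ties

No. of subjects =  3,343                                Number of obs =  3,343
No. of failures =  1,073
Time at risk    = 20,887
                                                        Wald chi2(2)  = 287.10
Log pseudolikelihood = -7847.2989                       Prob > chi2   = 0.0000

------------------------------------------------------------------------------
             |               Robust
          _t | Haz. ratio   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
          ui |   .3651348   .0222397   -16.54   0.000     .3240471    .4114323
     logwage |   1.490625   .0784065     7.59   0.000     1.344607      1.6525
------------------------------------------------------------------------------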


The underlying estimates are the same as those obtained with the option
nohr, and the fitted log likelihood is the same. The only difference is that the
coefficients are now reported as hazard ratios, leading to different standard
errors and 95% confidence intervals. The hazard ratio for variable ui is the
exponential of the coefficient previously reported because
$\exp(-1.007489) = 0.365135$. Note that the z statistics and p-values are the
same as those given in the preceding output and are interpreted as tests
against an exponentiated coefficient taking value 1.
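

The relationship between the two sets of output can be checked by hand. The following display commands are a minimal sketch that uses only numbers copied from the nohr output above; the delta-method formula se{exp(b)} = exp(b) x se(b) is the usual one for an exponentiated coefficient.

```stata
* Hazard ratios are the exponentiated raw coefficients
display "HR ui         = " %9.7f exp(-1.007489)
display "HR logwage    = " %9.7f exp(.3991956)
* Delta-method standard errors: exp(b) * se(b)
display "se HR ui      = " %9.7f (exp(-1.007489)*.0609082)
display "se HR logwage = " %9.7f (exp(.3991956)*.0525997)
* Confidence limits are the exponentiated limits for the raw coefficient
display "95% CI ui     = [" %9.7f exp(-1.126866) ", " %9.7f exp(-.8881108) "]"
```

Up to rounding, these reproduce the Haz. ratio, std. err., and confidence-interval columns of the default output. The z statistics are unchanged because they are computed from the raw coefficients.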


The model results are interpreted as follows. Filing for unemployment 
insurance benefits is associated with the hazard of the unemployment spell 
ending being only 0.365 times what it would be otherwise, a substantial 
decrease. Similarly, a one-unit change in logwage leads to the hazard rate
being 1.491 times higher. These economic variables clearly have a very large 
effect on the duration of the unemployment spell. 


The estimation results are usually interpreted in this way; marginal effects 
are rarely computed. 


21.4.4 Prediction and marginal effects 


As already noted, the Cox PH estimator controls for censoring using minimal
assumptions that do not produce estimates of the conditional mean. Instead
the hazard function is identified. The default hr option of the predict
postestimation command predicts the relative hazard $\exp(\mathbf{x}_i'\boldsymbol{\beta})$, while the xb
option predicts $\mathbf{x}_i'\boldsymbol{\beta}$.


Other predict options produce predictions only for uncensored 
observations with completed spells. The options basehc, basesurv, and 
basechazard produce, respectively, the baseline hazard contributions, 
baseline survivor function, and baseline cumulative hazard. Various residuals 
can be obtained: martingale-like (mgale), Cox-Snell (csnell), deviance
(deviance), efficient scores (scores), Schoenfeld (schoenfeld), and scaled 
Schoenfeld (scaledsch); see [ST] stcox postestimation. 


The only options of the margins command, aside from
predict(statistic), are hr and xb. The default option is hr. The margins,
dydx(*) command, used extensively elsewhere in this book, is of no use here
because it gives marginal effects (MEs) with respect to the relative hazard
$\exp(\mathbf{x}'\boldsymbol{\beta})$, rather than the hazard rate $h_0(t) \times \exp(\mathbf{x}'\boldsymbol{\beta})$. Instead, we interpret
the exponentiated coefficients as relative hazard rates.


21.4.5 The stcurve command 


The stcurve command provides graphs of the survivor, cumulative hazard, 
and hazard functions following several st estimation commands, including 
stcox and streg. For stcox, the graphs are produced only for uncensored 
observations. 


These functions of time, such as the hazard $h(t|\mathbf{x}) = h_0(t) \times \exp(\mathbf{x}'\boldsymbol{\beta})$,
vary according to the regressor values. The Stata default is to evaluate at the
mean value of the regressors. The at(varname=#) option instead evaluates
at values of specified regressors and at the mean of other regressors.


Here we evaluate at ui=1 and at the mean of logwage. 


. * PH model curves: Survivor, cumulative hazard, and hazard functions 
. qui stcurve, survival at(ui=1) title("Survivor function")

. qui stcurve, cumhaz at(ui=1) title("Cumulative hazard function")

. qui stcurve, hazard at(ui=1) title("Hazard function")


Figure 21.4 presents the graphs. They have similar shape to those
obtained earlier for variable spell in the nonregression case. Bootstrap
confidence intervals for the survival curve can be obtained using the
community-contributed bsurvci command (Ruhe 2019). The article provides
Stata code that can be adapted to other quantities computed by the predict
postestimation command following the stcox or streg commands.


Figure 21.4. Survivor and related curves following Cox PH
regression


21.4.6 Diagnostics for the PH model 


The Cox PH model has the attraction of not being fully parametric, but it does 
make the key assumption that the hazard rate h(t|x), which depends on both 
duration (t) and regressors (x), factorizes into two separate components, so 


$$h(t|\mathbf{x}) = h_0(t) \times \exp(\mathbf{x}'\boldsymbol{\beta})$$


Stata provides three diagnostics for the Cox PH model: stphplot, 
stcoxkm, and estat phtest. 


PH implies proportional integrated hazards, with
$H(t|\mathbf{x}) = H_0(t) \times \exp(\mathbf{x}'\boldsymbol{\beta})$, where $H(t) = \int_0^t h(s)\,ds$. Taking the
logarithm, $\ln H(t|\mathbf{x}) = \ln H_0(t) + \mathbf{x}'\boldsymbol{\beta}$. So the log-integrated hazard curves,
also called the log-log survivor curves, should be parallel at different values
of the regressors.
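

To see why these curves are called log-log survivor curves and why parallelism is the testable implication, the following short derivation may help. It uses only the relationship $H(t) = -\ln S(t)$ between the cumulative hazard and the survivor function; the labels $\mathbf{x}_a$ and $\mathbf{x}_b$ for two regressor values are introduced here purely for illustration.

```latex
\begin{align*}
\ln\{-\ln S(t|\mathbf{x})\} &= \ln H(t|\mathbf{x}) = \ln H_0(t) + \mathbf{x}'\boldsymbol{\beta}, \\
\ln\{-\ln S(t|\mathbf{x}_a)\} - \ln\{-\ln S(t|\mathbf{x}_b)\} &= (\mathbf{x}_a - \mathbf{x}_b)'\boldsymbol{\beta}.
\end{align*}
```

The difference does not depend on $t$, so under PH the curves for different regressor values are vertical shifts of one another.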


We use the stphplot command to contrast the log-log survivor curves at
the two different values of ui, controlling for logwage, and at high and low 
wages (using newly created variable highwage), controlling for ui. 


. * PH model diagnostics: Check for parallel log-log survival curves
. qui stphplot, by(ui) adjust(logwage) legend(pos(1) col(1) ring(0))
>     title("Log`endash'log survival by UI")

. qui summarize logwage, d
. qui generate highwage = logwage > r(p50)

. qui stphplot, by(highwage) adjust(ui) legend(pos(1) col(1) ring(0))
>     title("Log`endash'log survival by wage")


Figure 21.5. Check for parallel log-log survivor curves


Figure 21.5 presents the log-log survivor curves. The curves are parallel 
for low and high wages, but for ui they are not close to parallel, so the PH 
assumption does not appear to be appropriate. The recommendation would be 
to fit separate Cox PH models for the two levels of ui. 


The stcoxkm command plots, for each level of the regressor, both the
fitted survivor function from Cox regression and the (nonregression) Kaplan-Meier
estimate of the survivor function. The two survivor curves should be
similar if the PH model is a good model. We do this for both ui and highwage.


. * PH model diagnostics: Check for good prediction of survival curve
. qui stcoxkm, by(ui) legend(pos(1) col(1) ring(0))
>     title("Survival predicted by UI")

. qui stcoxkm, by(highwage) legend(pos(1) col(1) ring(0))
>     title("Survival predicted by wage")


Figure 21.6. Compare Cox PH to Kaplan-Meier survivor curve


From figure 21.6, the model is reasonable for highwage but does not do 
so well for ui. 


The final test is a formal statistical test that, unlike the previous graphical 
diagnostics, can be applied to regressors that take many values. These tests 
are based on a quantity computed for each observation and for each regressor 
called the scaled Schoenfeld residual. These residuals should be independent 
of spell length or any function of spell length. 


This test is implemented using the estat phtest command. We use the 
default option of testing whether the residuals are related to spell length. We 
obtain 


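. * PH model diagnostics: Test of PH assumption
. qui stcox ui logwage, vce(robust) nolog

. estat phtest, detail

Test of proportional-hazards assumption

Time function: Analysis time
--------------------------------------------------------
             |        rho     chi2       df    Prob>chi2
-------------+------------------------------------------
          ui |    0.26267    61.73        1       0.0000
     logwage |   -0.06872     4.14        1       0.0419
-------------+------------------------------------------
 Global test |               61.75        2       0.0000
--------------------------------------------------------

Note: Robust variance-covariance matrix used.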


There is clearly a problem with variable ui, while the null hypothesis of no
relationship with spell length is just rejected for variable logwage at
significance level 0.05.


The following code presents a visual version of this test by obtaining the
Schoenfeld residuals (the schoenfeld option of predict, which gives the
unscaled residuals) and plotting them against spell length.


. * PH model diagnostics: Graph the Schoenfeld residual against time
. qui stcox ui logwage, vce(robust) nolog

. qui predict double rui rlogwage, schoenfeld

. graph twoway (scatter rui _t) (lfit rui _t),
>     ytitle("Schoenfeld residual for ui") legend(off)

. graph twoway (scatter rlogwage _t) (lfit rlogwage _t),
>     ytitle("Schoenfeld residual for logwage")


Figure 21.7. Graph of Schoenfeld residuals against spell length


Figure 21.7 presents the resulting graph. It is clear that for the residuals 
corresponding to variable ui, there is a positive relationship with spell length, 
while there appears to be no relationship between logwage and the spell
length. Again, the Cox PH assumption is inappropriate for variable ui. 


The manual entry [ST] stcox postestimation presents further details on
diagnostic methods, many of which can also be applied after command
streg.


21.4.7 Competing risks models 


The analysis has assumed that the only way to exit from unemployment is to
obtain a full-time job. In fact, people also exit to other jobs such as a part-time
job. These other reasons for exiting are known as competing risks. A
brief presentation is given here; see Cleves, Gould, and Marchenko (2016),
for example, for a more complete analysis.


These competing risks have been treated as independent censoring, so
that in (21.2) the estimated hazard rate $\widehat{h}(t_j) = d_j/n_j$ has a risk set $n_j$ that
does not include persons who exit due to obtaining a job that is not full-time.
This smaller value of $n_j$ leads to overestimation of the hazard of finding full-time
employment because those who have found other employment such as part-time
employment are unlikely to find a full-time job in the current period.


Consider two competing risks where interest lies in durations to event 1
and event 2 is a competing event. Typically, we focus on the survivor
function, which here is $\Pr(T > t \text{ and event 1 occurs})$. With competing
risks, one instead uses the failure function, $\Pr(T \le t \text{ and event 1 occurs})$,
which is called the cumulative incidence function (CIF). For example, we
consider the probability of obtaining a full-time job within $t$ periods.


An estimate of the CIF can be obtained as follows. The appropriate hazard
function includes in the risk set not only observations for which event 1 has
not occurred but also those for which event 2 has already occurred. Let the
random variables denoting the durations to occurrence of events 1 and 2 be
denoted $T_1$ and $T_2$. Then rather than (21.1), we consider the subhazard


$$h_1(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T_1 < t + \Delta t \mid \text{either } T_1 > t \text{ or } T_2 \le t)}{\Delta t}$$


Then $\mathrm{CIF}(t) = 1 - \exp\{-H_1(t)\}$, where $H_1(t) = \int_0^t h_1(s)\,ds$ is the
cumulative subhazard.


To motivate the expression for $\mathrm{CIF}(t)$, note that in the absence of
competing risks, we have the following relationships: the cumulative hazard
function $H(t) = \int_0^t h(s)\,ds$, which is related to the survivor function by
$H(t) = -\ln S(t)$. It follows that $S(t) = \exp\{-H(t)\}$ and hence the failure
function $1 - S(t) = 1 - \exp\{-H(t)\}$.


A semiparametric competing risks version of the Cox PH model defines
the subhazard function to be


$$h_1(t_i|\mathbf{x}_i) = h_{1,0}(t_i) \times \exp(\mathbf{x}_i'\boldsymbol{\beta})$$


where $h_{1,0}(t)$ is the baseline subhazard.


As before, we are interested in duration to a full-time job in which case
censor1 = 1 (for 1,073 cases). We consider the competing risk that the
person obtains a job that is part-time (censor2 = 1 in 339 cases) or obtains a
job but subsequently leaves it, and we do not know whether it is part-time or
full-time (censor3 = 1 in 549 cases). The remaining persons were still
unemployed (censor4 = 1 in 1,255 cases).


The PH competing risks model can be fit using the stcrreg command. We
obtain


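. * Semiparametric competing risks example - full-time job versus not full-time
. qui generate event = 1 if censor1 == 1

. qui replace event = 2 if (censor2 == 1 | censor3 == 1)

. qui stset spell, fail(event=1)

. stcrreg ui logwage, compete(event == 2) nolog vce(robust)

        Failure _d: event==1
  Analysis time _t: spell

Competing-risks regression                        No. of obs      =      3,343
                                                  No. of subjects =      3,343
Failure event:   event == 1                       No. failed      =      1,073
Competing event: event == 2                       No. competing   =        913
                                                  No. censored    =      1,357

                                                  Wald chi2(2)    =     164.45
Log pseudolikelihood = -8155.038                  Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |               Robust
          _t |        SHR   std. err.      z                         interval]
-------------+----------------------------------------------------------------
          ui |   .5098238   .0305227   -11.25                         .5732986
     logwage |   1.552715   .0796558     8.58                         1.716956
------------------------------------------------------------------------------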


The subhazard-ratio coefficients in column SHR are interpreted similarly to
the hazard ratios for the Cox PH model. In particular, because the SHR for ui
is much less than one, the hazard of finding a full-time job is much lower for
persons receiving unemployment insurance, after controlling for logwage and
the competing risk of finding a job that is not full-time.


The postestimation command stcurve, cif plots the cumulative 
incidence function. To obtain the CIF curve for the raw data, one can first 
estimate using the stcrreg command with the only regressor an intercept. 
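

A self-contained sketch of these commands on simulated data follows. The data-generating process (two independent exponential latent durations plus uniform censoring), the variable names, the seed, and all parameter values are assumptions made only for illustration; they are unrelated to the unemployment data.

```stata
* Simulated competing-risks data (illustrative assumption)
clear
set seed 10101
set obs 2000
generate x  = rnormal()
generate t1 = -ln(runiform())/exp(0.5*x)   // latent time to event 1
generate t2 = -ln(runiform())/0.5          // latent time to competing event 2
generate c  = 3*runiform()                 // independent censoring time
* Observed duration is the earliest of the three times
generate dur = min(t1, t2, c)
* event = 1 (event of interest), 2 (competing event), 0 (censored)
generate event = cond(dur == t1, 1, cond(dur == t2, 2, 0))
stset dur, failure(event == 1)
* Subhazard ratios for event 1, treating event 2 as the competing risk
stcrreg x, compete(event == 2) nolog
* Cumulative incidence function evaluated at x = 0
stcurve, cif at(x=0)
```

The final command draws the fitted CIF for event 1; changing the value in at() shifts the curve according to the estimated subhazard ratio.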


21.5 Fully parametric regression models 


Fully parametric models have two advantages over the Cox PH model. First,
they enable estimation of the conditional mean length of an unemployment
spell and how it changes as regressors change. Second, they can provide
more precise estimates of the hazard function, for both censored and
uncensored observations, though different models place different restrictions
on the possible shape of the hazard function.


At the same time, fully parametric models do require correct 
specification of the conditional density in order for parameter estimates to be 
consistent (the one exception being the exponential model if there is no 
censoring). 


In the following examples, we use option vce(robust) to obtain
heteroskedastic-robust standard errors for comparison with those from the
Cox PH model. Note, however, that consistent parameter estimation in all 
fully parametric models, aside from the exponential with uncensored data, 
requires that all aspects of the distribution be correctly specified. 


21.5.1 The streg command 


The syntax for the streg command is


streg [indepvars] [if] [in] [, options]


Note that indepvars is just a list of the regressors. It is not necessary to 
provide the names of the dependent variable and any censoring variable 
because these are already declared in the stset command. 


The censoring is assumed to be independent, meaning that after one 
conditions on any regressors, the censoring mechanism is independent of the 
duration process for the dependent variable t. The probability of being 
censored at time $t$ is simply $\Pr(T > t)$.


The main option is the distribution() option (abbreviated dist()),
which allows the following parametric models: exponential, generalized
gamma, Gompertz, loglogistic, lognormal, and Weibull distributions.


The time option, for the exponential and Weibull models, allows 
interpretation of the model as an accelerated failure-time (AFT) model rather 
than as a PH model. 


The main postestimation commands predict, margins, and stcurve are 
illustrated in this section. 
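

The following fragment is a minimal sketch of the streg workflow on simulated Weibull data, fitting the same model in the default PH parameterization and, with the time option, in the AFT parameterization. The simulated data, variable names, seed, and parameter values are illustrative assumptions; the scalar names b_ph and p_ph are likewise introduced only for this example.

```stata
* Simulated Weibull durations (illustrative assumption, not the chapter's data)
clear
set seed 10101
set obs 1000
generate x = rnormal()
* Hazard 1.5*t^0.5*exp(0.5*x), drawn by inverting the survivor function
generate t = (-ln(runiform())/exp(0.5*x))^(1/1.5)
generate c = 2*runiform()          // independent censoring time
generate d = t <= c                // d = 1 if the spell is completed
generate dur = min(t, c)
stset dur, failure(d)
* PH parameterization with raw coefficients
streg x, distribution(weibull) nohr nolog
scalar b_ph = _b[x]
scalar p_ph = e(aux_p)             // estimated Weibull shape parameter p
* AFT parameterization of the same model
streg x, distribution(weibull) time nolog
* The AFT coefficient should equal minus the PH coefficient divided by p
display "AFT coefficient = " %8.5f _b[x] "   -PH/p = " %8.5f (-scalar(b_ph)/scalar(p_ph))
```

The two fits have the same log likelihood because they are reparameterizations of one model; only the reported coefficients differ.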


In the simplest case of single-spell single-record data, the log likelihood
is similar to that for the tobit model, except that censoring is from above. Let
$t_i$ be the spell length, either completed or not completed, the binary indicator
$d_i = 1$ if the spell is completed, and $f(t_i)$ and $F(t_i)$ denote the uncensored
probability density function and c.d.f. Then the density for the $i$th
observation is $f(t_i)^{d_i}\{1 - F(t_i)\}^{1-d_i}$, leading to log likelihood


$$\ln L = \sum_{i=1}^{N} \left[ d_i \ln f(t_i) + (1 - d_i) \ln\{1 - F(t_i)\} \right]$$
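

As a worked special case (an illustration added here, not an estimate reported in the text), suppose there are no regressors and durations are exponential, so that $f(t) = \gamma e^{-\gamma t}$ and $1 - F(t) = e^{-\gamma t}$. The log likelihood then has a closed-form maximizer, which can be evaluated with the totals that Stata reports for these data (1,073 failures and 20,887 periods at risk).

```latex
\begin{align*}
\ln L(\gamma) &= \sum_{i=1}^{N}\left[d_i(\ln\gamma - \gamma t_i) - (1-d_i)\gamma t_i\right]
              = \ln\gamma \sum_{i=1}^{N} d_i - \gamma\sum_{i=1}^{N} t_i, \\
\widehat{\gamma} &= \frac{\sum_{i} d_i}{\sum_{i} t_i} = \frac{1{,}073}{20{,}887} \approx 0.0514 .
\end{align*}
```

The estimated constant hazard is the number of completed spells divided by total time at risk; censored spells contribute only to the denominator.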


21.5.2 Parametric Weibull model 


A commonly used parametric model is the Weibull distribution, with density


$$f(t) = \gamma\alpha t^{\alpha-1}\exp(-\gamma t^{\alpha}) \tag{21.4}$$


The mean $E(t) = \gamma^{-1/\alpha}\,\Gamma(\alpha^{-1} + 1)$, where $\Gamma(\cdot)$ is the gamma function.


The corresponding hazard function is


$$h(t) = \gamma\alpha t^{\alpha-1}$$


It follows that if $\alpha > 1$, then the hazard is increasing in $t$; if $\alpha < 1$, then the
hazard is decreasing; and if $\alpha = 1$, then the hazard is constant. When $\alpha = 1$,
the Weibull distribution reduces to the exponential distribution. As seen
below, the model is too restrictive for these data because it restricts the
hazard function to be monotonic.


A regression model is obtained by specifying $\gamma = \exp(\mathbf{x}'\boldsymbol{\beta})$. Then the
Weibull model is a special case of the Cox PH model that defines the baseline
hazard function to have the specific functional form $h_0(t) = \alpha t^{\alpha-1}$.


The Weibull model with independent censoring can be fit with the
dist(weibull) option of the streg command. We obtain


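. * Parametric Weibull regression
. stset spell, fail(censor1=1)

Survival-time data settings

         Failure event: censor1==1
Observed time interval: (0, spell]
     Exit on or before: failure

--------------------------------------------------------------------------
      3,343  total observations
          0  exclusions
--------------------------------------------------------------------------
      3,343  observations remaining, representing
      1,073  failures in single-record/single-failure data
     20,887  total analysis time at risk and under observation
                                                At risk from t =         0
                                     Earliest observed entry t =         0
                                          Last observed exit t =        28

. streg ui logwage, nohr vce(robust) dist(weibull) nolog

        Failure _d: censor1==1
  Analysis time _t: spell

Weibull PH regression

No. of subjects =  3,343                                Number of obs =  3,343
No. of failures =  1,073
Time at risk    = 20,887
                                                        Wald chi2(2)  = 281.34
Log pseudolikelihood = -2842.8973                       Prob > chi2   = 0.0000

------------------------------------------------------------------------------
             |               Robust
          _t | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
          ui |   -1.14338   .0688903   -16.60   0.000    -1.278402   -1.008357
     logwage |   .4107721   .0587269     6.99   0.000     .2956695    .5258746
       _cons |  -4.790933   .3361208   -14.25   0.000    -5.449718   -4.132149
-------------+----------------------------------------------------------------
       /ln_p |   .0613705   .0182759     3.36   0.001     .0255504    .0971906
-------------+----------------------------------------------------------------
           p |   1.063293   .0194326                       1.02588     1.10207
         1/p |   .9404747    .017188                       .907383    .9747733
------------------------------------------------------------------------------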


The reported parameter p equals $\alpha$, so if $p > 1$, the hazard is increasing. A
test that $\alpha = 1$ is the same as a test that $\ln\alpha = 0$, and from the output, this
test has $z = 3.36$ and $p = 0.001$, so we strongly reject the hypothesis of a
constant hazard.
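

These quantities can be recomputed from the reported estimates. The sketch below copies numbers from the Weibull output above and then evaluates the Weibull mean formula for one hypothetical individual; the choice ui = 0 and logwage = 5, and the scalar names lnp_hat, p_hat, xb0, and gamma0, are assumptions made only for illustration.

```stata
* Values copied from the Weibull output above
scalar lnp_hat = .0613705
scalar p_hat   = exp(lnp_hat)
display "p = exp(/ln_p) = " %8.6f p_hat "   1/p = " %8.6f (1/p_hat)
* z statistic for H0: ln(alpha) = 0 (equivalently alpha = 1)
display "z = " %5.2f (lnp_hat/.0182759)
* Hypothetical individual with ui = 0 and logwage = 5 (illustrative values only)
scalar xb0    = -4.790933 + .4107721*5
scalar gamma0 = exp(xb0)
* Weibull mean: gamma^(-1/alpha) * Gamma(1/alpha + 1)
display "mean duration = " %6.2f (gamma0^(-1/p_hat)*exp(lngamma(1/p_hat + 1)))
```

The first two lines should agree with the p, 1/p, and z entries of the output up to rounding. The last line shows how a predicted mean duration is built from the coefficients and the shape parameter.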


From output not reported, the default standard errors of ui, logwage,
_cons, and p are, respectively, 0.0631, 0.0531, 0.3102, and 0.0247, roughly
10% less than the heteroskedastic-robust standard errors.


21.5.3 Prediction and MEs


Unlike the Cox PH estimator, most parametric models can predict median and 
mean survival times. 


The default median time option of the postestimation predict command
predicts the median survival time; the mean time option predicts the mean
survival time; and the median lntime and mean lntime options predict the
natural logarithm of these quantities. The hr and xb options predict the
relative hazard and $\mathbf{x}'\boldsymbol{\beta}$.


The postestimation margins command can use these preceding 
predictions, so MEs for these quantities can be computed. 


Additional options of the postestimation predict command compute the 
hazard (option hazard), survival probability (option surv), and various 
residuals: martingale-like (mgale), Cox-Snell (csnell), and deviance
(deviance); see [ST] streg postestimation. 


For the Weibull model, we predict the expected length of the jobless 
spell for each person in the sample, using the mean time option of the 
predict command. 


. * Parametric Weibull model: Prediction of the conditional mean duration time
. qui streg i.ui logwage, nohr vce(robust) dist(weibull) nolog

. predict muspell, mean time
. qui generate completedspell = spell if censor1==1
. qui generate mucompletedspell = muspell if censor1==1

. summarize muspell completedspell mucompletedspell

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     muspell |      3,343    20.34527    9.926651   4.690547    91.0024
completeds~l |      1,073    4.941286    4.890527          1         27
mucomplete~l |      1,073    18.14698    9.774751   4.690547   44.60316


The average of these predictions is 20.35 periods, much higher than the
average across complete and incomplete spells of 6.25 periods. There is
considerable variation across individuals in the predicted spell lengths, from
4.69 periods to 91.00 periods.


We then compute the average marginal effects (AMEs) for the conditional
mean.


. * Parametric Weibull model: AMEs for the conditional mean duration time
. margins, dydx(*) predict(mean time)

Average marginal effects                                 Number of obs = 3,343
Model VCE: Robust

Expression: Predicted mean _t, predict(mean time)
dy/dx wrt:  1.ui logwage

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        1.ui |   19.33181   1.565395    12.35   0.000     16.26369    22.39992
     logwage |  -7.859799   1.275219    -6.16   0.000    -10.35918   -5.360415
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.


Averaged across individuals, having unemployment insurance is associated
with an increase in unemployment duration of 19.33 periods, or 38.66 weeks,
compared with someone with no unemployment insurance.


21.5.4 Proportional hazards models 


We fit a range of parametric models using the streg command. The list does
not include the generalized gamma model because for these data, the
estimates did not converge. We consider four models that lead to a hazard
function that is of the PH form $h(t|\mathbf{x}) = h_0(t) \times \exp(\mathbf{x}'\boldsymbol{\beta})$.


. * Parametric models (plus Cox) with PH parameterization
. qui streg ui logwage, dist(exponential) nohr vce(robust)
. qui stcurve, hazard title("Exponential")
. estimates store Exponential

. qui streg ui logwage, dist(weibull) nohr vce(robust)
. qui stcurve, hazard title("Weibull")
. estimates store Weibull

. qui streg ui logwage, dist(gompertz) nohr vce(robust)
. qui stcurve, hazard title("Gompertz")
. estimates store Gompertz

. qui stcox ui logwage, nohr vce(robust)
. qui stcurve, hazard title("Cox")
. estimates store Cox


The resulting parameter estimates are 


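. * Estimates of parametric models with PH parameterization
. estimates table Exponential Weibull Gompertz Cox, eq(1) b(%11.3f) se
>     stats(ll aic bic)

------------------------------------------------------------------
    Variable | Exponential     Weibull     Gompertz        Cox
-------------+----------------------------------------------------
#1           |
          ui |      -1.113      -1.143      -1.094      -1.007
             |       0.066       0.069       0.065       0.061
     logwage |       0.407       0.411       0.404       0.399
             |       0.057       0.059       0.056       0.053
       _cons |      -4.653      -4.791      -4.577
             |       0.323       0.336       0.317
-------------+----------------------------------------------------
       /ln_p |                   0.061
             |                   0.018
      /gamma |                              -0.013
             |                               0.006
-------------+----------------------------------------------------
Statistics   |
          ll |   -2846.262   -2842.897   -2844.103   -7847.299
         aic |    5698.523    5693.795    5696.206   15698.598
         bic |    5716.867    5718.253    5720.664   15710.827
------------------------------------------------------------------
                                                      Legend: b/se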


The parameter estimates and their standard errors for variables ui, logwage,
and _cons are comparable across the models. The auxiliary parameters in all
models, such as ln_p for the Weibull model, are all statistically significant.
Among the parametric models, the Weibull model provides the best fit, though
the exponential model, with fewer parameters, is preferred if the Bayesian
information criterion is used as the criterion. The Cox log pseudolikelihood is
a partial likelihood, so its value is not comparable with those of the
parametric models.
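

The information criteria in the table follow directly from the log likelihoods. As a minimal check, the commands below apply AIC = -2 ln L + 2k and BIC = -2 ln L + k ln N, using values copied from the table, N = 3,343 observations, and k equal to the number of estimated parameters in each model (a count inferred here from the table rows).

```stata
* Exponential: k = 3 (ui, logwage, _cons)
display "Exponential: AIC = " %10.3f (-2*(-2846.262) + 2*3) "  BIC = " %10.3f (-2*(-2846.262) + 3*ln(3343))
* Weibull: k = 4 (adds /ln_p)
display "Weibull:     AIC = " %10.3f (-2*(-2842.897) + 2*4) "  BIC = " %10.3f (-2*(-2842.897) + 4*ln(3343))
* Gompertz: k = 4 (adds /gamma)
display "Gompertz:    AIC = " %10.3f (-2*(-2844.103) + 2*4) "  BIC = " %10.3f (-2*(-2844.103) + 4*ln(3343))
* Cox: k = 2 (ui, logwage); based on the partial likelihood
display "Cox:         AIC = " %10.3f (-2*(-7847.299) + 2*2) "  BIC = " %10.3f (-2*(-7847.299) + 2*ln(3343))
```

The results should match the aic and bic rows up to rounding in the third decimal place, because the log likelihoods are copied to only three decimals. The extra parameter costs the Weibull model ln(3343), about 8.1, under BIC but only 2 under AIC, which is why the two criteria rank the exponential and Weibull models differently.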


21.5.5 Accelerated failure-time models 


Not all hazard functions are of the PH form. An alternative class of models 
for duration data is that of AFT models. 


An AFT model arises by modeling $\ln t$ rather than $t$. Then a regression
model is


$$\ln t = \mathbf{x}'\boldsymbol{\beta} + u$$


where different distributions for $u$ lead to different parametric models.


The model implies $t = v \times \exp(\mathbf{x}'\boldsymbol{\beta})$, where $v = e^{u}$. It follows that the
hazard function partitions because $h(t|\mathbf{x}) = h_0(v) \times \exp(-\mathbf{x}'\boldsymbol{\beta})$, where the
baseline hazard $h_0(v)$ varies with the distribution of $u$ but clearly does not
depend on $t$. Now $v = t \times \exp(-\mathbf{x}'\boldsymbol{\beta})$, and making this substitution into the
expression for $h(t|\mathbf{x})$ yields the following hazard function for dependent
variable $t$, conditional on regressors $\mathbf{x}$, for an AFT model:


$$h(t|\mathbf{x}) = h_0\{t\exp(-\mathbf{x}'\boldsymbol{\beta})\} \times \exp(-\mathbf{x}'\boldsymbol{\beta})$$


This is an acceleration of the baseline hazard if $\exp(-\mathbf{x}'\boldsymbol{\beta}) > 1$ and a
deceleration otherwise.
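

As a check on the AFT form (a sketch added here, with the subscripts PH and AFT introduced only to distinguish the two parameterizations), write the AFT hazard as the baseline hazard evaluated at rescaled time, multiplied by the rescaling factor, take the Weibull baseline hazard $h_0(v) = \alpha v^{\alpha-1}$, and substitute $v = t\exp(-\mathbf{x}'\boldsymbol{\beta}_{\mathrm{AFT}})$:

```latex
\begin{align*}
h(t|\mathbf{x}) &= \alpha\{t\exp(-\mathbf{x}'\boldsymbol{\beta}_{\mathrm{AFT}})\}^{\alpha-1}
                   \times \exp(-\mathbf{x}'\boldsymbol{\beta}_{\mathrm{AFT}}) \\
                &= \alpha t^{\alpha-1} \times \exp(-\alpha\,\mathbf{x}'\boldsymbol{\beta}_{\mathrm{AFT}})
                 = \alpha t^{\alpha-1} \times \exp(\mathbf{x}'\boldsymbol{\beta}_{\mathrm{PH}}),
                 \qquad \boldsymbol{\beta}_{\mathrm{PH}} = -\alpha\,\boldsymbol{\beta}_{\mathrm{AFT}} .
\end{align*}
```

So the Weibull hazard is of both the AFT and the PH form, with coefficients that differ by the factor $-\alpha$; setting $\alpha = 1$ gives the exponential case, where only the sign changes.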


Not all hazard functions are of the AFT form. The lognormal and 
loglogistic yield hazard functions of this form, and the exponential and 
Weibull yield hazards that can be expressed as both PH and, with some 
reparameterization, as AFT. 


We present four models that lead to a hazard function that is of the AFT 
form; the models are defined in the manual entry for streg. The option time 
is added for the exponential and Weibull models so that results are given for 
the AFT parameterization rather than the default PH parameterization. We 
obtain 


. * Parametric models with AFT parameterization
. qui streg ui logwage, dist(loglogistic) vce(robust)
. qui stcurve, hazard title("Loglogistic")
. estimates store Loglogistic

. qui streg ui logwage, dist(lognormal) vce(robust)
. qui stcurve, hazard title("Lognormal")
. estimates store Lognormal

. qui streg ui logwage, dist(exponential) nohr vce(robust) time
. estimates store ExponAFT

. qui streg ui logwage, dist(weibull) nohr vce(robust) time
. estimates store WeibullAFT


The resulting estimates are 


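. * Estimates of parametric models with AFT parameterization
. estimates table Loglogistic Lognormal ExponAFT WeibullAFT, eq(1) b(%11.3f)
>     se stats(ll aic bic)

------------------------------------------------------------------
    Variable | Loglogistic   Lognormal     ExponAFT   WeibullAFT
-------------+----------------------------------------------------
#1           |
          ui |       1.243       1.213       1.113       1.075
             |       0.062       0.056       0.066       0.064
     logwage |      -0.409      -0.381      -0.407      -0.386
             |       0.056       0.052       0.057       0.055
       _cons |       4.097       3.980       4.653       4.506
             |       0.321       0.300       0.323       0.313
-------------+----------------------------------------------------
    /lngamma |      -0.301
             |       0.019
    /lnsigma |                   0.237
             |                   0.019
       /ln_p |                                           0.061
             |                                           0.018
-------------+----------------------------------------------------
Statistics   |
          ll |   -2774.893   -2726.227   -2846.262   -2842.897
         aic |    5557.787    5460.453    5698.523    5693.795
         bic |    5582.245    5484.912    5716.867    5718.253
------------------------------------------------------------------
                                                      Legend: b/se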


Compared with the signs for PH models, the signs for the fitted coefficients
are reversed because for the AFT models, the regressors enter via $\exp(-\mathbf{x}'\boldsymbol{\beta})$
rather than $\exp(\mathbf{x}'\boldsymbol{\beta})$. Aside from this sign reversal, the AFT and PH estimates
for the exponential model are identical. For the Weibull model, the AFT
estimates of $\boldsymbol{\beta}$ equal the negative of the PH estimates divided by p; for example,
for ui, $-(-1.1434)/1.0633 = 1.075$. The parameter estimates for variable ui differ
across the models, though the standard errors are similar across the models.
The auxiliary parameters in all models, such as /lngamma for the loglogistic
model, are all statistically significant.
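

The correspondence between the two Weibull parameterizations can be verified for every coefficient. The following sketch uses only the Weibull PH estimates and the value of p reported in section 21.5.2.

```stata
* AFT coefficient = -(PH coefficient)/p, with p = 1.063293 from the Weibull PH output
display "ui:      " %7.3f (-(-1.14338)/1.063293)
display "logwage: " %7.3f (-(.4107721)/1.063293)
display "_cons:   " %7.3f (-(-4.790933)/1.063293)
```

The three results should agree with the WeibullAFT column of the table to three decimal places.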


The lognormal model clearly provides the best fit, across both AFT and PH 
models. The loglogistic model is the next-best model. 


21.5.6 Comparison of hazard functions 


Figure 21.8 presents the fitted hazard functions from these parametric 
models, evaluated at the mean of the regressors, as well as the baseline 
hazard for Cox PH. 
Figure 21.8. Hazard rate from various survival models


The exponential restricts the hazard function to be constant. The Weibull and 
Gompertz restrict the hazard function to be monotonic (increasing, 
decreasing, or constant). The loglogistic and lognormal allow for an inverted 
u-shaped hazard function, similar to that obtained earlier for the more 
flexible Cox PH model. As already noted, the loglogistic and lognormal 
provide the best fit. 


21.5.7 Comparison of mean duration and associated AMEs 


The conditional mean duration can be computed for four of the parametric 
models. We obtain 


. * Predicted means and AMEs for four parametric models
. foreach dist in exponential weibull loglogistic lognormal {
  2.     qui streg i.ui logwage, dist(`dist') vce(robust)
  3.     qui margins, predict(mean time)
  4.     display "Model `dist':" _col(21) "Ave mean time =" %6.2f r(b)[1,1]
  5.     qui margins, dydx(*) predict(mean time)
  6.     dis _col(21) "AME ui =" %6.2f r(b)[1,1] %6.2f " AME logwage= " r(b)[1,2]
  7. }
Model exponential:  Ave mean time = 2?.10
                    AME ui =  0.00 AME logwage= 21.609415
Model weibull:      Ave mean time = 20.35
                    AME ui =  0.00 AME logwage= 19.331806
Model loglogistic:  Ave mean time = 44.36
                    AME ui =  0.00 AME logwage= 47.253356
Model lognormal:    Ave mean time = 3?.64
                    AME ui =  0.00 AME logwage= 33.028682


The difference in mean duration time is much greater than the difference
in the estimates of $\boldsymbol{\beta}$ across the models. The biggest prediction is for the
loglogistic model, which predicts an average duration across the sample of
44.36 periods, or 88.72 weeks, and one that is on average 55.15 periods
higher for those with unemployment insurance. Note that after margins,
dydx(*), the first element of r(b) is the zero entry for the base level of ui,
so the value labeled AME logwage in this output is the AME for 1.ui; for the
Weibull model, it equals the 19.33 reported in section 21.5.3.


21.5.8 Models with frailty


The parametric models require correct specification of the density for 
consistent estimation. One possible misspecification is that while the correct 
distribution is specified, the included regressors do not completely capture 
individual heterogeneity. Such neglected or unobserved heterogeneity, called 
frailty in the survival literature, may spill over to an erroneous fitted model 
of duration dependence. 


Frailty is modeled as a multiplicative effect in the hazard function, so
that the hazard function $h(t|\mathbf{x})$ becomes $\alpha \times h(t|\mathbf{x})$, where $\alpha$ is independent
and identically distributed with either the gamma distribution or the
inverse-Gaussian distribution. ML estimates are based on the density that is obtained
by integrating out $\alpha$.


To demonstrate the consequences of neglected heterogeneity, we 
introduce frailty of inverse-Gaussian form to the Weibull model, using the 
frailty(invgau) option. We obtain 


. * Weibull model with gamma frailty 
. streg ui logwage, dist(weibull) frailty(invgau) nolog nohr vce(robust) 


Failure _d: censori== 


Analysis time _t: spell 


Weibull PH regression 
Inverse-Gaussian frailty 


No. of subjects = 3,343 Number of obs = 3,343 
No. of failures = 1,073 
Time at risk = 20,887 

Wald chi2(2) = 390.12 

Log pseudolikelihood = -2763.0423 Prob > chi2 = 0.0000 

             |               Robust 
          _t | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
          ui |  -1.920912   .0985825   -19.49   0.000     -2.11413   -1.727693 
     logwage |   .6401561   .0872424     7.34   0.000      .469164    .8111481 
       _cons |  -5.929082    .502098   -11.81   0.000    -6.913176   -4.944988 
       /ln_p |   .5138361   .0201119    25.55   0.000     .4744176    .5532546 
    /lntheta |   1.973946   .0607123    32.51   0.000     1.854952     2.09294 
           p |   1.671692   .0336208                      1.607078    1.738903 
         1/p |   .5981964   .0120308                      .5750751    .6222474 
       theta |   7.199027   .4370692                      6.391391    8.108717 


The model fit is much better. The log likelihood has increased from 
-2,842.9 to -2,763.0 and is now higher than for all models other than the 
loglogistic model. 


To visualize the effect of neglected frailty on the fitted hazard rate, we 
compare the fitted hazard for an individual (the alpha1 option of the 
stcurve command) with that obtained after integrating out the frailty 
parameter (the unconditional option). 


. * Hazard curves conditional and unconditional on alpha 
. qui stcurve, hazard alpha1 
>     title("Conditional on {&alpha}=1") 

. qui stcurve, hazard unconditional 
>     title("Unconditional on {&alpha}") 
Figure 21.9. Weibull hazard controlling for frailty 


Figure 21.9 presents the graph. After we control for frailty (neglected 
heterogeneity), the hazard rate now has an inverted U-shape similar to that 
obtained by the loglogistic, lognormal, and Cox PH models. 


The estimated coefficients and associated standard errors are generally 
larger for the frailty model than those for the basic Weibull model. At the 
same time, there is relatively little difference in the AME at the median for the 
two models. 


. * AME at median for Weibull without and with frailty 

. qui streg 1.ui logwage, dist(weibull) vce(robust) 

. margins, dydx(*) 

note: option continuous implied because a factor with only one level was specified 
in option dydx(). 

Average marginal effects Number of obs = 3,343 

Model VCE: Robust 

Expression: Predicted median _t, predict() 


dy/dx wrt: 1.ui logwage 


Delta-method 


             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval] 
        1.ui |   15.87554   1.361097    11.66   0.000     13.20784    18.54324 
     logwage |  -5.703467   .9115946    -6.26   0.000    -7.490159   -3.916774 


. qui streg 1.ui logwage, dist(weibull) frailty(invgau) vce(robust) 


. margins, dydx(*) 

note: option continuous implied because a factor with only one level was specified 
in option dydx(). 

Average marginal effects Number of obs = 3,343 

Model VCE: Robust 


Expression: Predicted median _t, predict() 
dy/dx wrt: 1.ui logwage 


Delta-method 


             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval] 
        1.ui |   16.81911   1.299592    12.94   0.000     14.27195    19.36626 
     logwage |  -5.605074   .8508579    -6.59   0.000    -7.272725   -3.937423 


21.6 Multiple-records data 


The dataset studied so far is of single-record form. Datasets of multiple- 
record form are necessary if some regressors vary over the length of the spell 
or if one wishes to fit a discrete-time hazards model. 


In this section, we show how to expand a single-record dataset to a 
multiple-record dataset and demonstrate that Cox PH estimation with this data 
format yields the same estimates as with the single-record format. 


21.6.1 Expanding the dataset 


We wish to expand the dataset with only one observation per individual to 
one of panel form with data for each period that the individual's spell is 
active. This can be done using the expand command, but it is simpler to use 
the stsplit command with the option every(1). 


We obtain 


. * Expand data to have # observations per person = spell length 
. qui use mus221mccall, clear 


. generate id = _n // Need an individual identifier 


. stset spell, fail(censor1=1) id(id) // stset must include id() option 

Survival-time data settings 

           ID variable: id 
         Failure event: censor1==1 
Observed time interval: (spell[_n-1], spell] 
     Exit on or before: failure 

      3,343  total observations 
          0  exclusions 

      3,343  observations remaining, representing 
      3,343  subjects 
      1,073  failures in single-failure-per-subject data 
     20,887  total analysis time at risk and under observation 
                                                At risk from t =         0 
                                     Earliest observed entry t =         0 
                                          Last observed exit t =        28 

. stsplit t, every(1) 
(17,544 observations (episodes) created) 


Recall that there were 3,343 spells with an average length of 6.248 periods, 
leading to a total of 3,343 × 6.248 = 20,887 periods at risk. The stsplit 
command created 17,544 additional observations for a total of 
3,343 + 17,544 = 20,887 observations at risk. 
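The same person-period layout can also be built by hand with the expand command. The sketch below uses a two-person toy dataset with made-up values rather than the chapter's data; the variable d is a hypothetical name for the per-period transition indicator, and t is constructed to start at 0 as in the stsplit output.

```stata
* Toy single-record data: one row per spell (hypothetical values)
clear
input id spell censor1
1 3 1
2 2 0
end

* Create one copy of each record per period at risk
expand spell

* Period index within person, starting at 0
bysort id: generate t = _n - 1

* Transition indicator: 1 only in the last period of a completed spell
bysort id: generate byte d = (censor1 == 1) & (_n == _N)

list id t spell censor1 d, sepby(id)
```

Unlike stsplit, this manual route does not update the st variables (_t, _t0, _d), so it suits discrete-time binary outcome models better than the st commands.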


In addition to expanding the dataset, the stsplit command creates 
several new variables and changes some existing variables. Focusing on the 
key variables of interest for this analysis, we have 


. * Data created by stsplit 
. summarize id spell censor1 ui logwage _* t 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
          id |     20,887    1594.551    964.6319          1       3343 
       spell |     20,887     6.14296    5.184553          1         28 
     censor1 |      3,343    .3209692    .4669188          0          1 
          ui |     20,887    .7062766    .4554776          0          1 
     logwage |     20,887    5.712025    .5376671    2.70805   7.600402 
         _st |     20,887           1           0          1          1 
          _d |     20,887    .0513717    .2207599          0          1 
          _t |     20,887     6.14296    5.184553          1         28 
         _t0 |     20,887     5.14296    5.184553          0         27 
           t |     20,887     5.14296    5.184553          0         27 


. list id spell censor1 ui logwage _* t if id==9, clean 

       id   spell   censor1   ui   logwage   _st   _d   _t   _t0   t 
 67.    9       1         .    1   5.28827     1    0    1     0   0 
 68.    9       2         .    1   5.28827     1    0    2     1   1 
 69.    9       3         .    1   5.28827     1    0    3     2   2 
 70.    9       4         .    1   5.28827     1    0    4     3   3 
 71.    9       5         .    1   5.28827     1    0    5     4   4 
 72.    9       6         .    1   5.28827     1    0    6     5   5 
 73.    9       7         1    1   5.28827     1    1    7     6   6 

. list id spell censor1 ui logwage _* t if id==10, clean 

       id   spell   censor1   ui   logwage   _st   _d   _t   _t0   t 
 74.   10       1         .    1   5.28827     1    0    1     0   0 
 75.   10       2         .    1   5.28827     1    0    2     1   1 
 76.   10       3         .    1   5.28827     1    0    3     2   2 
 77.   10       4         .    1   5.28827     1    0    4     3   3 
 78.   10       5         0    1   5.28827     1    0    5     4   4 


The spell length variable spell now takes the values 1, 2, ... up to the observed 
spell length. The censoring variable censor1 is set to missing until the last 
time period in the spell, at which point it takes its original value of 1 if the 
spell actually ended with employment and 0 if the spell continued. The 
observations for the regressors ui and logwage are simply duplicated, so the 
regressors are time invariant. The new variables created include a time 
variable t equal to 0, 1, 2, .... 


The related stjoin command constructs a single-record dataset from a 
multiple-record dataset. 


Estimation with multiple-record data 


The stcox and streg commands with this multiple-record data with time- 
invariant regressors yield the same results as those obtained earlier using the 
single-spell data, though computation time will be longer. For example, the 
stcox command yields 


. * Cox PH with multiple-records data gives same results as single-record data 
. stcox ui logwage, nolog vce(robust) nohr 

        Failure _d: censor1==1 
  Analysis time _t: spell 
       ID variable: id 

Cox regression with Breslow method for ties 

No. of subjects =  3,343                                Number of obs = 20,887 
No. of failures =  1,073 
Time at risk    = 20,887 
                                                        Wald chi2(2)  = 287.10 
Log pseudolikelihood = -7847.2989                       Prob > chi2   = 0.0000 

                                 (Std. err. adjusted for 3,343 clusters in id) 
             |               Robust 
          _t | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
          ui |  -1.007489   .0609082   -16.54   0.000    -1.126866   -.8881108 
     logwage |   .3991956   .0525997     7.59   0.000      .296102    .5022892 

The coefficient estimates, their standard errors, and the fitted log likelihood 
are identical to those obtained previously using the single-record data. 
Because there are now multiple observations per individual, the reported 
robust standard errors are cluster-robust with clustering on id. 


21.7 Discrete-time hazards logit model 


An alternative method for analyzing duration data is the discrete-time 
hazards model, which fits a binary outcome model for transitions. 


Suppose we observe an individual $i$ for eight periods, with no transition 
occurring in the first seven periods and in the eighth period either a 
transition occurs (a noncensored observation) or a transition does not occur 
(a censored observation). Then we have a binary dependent variable 
$y_{it}$, $t = 1, \ldots, 8$, that equals 0 in the first seven periods and equals either 1 
(transition) or 0 (no transition) in the final period. 


The multiple-record dataset is already in suitable form, though we need 
to create the dependent variable for the discrete time analysis. This variable 
takes value 1 if a transition occurred and 0 if it did not. 


. * Discrete-time hazards: Create indicator variable for getting employed 
. generate demploy = 0 


. replace demploy = 1 if censor1 == 1 
(1,073 real changes made) 


We expect the variable to equal 1 for 1,073 observations because the stset 
command output stated that there were 1,073 failures. 


We now fit a logit model with regressors ui and logwage as before. As 
already stated, in the current example, these regressors are time invariant, 
but more generally, the regressors can be time varying. Discrete-time 
hazards models additionally include a time trend. Here we include a 
quadratic in time. 


Because multiple observations have been created per individual, valid 
statistical inference requires the use of cluster-robust standard errors, with 
clustering on the individual. 


We obtain 


. * Discrete-time hazards: logit with quadratic time trend 
. logit demploy ui logwage c.t##c.t, vce(cluster id) nolog 


Logistic regression Number of obs = 20,887 
Wald chi2(4) = 331.19 
Prob > chi2 = 0.0000 
Log pseudolikelihood = -4027.6351 Pseudo R2 = 0.0479 


(Std. err. adjusted for 3,343 clusters in id) 

             |               Robust 
     demploy | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
          ui |  -1.100934   .0669661   -16.44   0.000    -1.232185    -.969683 
     logwage |   .4401167   .0586368     7.51   0.000     .3251907    .5550427 
           t |  -.1067251   .0163839    -6.51   0.000     -.138837   -.0746133 
     c.t#c.t |   .0041279   .0008648     4.77   0.000      .002433    .0058229 
       _cons |  -4.491509   .3323894   -13.51   0.000     -5.14298   -3.840037 


The results are qualitatively similar to those from the stcox command with 
the nohr option. The probability of obtaining a job increases with the wage 
in the previous job and decreases with unemployment insurance status. This 
probability is initially decreasing in time before increasing after about 13 periods 
($\approx 0.10673/(2 \times 0.004128)$). The margins command can be used to 
compute MEs. 
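The turning point of the quadratic time trend can be checked directly. The sketch below simply copies the two reported time coefficients into scalars; the scalar names b_t, b_tt, and tstar are chosen here for illustration.

```stata
* Coefficients on t and c.t#c.t copied from the reported logit output
scalar b_t  = -0.1067251
scalar b_tt =  0.0041279

* The index b_t*t + b_tt*t^2 is minimized where b_t + 2*b_tt*t = 0
scalar tstar = -b_t/(2*b_tt)
display "Time trend reaches its minimum at t = " %5.2f tstar
```

Immediately after fitting the logit model, a command such as nlcom -_b[t]/(2*_b[c.t#c.t]) should give the same point estimate together with a delta-method standard error.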


21.7.1 Relationship to Cox PH model 


The Cox PH model can be shown to be equivalent to a discrete-time hazards 
complementary log-log model where dummy variables for each time period 
are included as additional regressors; see, for example, Cameron and 
Trivedi (2005, chap. 17.10). 
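A brief sketch of why the complementary log-log link arises, assuming durations are grouped into intervals $(a_{t-1}, a_t]$ and writing $H_0(\cdot)$ for the integrated baseline hazard; the interval endpoints $a_t$ and the period effects $\delta_t$ are notation introduced here for illustration.

```latex
% Continuous-time PH survivor function: S(t|x) = exp{-H_0(t) exp(x'beta)}.
% Discrete-time hazard = probability of exit in (a_{t-1}, a_t] given survival to a_{t-1}.
\begin{align*}
h_t(\mathbf{x}) &= 1 - \frac{S(a_t|\mathbf{x})}{S(a_{t-1}|\mathbf{x})}
   = 1 - \exp\!\left[-\exp(\mathbf{x}'\beta)\{H_0(a_t) - H_0(a_{t-1})\}\right] \\
  &= 1 - \exp\{-\exp(\mathbf{x}'\beta + \delta_t)\},
  \qquad \delta_t = \ln\{H_0(a_t) - H_0(a_{t-1})\}, \\[4pt]
\ln[-\ln\{1 - h_t(\mathbf{x})\}] &= \mathbf{x}'\beta + \delta_t .
\end{align*}
```

The slope coefficients $\beta$ are therefore those of the underlying PH model, while the period dummies absorb the unspecified baseline hazard.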


We obtain 


. * Discrete-time hazards: cloglog with time dummies 
. cloglog demploy ui logwage i.t, vce(cluster id) nolog 
note: 22.t != 0 predicts failure perfectly; 
      22.t omitted and 69 obs not used. 
note: 23.t != 0 predicts failure perfectly; 
      23.t omitted and 60 obs not used. 
note: 24.t != 0 predicts failure perfectly; 
      24.t omitted and 58 obs not used. 
note: 27.t != 0 predicts failure perfectly; 
      27.t omitted and 4 obs not used. 

Complementary log-log regression                     Number of obs     = 20,696 
                                                     Zero outcomes     = 19,623 
                                                     Nonzero outcomes  =  1,073 
                                                     Wald chi2(25)     = 438.48 
Log pseudolikelihood = -3936.0639                    Prob > chi2       = 0.0000 

                                 (Std. err. adjusted for 3,343 clusters in id) 
             |               Robust 
     demploy | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
          ui |  -1.050797   .0636478   -16.51   0.000    -1.175545     -.92605 
     logwage |   .4191098   .0552535     7.59   0.000     .3108149    .5274046 
           t | 
          1  |  -.2674352    .094283    -2.84   0.005    -.4522264    -.082644 
          2  |  -.4300344   .1073308    -4.01   0.000    -.6403989     -.21967 
          3  |  -.9467419   .1445455    -6.55   0.000    -1.230046    -.663438 
          4  |   -.166659   .1125021    -1.48   0.139    -.3871592    .0538411 
          5  |  -1.118788   .1852016    -6.04   0.000    -1.481777   -.7558001 
          6  |   .0064278   .1220951     0.05   0.958    -.2328743    .2457299 
          7  |  -1.491213   .2643466    -5.64   0.000    -2.009323   -.9731037 
          8  |  -.6015566   .1831592    -3.28   0.001    -.9605421   -.2425712 
          9  |  -2.854551   .5804503    -4.92   0.000    -3.992213    -1.71689 
         10  |   -.590193   .2043875    -2.89   0.004    -.9907852   -.1896008 
         11  |  -1.739855   .3828346    -4.54   0.000    -2.490197   -.9895133 
         12  |  -.3692626    .208614    -1.77   0.077    -.7781385    .0396134 
         13  |   .0298682   .1922302     0.16   0.877     -.346896    .4066324 
         14  |  -.1760333   .2389794    -0.74   0.461    -.6444243    .2923578 
         15  |  -.6208792   .3214599    -1.93   0.053    -1.250929    .0091705 
         16  |  -.5828672   .3576104    -1.63   0.103    -1.283771    .1180363 
         17  |  -.5503648   .3822622    -1.44   0.150    -1.299585    .1988553 
         18  |  -.9997678   .5044911    -1.98   0.048    -1.988552   -.0109835 
         19  |  -1.151688   .5843593    -1.97   0.049    -2.297011   -.0063648 
         20  |  -.6933875   .5080862    -1.36   0.172    -1.689218    .3024431 
         21  |  -.3726412   .5100573    -0.73   0.465    -1.372335    .6270527 
         22  |          0  (empty) 
         23  |          0  (empty) 
         24  |          0  (empty) 
         25  |  -.5521873   .7122656    -0.78   0.438    -1.948202    .8438277 
         26  |   .8110057    .468242     1.73   0.083    -.1067317    1.728743 
         27  |          0  (empty) 
       _cons |   -4.30813   .3167992   -13.60   0.000    -4.929045   -3.687215 


Some time dummies are omitted because there were no transitions in those 
periods (t = 22, 23, 24, and 27), and the dummy for the base period t = 0 is 
omitted to avoid the dummy variable trap. 


The coefficient estimates of -1.051 and 0.419 are quite close to 
-1.007 and 0.399 for the Cox PH model. The corresponding standard errors 
are 0.064 and 0.055 compared with 0.061 and 0.053. The coefficient 
estimates are also similar to those for the preceding logit model, which had a 
quadratic in time. 


It is more common to use the logit model than the complementary log-log 
model. The coefficient estimates of the two models are not directly 
comparable but are very similar in this example. From output not listed, logit 
estimation of the same model yielded coefficient estimates of -1.089 and 
0.439 with corresponding standard errors of 0.067 and 0.059. Furthermore, 
the estimated AMEs of the two binary outcome models are very close. 


. * Compare AMEs for discrete-time hazards cloglog and logit 

. qui cloglog demploy 1.ui logwage i.t, vce(cluster id) nolog 

. margins, dydx(1.ui logwage) 

note: option continuous implied because a factor with only one level was specified 
in option dydx(). 

Average marginal effects Number of obs = 20,696 

Model VCE: Robust 

Expression: Pr(demploy), predict() 

dy/dx wrt: 1.ui logwage 


Delta-method 


             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval] 
        1.ui |  -.0522352   .0036189   -14.43   0.000     -.059328   -.0451423 
     logwage |    .020834   .0028167     7.40   0.000     .0153133    .0263546 


. qui logit demploy 1.ui logwage i.t, vce(cluster id) nolog 


. margins, dydx(1.ui logwage) 
note: option continuous implied because a factor with only one level was specified 
in option dydx(). 


Average marginal effects Number of obs = 20,696 
Model VCE: Robust 


Expression: Pr(demploy), predict() 
dy/dx wrt:  1.ui logwage 

             |            Delta-method 
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval] 
        1.ui |  -.0518829   .0035691   -14.54   0.000    -.0588781   -.0448876 
     logwage |   .0209271   .0028518     7.34   0.000     .0153376    .0265166 


We conclude that for these data, the Cox PH model, the complementary log-log 
model, and the logit discrete-time hazards model yield similar results. 


21.8 Time-varying regressors 


The stcox and streg commands allow estimation of models with regressors 
whose values change during the course of the spell. In that case, multiple- 
records data need to be used. 


As an example of a single-spell model with time-varying regressors, we 
adjust the current multiple-records data by creating the variable tvlogwage, 
which equals the original time-invariant variable logwage plus a 
random draw each time period from the $N(0, 0.5^2)$ distribution. 


Estimation of the Cox PH model with the time-varying regressor 
tvlogwage rather than logwage yields 


. * Cox PH regression with time-varying regressors 
. generate tvlogwage = logwage + 0.5*rnormal(0,1) // Create time-varying x 


. stcox ui tvlogwage, vce(robust) nohr nolog 


Failure _d: censor1==1 
Analysis time _t: spell 


ID variable: id 


Cox regression with Breslow method for ties 


No. of subjects = 3,343 Number of obs = 20,887 
No. of failures = 1,073 
Time at risk = 20,887 

Wald chi2(2) = 259.82 


Log pseudolikelihood = -7865.4094 Prob > chi2 = 0.0000 
(Std. err. adjusted for 3,343 clusters in id) 


             |               Robust 
          _t | Coefficient  std. err.      z    P>|z|     [95% conf. interval] 
          ui |  -.9633531   .0599822   -16.06   0.000    -1.080916   -.8457902 
   tvlogwage |   .1711999   .0385886     4.44   0.000     .0955677    .2468321 


The coefficient of ui is little changed, while the coefficient of tvlogwage is 
0.171 compared with 0.399 for logwage. 


21.9 Clustered data 


Section 13.9 presented in some detail various models and methods that can 
be applied when observations in the same cluster are correlated and 
observations in different clusters are uncorrelated. That discussion was 
illustrated using the Poisson model, and much of it carries over to duration 
data. 


When there is no censoring, the simplest approach is to estimate the 
quasi-ML estimator of the exponential model with exponential conditional 
mean, using the command glm y x, family(gamma) scale(1) link(log) 
with the vce(cluster clustvar) option. This maintains the assumption 
that the conditional mean for individual $i$ in cluster $g$ is 
$E(y_{ig}|\mathbf{x}_{ig}) = \exp(\mathbf{x}_{ig}'\beta)$, the essential condition for consistency. But it 
provides corrected standard errors that adjust for the loss in precision that 
arises because observations are no longer independent within cluster. 
Similarly, one could use nonlinear instrumental-variables methods for 
models with endogenous regressors, along with the vce(cluster 
clustvar) option. Potentially more efficient estimates can be obtained by 
nonlinear feasible generalized least-squares estimation assuming 
equicorrelation within cluster, using the xtgee command with options 
family(gamma), link(log), corr(exchangeable), and vce(robust). 
Fixed-effects estimation leads to consistent estimates when there are many 
observations within each cluster. The simplest method is to add i.clid as 
regressors, where clid denotes the cluster identifier variable, though this 
can become computationally infeasible if there are many clusters. 
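The pooled and equicorrelated estimators described above can be tried on simulated data. The sketch below is illustrative only: the data-generating process (100 clusters of 20 uncensored exponential durations with a normally distributed cluster effect a) and all variable names are assumptions made for this example. The glm call leaves out scale(1); with a cluster-robust VCE the dispersion scaling cancels from the sandwich formula, so the standard errors should not depend on it.

```stata
* Simulated clustered, uncensored duration data (hypothetical DGP)
clear
set seed 10101
set obs 2000
generate clid = ceil(_n/20)               // 100 clusters of 20 observations
generate x = rnormal()
generate a = rnormal(0, 0.3)
bysort clid: replace a = a[1]             // common effect within each cluster
generate y = -ln(runiform())*exp(1 + 0.5*x + a)   // exponential duration

* Quasi-ML with exponential conditional mean and cluster-robust standard errors
glm y x, family(gamma) link(log) vce(cluster clid)

* Population-averaged estimator assuming equicorrelation within cluster
xtset clid
xtgee y x, family(gamma) link(log) corr(exchangeable) vce(robust)
```

Both estimators target the same slope (0.5 in this design), because the cluster effect only rescales the conditional mean and is absorbed into the intercept.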


For censored data, the Cox PH model estimator remains consistent in the 
presence of clustering, and one adds the vce(cluster clustvar) option. 
The shared() option for the stcox and streg commands introduces 
frailty (see section 21.5.8) at the cluster level rather than the individual 
level. For several parametric duration models with or without censoring, the 
xtstreg command adds cluster-specific random intercepts that are assumed 
to be normally distributed. The mestreg command generalizes this to fit 
multilevel mixed-effects parametric survival models for the exponential, loglogistic, 
Weibull, lognormal, and gamma distributions, with normally distributed random effects 
and random coefficients. 


21.10 Additional resources 


The [ST] Stata Survival Analysis Reference Manual provides a very detailed 
exposition of methods for survival analysis, with emphasis on methods used 
in biostatistics and epidemiology. The Stata Press book by Cleves, Gould, 
and Marchenko (2016) covers in much more detail many of the topics in 
this chapter. The website book by Jenkins (2005) is more oriented toward 
social science researchers. 


Survival analysis is presented in many complete books, though 
relatively few are written for economists. Econometrics references include 
Lancaster (1990) and Cameron and Trivedi (2005, chaps. 17-19). 


21.11 Exercises 


1. Consider the Weibull distribution with $\gamma = 0.02$ and $\alpha = 2$. Given the 
formula for the Weibull density in (21.4), show that the c.d.f. is 
$F(t) = 1 - \exp(-0.02 \times t^2)$. Generate 10,000 uncensored Weibull 
observations using commands set obs 10000 and set seed 10101 and 
generate t=sqrt(-ln(1-runiform())/0.02). Explain how the inverse 
transformation method is being used here. Verify that the mean of t is 
close to the theoretical mean given after (21.4). Given the formulas for 
f(t) and F(t), derive the formulas for the survivor, hazard, and 
integrated hazard functions for this Weibull example. Given these 
formulas, generate corresponding variables ft, st, ht, and Ht, and plot 
each of these against t. Now, give command stset t, and obtain a plot 
of f(t) using command kdensity t, a plot of S(t) using command sts 
graph, survival ci, and similar plots of h(t) and H(t). Compare 
these graphs to the corresponding plots of ft, st, ht, and Ht against t, 
and comment on the precision of estimation. Finally, give command 
streg, dist(weibull) nohr. Do you obtain the expected estimates? 

2. Now, introduce a regressor in the example of question 1. Generate 
10,000 uncensored Weibull observations with set obs 10000 and 
set seed 10101 and gen x=rnormal(0,1) and 
gen t=sqrt(-ln(1-runiform())/exp(ln(0.02)+0.1*x)). Give commands stset t and 
streg x, dist(weibull) nohr and stcox x, nohr. Do you get the 
expected estimates? Explain. Now, introduce random censoring using 
commands gen fail=runiform()>0.6 and stset t, failure(fail). 
Obtain Cox PH and Weibull ML estimates, and compare the estimates 
and precision to the uncensored case. Then, discretize the data using 
commands replace t=ceil(t) and stset t, failure(fail). Obtain 
Cox PH and Weibull ML estimates, and compare the estimates and 
precision to the nondiscretized case. 

3. The dataset used in this chapter includes binary variable censor4 that 
equals 1 if still jobless. Give commands use mus221mccall and gen 
fail=censor4==0 and stset spell, failure(fail). What fraction of 
spells are completely observed? Give command sts list, and using 
the data listed under Beg. total and Fail and Net Lost, manually 
compute the estimated survivor function and hazard function at time 2. 
Obtain a plot of the smoothed hazard function. Does the hazard appear 
to be increasing or decreasing with spell length? Consider regression 
with regressors ui, age, married, and age. Do the fitted coefficients 
obtained using the Cox PH estimator accord with your prior beliefs? 
Repeat using the Weibull ML estimator. 

4. In the dataset of this chapter, the unemployment spell can end in 
several different ways. For this exercise, use the censoring variable 
censor3 that equals 1 if reemployed but leaves the job. Generate 
survival, cumulative hazard, and hazard function estimates and 
curves, with 95% confidence intervals. Compare with those using 
censor1 as the censoring variable. Then, fit and tabulate the following 
four duration models with ui and logwage as regressors: exponential; 
Weibull with gamma heterogeneity (frailty); Gompertz with gamma 
heterogeneity (frailty); and Cox PH model. Which is the preferred 
model according to the BIC criterion? 

5. For the same setup as question 3, create a multiple-records dataset by 
expanding the dataset using generate id=_n and stset spell, 
failure(fail) id(id) and stsplit t, every(1). Verify that applying 
the stcox command to the multiple-record dataset yields the same 
results as in question 3. Create a binary variable djob that equals 1 if 
the person transitions from being jobless to having a job. Estimate a 
discrete-time hazards logit regression of djob on ui, age, married, and 
a quadratic in t. Compare your results with those from the stcox 
command. Do the discrete-time hazards logit model estimates change 
much if we use dummy variables for each time period rather than a 
quadratic in time? 


Chapter 22 
Nonlinear panel models 


22.1 Introduction 


The general approaches to nonlinear panel models are similar to those for 
linear models (see chapters 8 and 9), such as pooled, population averaged 
(PA), random effects (RE), and fixed effects (FE). 


Panel methods for nonlinear models overlap with methods for nonlinear 
models with clustered data. These methods have been presented in 
section 13.9 and in relevant sections of the various chapters for specific 
types of data. One difference is that for more efficient feasible generalized 
least-squares estimation of pooled nonlinear models with clustering, a 
natural model for within-cluster correlation was equicorrelation, whereas 
for panel data, the within correlation is likely to dampen as observations 
become further apart in time. 


We focus on short panels. Unlike the linear case, the slope parameters in 
pooled and RE models differ. More generally, results for linear models do 
not always carry over to nonlinear models, and methods used for one type 
of nonlinear model may not be applicable to another type. In particular, 
there are only a few nonlinear models for which one can consistently 
estimate parameters of FE models if the panel is a short panel, and in such 
cases, obtaining marginal effects (MEs) remains a challenge. 


Overall, our coverage is more selective in the range of models 
considered. We begin with a general treatment of nonlinear panel models. 
We then give a lengthy treatment of the panel methods for binary outcome 
and ordered outcome models, with emphasis on the logit model. Tobit, 
count data, and conditional quantile regression (QR) models are then 
covered more concisely. 


22.2 Nonlinear panel-data overview 

We assume familiarity with the material in chapter 8. We use the individual-effects 
models as the starting point to survey the various panel methods for nonlinear 
models. 


We consider nonlinear panel models for the scalar dependent variable $y_{it}$ with 
the regressors $\mathbf{x}_{it}$, where $i$ denotes the individual and $t$ denotes time. 


In some cases, a fully parametric model may be specified, with the conditional 
density 


$$f(y_{it}|\alpha_i, \mathbf{x}_{it}) = f(y_{it}, \alpha_i + \mathbf{x}_{it}'\beta, \gamma), \quad t = 1, \ldots, T_i, \quad i = 1, \ldots, N \qquad (22.1)$$


where $\gamma$ denotes additional model parameters such as variance parameters and $\alpha_i$ is 
an individual effect. 


In other cases, a conditional mean model may be specified, with the additive 
effects 


$$E(y_{it}|\alpha_i, \mathbf{x}_{it}) = \alpha_i + g(\mathbf{x}_{it}'\beta) \qquad (22.2)$$


or with the multiplicative effects 


$$E(y_{it}|\alpha_i, \mathbf{x}_{it}) = \alpha_i \times g(\mathbf{x}_{it}'\beta) \qquad (22.3)$$


for the specified function $g(\cdot)$. In these models, $\mathbf{x}_{it}$ includes an intercept, so $\alpha_i$ is a 
deviation from the average centered on zero in (22.1) and (22.2) and centered on 
unity in (22.3). 


22.2.1 FE models 

An FE model treats $\alpha_i$ as an unobserved random variable that may be correlated with 
the regressors $\mathbf{x}_{it}$. In long panels, this poses no problems, aside from possible 
computational challenges if there are many individuals and hence many $\alpha_i$. 


But in short panels, joint estimation of the fixed effects $\alpha_1, \ldots, \alpha_N$ and the 
other model parameters, $\beta$ and possibly $\gamma$, usually leads to inconsistent estimation 
of all parameters. The reason is that the $N$ incidental parameters $\alpha_i$ cannot be 
consistently estimated if $T_i$ is small, because there are only $T_i$ observations for each 
$\alpha_i$. This inconsistent estimation of $\alpha_i$ can spill over to inconsistent estimation of $\beta$. 


For some models, one can eliminate $\alpha_i$ by appropriate conditioning on a 
sufficient statistic for $y_{i1}, \ldots, y_{iT_i}$. This is the case for logit models for binary data 
and multinomial data and for the Poisson model for count data. For other models, 
aside from (22.2), it is not possible, though bias-corrected estimators have been 
developed; see section 22.4.8. For models of form (22.1), a simple alternative 
method is the correlated random-effects (CRE) model presented in section 22.2.4. 
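To see how conditioning can remove the individual effect, consider the standard two-period binary logit case, given here as an illustration with $T_i = 2$ and $\Pr(y_{it} = 1|\alpha_i, \mathbf{x}_{it}) = \exp(\alpha_i + \mathbf{x}_{it}'\beta)/\{1 + \exp(\alpha_i + \mathbf{x}_{it}'\beta)\}$. Conditioning on the sum $y_{i1} + y_{i2} = 1$ gives

```latex
% The common factor exp(alpha_i) cancels from numerator and denominator,
% as do the two logit denominators.
\begin{align*}
\Pr(y_{i1} = 0, y_{i2} = 1 \,|\, y_{i1} + y_{i2} = 1)
 &= \frac{\exp(\alpha_i + \mathbf{x}_{i2}'\beta)}
         {\exp(\alpha_i + \mathbf{x}_{i1}'\beta) + \exp(\alpha_i + \mathbf{x}_{i2}'\beta)} \\
 &= \frac{\exp\{(\mathbf{x}_{i2} - \mathbf{x}_{i1})'\beta\}}
         {1 + \exp\{(\mathbf{x}_{i2} - \mathbf{x}_{i1})'\beta\}} .
\end{align*}
```

The conditional probability does not involve $\alpha_i$, so $\beta$ can be estimated from individuals whose outcome changes over time. Individuals with $y_{i1} = y_{i2}$ contribute nothing to this conditional likelihood.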


Dynamic models with individual fixed effects can be fit in some cases, most 
notably, conditional mean models with additive or multiplicative effects as in (22.2) 
and (22.3). The methods are qualitatively similar to those in the linear case. Stata 
does not currently provide official commands to fit dynamic nonlinear panel 
models; section 22.4.13 presents a community-contributed command for the logit FE 
model. 


22.2.2 RE models 


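An RE model treats the individual-specific effect $\alpha_i$ as an unobserved random 
variable with the specified distribution $g(\alpha_i|\eta)$, often the normal distribution. Then 
$\alpha_i$ is eliminated by integrating over this distribution. Specifically, the unconditional 
density for the $i$th observation is 


$$f(y_{i1}, \ldots, y_{iT_i}|\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT_i}, \beta, \gamma, \eta) = \int \left\{ \prod_{t=1}^{T_i} f(y_{it}|\alpha_i, \mathbf{x}_{it}, \beta, \gamma) \right\} g(\alpha_i|\eta)\, d\alpha_i \qquad (22.4)$$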


In nonlinear models, this integral usually has no analytical solution, but numerical 
integration works well because only univariate integration is required. 
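One common way to carry out the univariate integration is Gauss-Hermite quadrature. The sketch below assumes a normal random effect with variance $\sigma^2$ and writes $q_i(\alpha)$ for the product of the $T_i$ conditional densities of individual $i$; the nodes $z_m$, weights $w_m$, and number of points $M$ are notation introduced here.

```latex
% Change of variable alpha = sqrt(2) * sigma * z turns the normal density into exp(-z^2),
% the weight function of Gauss-Hermite quadrature.
\begin{align*}
\int q_i(\alpha)\, \frac{1}{\sigma\sqrt{2\pi}}
     \exp\!\left(-\frac{\alpha^2}{2\sigma^2}\right) d\alpha
 &= \frac{1}{\sqrt{\pi}} \int q_i(\sqrt{2}\,\sigma z)\, e^{-z^2}\, dz \\
 &\approx \frac{1}{\sqrt{\pi}} \sum_{m=1}^{M} w_m\, q_i(\sqrt{2}\,\sigma z_m) .
\end{align*}
```

The approximation replaces each individual's likelihood contribution with a weighted sum of $M$ evaluations, so the cost grows only linearly in the number of quadrature points.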


The RE approach can be generalized to random-slope parameters (random 
coefficients), not just a random intercept, with a greater computational burden 
because the integral is then of a higher dimension; see the me commands in 
section 23.4. RE models that allow for endogeneity and sample selection can be fit 
using the extended regression model commands xteregress, xteprobit, 
xteoprobit, and xtintreg; see section 23.7. 


22.2.3 Pooled models or PA models 


Pooled models set $\alpha_i = \alpha$. For parametric models, it is assumed that the marginal 
density for a single $(i, t)$ pair, 


$$f(y_{it}|\mathbf{x}_{it}) = f(y_{it}, \alpha + \mathbf{x}_{it}'\beta, \gamma)$$


is correctly specified, regardless of the (unspecified) form of the joint density 
$f(y_{i1}, \ldots, y_{iT}|\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, \beta, \gamma)$. The parameter of the pooled model is easily 
estimated, using the cross-sectional command for the appropriate parametric model, 
which implicitly assumes independence over both $i$ and $t$. A panel-robust or 
cluster-robust (with clustering on $i$) estimate of the variance-covariance matrix of 
the estimator (VCE) can then be used to correct standard errors for any dependence 
over time for a given individual. This approach is the analog of pooled ordinary 
least squares (OLS) for linear models. 


Potential efficiency gains can occur if estimation accounts for the dependence 
over time that is inherent in panel data. This is possible for generalized linear 
models (GLMs), defined in section 13.3.7, where one can weight the first-order 
conditions for the estimator to account for correlation over time for a given 
individual but still have estimator consistency provided that the conditional mean is 
correctly specified as $E(y_{it}|\mathbf{x}_{it}) = g(\alpha + \mathbf{x}_{it}'\beta)$ for a specified function $g(\cdot)$. This 
is called the PA approach, or generalized estimating equations approach, and is the 
analog of pooled feasible generalized least squares for linear models. For 
nonlinear models with clustered data, this approach is presented in section 13.9 with 
equicorrelation the natural model for within-cluster correlation. For panel data, it is 
more natural to model correlation over time for a given individual using time-series 
models for autocorrelation. 


For linear panel-data models, the PA estimators of $\beta$, such as OLS, and the RE 
estimator of $\beta$ have the same probability limit if indeed $y_{it} = \alpha_i + \mathbf{x}_{it}'\beta + u_{it}$ and 
the usual assumptions of the RE model hold. For nonlinear panel models, this 
generally does not hold, because the models lead to different formulas for 
$E(y_{it}|\mathbf{x}_{it})$. So one can no longer directly compare PA coefficient estimates with RE 
coefficient estimates. Instead, it is more meaningful to compare the MEs, such as 
average marginal effects (AMEs), across different models. Exercise 4 at the end of 
the chapter demonstrates this for the probit panel model. 
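The probit model gives a closed-form illustration of why PA and RE coefficients differ. The worked equation below assumes an RE probit with $\alpha_i \sim N(0, \sigma_\alpha^2)$ independent of the regressors, where $\sigma_\alpha^2$ is notation introduced here for the random-effect variance.

```latex
% RE model: Pr(y_it = 1 | x_it, alpha_i) = Phi(alpha_i + x_it' beta).
% Integrating out alpha_i ~ N(0, sigma_alpha^2) rescales the index.
\begin{align*}
\Pr(y_{it} = 1|\mathbf{x}_{it})
 &= \int \Phi(\alpha_i + \mathbf{x}_{it}'\beta)\,
    \frac{1}{\sigma_\alpha}\phi\!\left(\frac{\alpha_i}{\sigma_\alpha}\right) d\alpha_i
  = \Phi\!\left(\frac{\mathbf{x}_{it}'\beta}{\sqrt{1 + \sigma_\alpha^2}}\right) .
\end{align*}
```

A pooled or PA probit therefore estimates $\beta/\sqrt{1 + \sigma_\alpha^2}$ rather than $\beta$: the coefficients are attenuated toward zero relative to the RE coefficients, even though both models imply the same $\Pr(y_{it} = 1|\mathbf{x}_{it})$ and hence comparable AMEs.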


22.2.4 CRE models 


In most nonlinear models with short panels, one cannot consistently fit an FE model 
because of the incidental parameters problem. 


Instead, one may use the CRE model, introduced in section 8.7.4, that uses the 
Mundlak correction and assumes $E(\alpha_i|\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}) = \bar{\mathbf{x}}_i'\gamma$ and defines 
$\alpha_i = \bar{\mathbf{x}}_i'\gamma + \eta_i$, where $\eta_i$ is an independent error. Then the density (22.1) becomes 
$f(y_{it}, \mathbf{x}_{it}'\beta + \bar{\mathbf{x}}_i'\gamma + \eta_i)$. The model is interpreted as a nonlinear RE model in which 
the RE assumptions hold conditionally on both $\mathbf{x}_{it}$ and $\bar{\mathbf{x}}_i$. The addition of extra 
controls could make the RE assumption more acceptable. 


This method is demonstrated for a Poisson cluster example in section 13.9.5 and 
for a logit panel example in section 22.4.9. 
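A minimal sketch of the mechanics, using simulated data rather than the examples cited above: individual means of the time-varying regressor are added as extra regressors to an RE logit. The data-generating process and the variable names (a, x, xbar) are assumptions made for this illustration.

```stata
* Simulated short panel in which the individual effect a is correlated with x
clear
set seed 10101
set obs 500
generate id = _n
generate a = rnormal()                    // individual effect
expand 4                                  // T = 4 periods per individual
bysort id: generate t = _n
generate x = 0.5*a + rnormal()            // regressor correlated with a
generate y = runiform() < invlogit(0.5*x + a)

* Mundlak device: add the individual mean of x as a regressor
bysort id: egen xbar = mean(x)

xtset id t
xtlogit y x xbar, re
```

Fitting xtlogit y x, re without xbar on the same data gives a point of comparison for how much the correction moves the coefficient of x.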


22.2.5 Prediction and MEs 


For PA estimators of nonlinear panel models, predictions and MEs are obtained in a 
manner similar to their calculation for the corresponding nonlinear cross-sectional 
model. The default for the predict and margins postestimation commands is then 
the conditional mean, or conditional probability in the case of a binary outcome. 


Computation is more difficult, however, for FE, RE, and CRE estimators of models
that introduce an individual-specific effect $\alpha_i$. Furthermore, the defaults for the
predict and margins postestimation commands vary with the specific nonlinear
model being estimated and whether estimation is by RE or FE. These complications
do not arise in linear panel models with an additive effect because, for example, if
$E(y_{it}|\mathbf{x}_{it},\alpha_i) = \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i$, then
$\partial E(y_{it}|\mathbf{x}_{it},\alpha_i)/\partial x_{j,it} = \beta_j$ does not depend on $\alpha_i$.


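In general, $E(y_{it}|\mathbf{x}_{it},\alpha_i) = g(\mathbf{x}_{it},\alpha_i,\boldsymbol{\beta})$. For RE (and CRE) estimators with
normally distributed random effects, the default for the predict postestimation
command is to integrate out the random effects and predict
$E(y_{it}|\mathbf{x}_{it}) = \int E(y_{it}|\mathbf{x}_{it},\alpha_i)\, h(\alpha_i|\sigma_\alpha^2)\, d\alpha_i$,
where $h(\alpha_i|\sigma_\alpha^2)$ is the $N(0,\sigma_\alpha^2)$ density.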


For those nonlinear models for which consistent FE estimation is possible in
short panels, little is known about $\alpha_i$, which has been eliminated by appropriate
differencing. Stata commands set $\alpha_i = 0$ and predict $E(y_{it}|\mathbf{x}_{it},\alpha_i = 0)$. Note that
$E(y_{it}|\mathbf{x}_{it},\alpha_i = 0)$ can differ greatly from $E(y_{it}|\mathbf{x}_{it})$ and $E(y_{it}|\mathbf{x}_{it},\alpha_i)$.


MEs of the conditional mean are based on the preceding predictions, $E(y_{it}|\mathbf{x}_{it})$
for RE and CRE estimators with normal random effects and $E(y_{it}|\mathbf{x}_{it},\alpha_i = 0)$ for the
FE model. For the FE model, a potentially better approach is to include individual
dummies as regressors, leading to biased parameter estimates in a short panel, and
compute bias-corrected AMEs; section 22.4.8 provides an example.


In many cases, regressors appear in the single-index form $\mathbf{x}_{it}'\boldsymbol{\beta}$. Regardless of
the treatment of the individual effects, if $\beta_j = 2\beta_k$, then the ME of changing the $j$th
regressor is twice the ME of changing the $k$th regressor. And, as in the
cross-sectional case, logit coefficients can be interpreted in terms of changes in the
log-odds ratio, and Poisson coefficients can be interpreted as semielasticities with
respect to the conditional mean.


22.2.6 Stata nonlinear panel commands 


The Stata commands for PA, RE, and FE estimators of nonlinear panel models are the
same as for the corresponding cross-sectional model, with the prefix xt. For
example, xtlogit is the command for panel logit. The re option fits an RE model,
the fe option fits an FE model if this is possible, and the pa option fits a PA model
with correlation options detailed in section 8.4.3. The xtgee command with
appropriate options is equivalent to the pa option of the xtlogit, xtprobit, and
xtpoisson commands and is available for a wider range of models, including
gamma and inverse Gaussian, for which there is no explicit xt command.


Models with random slopes, in addition to a random intercept, can be fit for a 
range of models using the mixed-effects me commands; see section 23.4. These 
commands are especially useful if interest lies in understanding the nature of 
heterogeneity across individuals. 


Table 22.1 lists the Stata commands for pooled, PA, RE, random slopes, and FE
estimators of leading nonlinear panel models. In addition to the commands listed in
this table, the xtstreg command fits RE duration models, the cmxtmixlogit
command fits panel mixed logit models, the xtgee command fits GLM models, and
the mestreg, meglm, and menl commands enable mixed-effects estimation of,
respectively, duration-data models, GLM models, and nonlinear models with additive
errors.


Table 22.1. Commands for nonlinear panel models 


               Binary          Multinomial     Tobit           Counts
Pooled         logit           ologit          tobit           poisson
               probit          oprobit         intreg          nbreg
               cloglog         mlogit          heckman
PA             xtlogit, pa                                     xtpoisson, pa
               xtprobit, pa                                    xtnbreg, pa
               xtcloglog, pa
RE             xtlogit, re     xtologit, re    xttobit         xtpoisson, re
               xtprobit, re    xtoprobit, re   xtintreg, re    xtpoisson, normal
                               xtmlogit, re                    xtnbreg, re
Random         melogit         meologit        metobit         mepoisson
slopes         meprobit        meoprobit       meintreg        menbreg
               mecloglog
FE             xtlogit, fe     xtmlogit, fe                    xtpoisson, fe
                                                               xtnbreg, fe
Endogenous     xteprobit       xteoprobit      xteregress
and selection                                  xtheckman
                                               xteintreg


22.2.7 Cluster-robust inference


The xt commands report default standard errors that are based on correct
specification of the model for any within-panel correlation. The vce(robust) option
provides standard errors that are robust to any form of within-panel correlation,
based on asymptotic theory that requires the number of individuals in the panel
dataset to be large. By “within-panel correlation”, we mean correlation over time for
a given individual.


The vce(robust) option is equivalent to vce(cluster id), where id is the
panel identifier specified in the xtset command. It presumes independence across
individuals in the panel. Broader forms of clustering, such as individuals in
households or villages, should be allowed for using the vce(cluster clustvar)
option.


These robust standard errors provide a consistent estimate of the precision of
parameter estimates. For PA estimators, consistent estimation of the model
parameters requires correct specification of $E(y_{it}|\mathbf{x}_{it})$, while for most nonlinear RE
estimators, the requirements for consistent estimation are more demanding and
include correct specification of the distribution of within-panel correlation.


22.3 Nonlinear panel-data example 


The example dataset we consider is an unbalanced panel from the Rand 
Health Insurance Experiment. This social experiment randomly assigned 
different health insurance policies to families that were followed for several 
years. The goal was to see how the use of health services varied with the 
coinsurance rate, where a coinsurance rate of 25%, for example, means that 
the insured pays 25% and the insurer pays 75%. Key results from the 
experiment were given in Manning et al. (1987). The data extract we use 
was prepared by Deb and Trivedi (2002). 


22.3.1 Data description and summary statistics 


Descriptive statistics for the dependent variables and regressors follow. 


. * Describe dependent variables and regressors 
. qui use mus218rhie 


. describe dmdu med mdu lcoins ndisease female age lfam child id year 


Variable      Storage   Display    Value
    name         type    format    label      Variable label
dmdu            float   %9.0g                 Any MD visit; 1 if mdu>0
med             float   %9.0g                 Medical exp excl outpatient men
mdu             float   %9.0g                 Number face-to-fact md visits
lcoins          float   %9.0g                 log(coinsurance+1)
ndisease        float   %9.0g                 Count of chronic diseases -- ba
female          float   %9.0g                 Female
age             float   %9.0g                 Age that year
lfam            float   %9.0g                 Log of family size
child           float   %9.0g                 Child
id              float   %9.0g                 Person ID, leading digit is sit
year            float   %9.0g                 Study year


The corresponding summary statistics are 


. * Summarize dependent variables and regressors
. summarize dmdu med mdu lcoins ndisease female age lfam child id year

    Variable |        Obs        Mean    Std. dev.       Min        Max
        dmdu |     20,186    .6875062    .4635214          0          1
         med |     20,186    171.5892    698.2689          0   39182.02
         mdu |     20,186    2.860696    4.504765          0         77
      lcoins |     20,186    2.383588    2.041713          0   4.564348
    ndisease |     20,186     11.2445    6.741647          0       58.6
      female |     20,186    .5169424    .4997252          0          1
         age |     20,186    25.71844    16.76759          0   64.27515
        lfam |     20,186    1.248404    .5390681          0   2.639057
       child |     20,186    .4014168    .4901972          0          1
          id |     20,186    357971.2    180885.6     125024     632167
        year |     20,186    2.420044    1.217237          1          5


We consider three different dependent variables. The dmdu variable is a 
binary indicator for whether the individual visited a doctor in the current 
year (69% did). The med variable measures annual medical expenditures (in 
dollars), with some observations being 0 expenditures (other calculations 
show that 22% of the observations are 0). The mdu variable is the number of 
(face-to-face) doctor visits, with a mean of 2.9 visits. The three variables are 
best modeled by, respectively, logit or probit models, tobit models, and count 
models. 


The regressors are lcoins, the natural logarithm of the coinsurance rate
plus one; a health measure, ndisease; and four demographic variables.
Children are included in the sample. For brevity, we do not include year 
dummies as regressors; their inclusion makes little difference to the 
estimates in this chapter. 


22.3.2 Panel-data organization 


For panel data, we use the xtset command to declare both the individual
and time identifiers. By contrast, for nonpanel clustered data, one declares
only the cluster identifier. The xtdescribe command then describes the
panel-data organization.


. * Panel description of dataset 
. xtset id year 


Panel variable: id (unbalanced) 
Time variable: year, 1 to 5, but with gaps 
Delta: 1 unit 


. xtdescribe 
id: 125024, 125025, ..., 632167 n= 5908 
year: 1, 2, ..., 5 T = 5 


Delta(year) = 1 unit 
Span(year) = 5 periods 
(id*year uniquely identifies each observation) 


Distribution of T_i: min 5% 25% 50% 75% 95% max 
1 2 3 3 5 5 5 
Freq. Percent Cum. Pattern 
3710 62.80 62.80 111..
1584 26.81 89.61 11111 
156 2.64 92.25 1....
147 2.49 94.74 11...
79 1.34 96.07 oa e 
66 1.12 97.19 elt: 
33 0.56 97.75 ..111 
33 0.56 98.31 .1111 
29 0.49 98.80 TE 
71 1.20 100.00 (other patterns) 
5908 100.00 XXXXX 


The panel is unbalanced. Most individuals (90% of the sample of 5,908 
individuals) were in the sample for the first three years or for the first five 
years, which was the sample design. There was relatively small panel 
attrition of about 5% over the first two years. There was also some entry, 
presumably because of family reconfiguration. 


22.3.3 Within and between variation 


Before analysis, it is useful to quantify the relative importance of within and 
between variation. For the dependent variables, we defer this until the 
relevant sections of this chapter. 


The regressor variables lcoins, ndisease, and female are time
invariant, so their within variation is zero. We therefore apply the xtsum
command to only the other three regressors. We have


. * Panel summary of time-varying regressors 
. xtset id year 


Panel variable: id (unbalanced) 
Time variable: year, 1 to 5, but with gaps 
Delta: 1 unit 


. xtsum age lfam child 


Variable         |      Mean   Std. dev.        Min        Max |    Observations
age      overall |  25.71844   16.76759           0   64.27515 |     N =   20186
         between |             16.97265           0   63.27515 |     n =    5908
         within  |             1.086687    23.46844   27.96844 | T-bar = 3.41672
lfam     overall |  1.248404   .5390681           0   2.639057 |     N =   20186
         between |             .5372082           0   2.639057 |     n =    5908
         within  |             .0730824    .3242075    2.44291 | T-bar = 3.41672
child    overall |  .4014168   .4901972           0          1 |     N =   20186
         between |             .4820984           0          1 |     n =    5908
         within  |             .1096116   -.3985832   1.201417 | T-bar = 3.41672


For the regressors age, lfam, and child, most of the variation is between
variation rather than within variation. We therefore expect that FE estimators 
will not be very efficient because they rely on within variation. Also, the FE 
parameter estimates may differ considerably from the other estimators if the 
within and between variation tell different stories. 


22.3.4 FE or RE model for these data? 


More generally, for these data we expect a priori that there is no need to use 
FE models. The point of the Rand experiment was to eliminate the 
endogeneity of health insurance choice, and hence endogeneity of the 
coinsurance rate, by randomly assigning this to individuals. The most 
relevant models for these data are RE or PA, which essentially just correct for 
the panel complication that observations are correlated over time for a given 
individual. 


22.4 Binary outcome and ordered outcome models 


We fit logit models for whether an individual visited a doctor (dmdu). Similar
methods apply for probit and complementary log-log models using the xtprobit
and xtcloglog commands, except that there is then no FE estimator.


22.4.1 Panel summary of the dependent variable 


We begin by studying the correlation over time for a given individual of the binary 
dependent variable. 


. * Logit: Panel summary of dependent variable 
. xtsum dmdu 


Variable         |      Mean   Std. dev.        Min        Max |    Observations
dmdu     overall |  .6875062   .4635214           0          1 |     N =   20186
         between |             .3571059           0          1 |     n =    5908
         within  |             .3073307   -.1124938   1.487506 | T-bar = 3.41672


The dependent variable dmdu has within variation and between variation of similar
magnitude.


The xttrans command summarizes transitions from one period to the next. 


. * Year-to-year transitions in whether visit doctor 
. xttrans dmdu 


  Any MD |
visit; 1 |  Any MD visit; 1 if mdu>0
if mdu>0 |         0          1 |     Total
       0 |     58.87      41.13 |    100.00
       1 |     19.73      80.27 |    100.00
   Total |     31.81      68.19 |    100.00


There is considerable persistence from year to year: 59% of those who did not visit 
a doctor one year also did not visit the next, while 80% of those who did visit a 
doctor one year also visited the next. 


The pwcorr command gives the time-series correlation of the dependent 
variable. 


. * Correlations over time in the dependent variable
. pwcorr dmdu l.dmdu l2.dmdu l3.dmdu l4.dmdu


dmdu L.dmdu L2.dmdu L3.dmdu L4.dmdu 


dmdu 1.0000 
L.dmdu 0.3884 1.0000 
L2.dmdu 0.3599 0.3807 1.0000 
L3.dmdu 0.3351 0.3400 0.3563 1.0000 
L4.dmdu 0.3329 0.3221 0.3372 0.3503 1.0000 


The correlations in the dependent variable, dmdu, vary little with lag length, unlike 
the chapter 8 example of log wage where correlations decrease as lag length rises. 
Such constancy is unusual for panel data and suggests that estimators that impose 
equicorrelation may be quite reasonable for these data. 


22.4.2 Pooled logit estimator 

The pooled logit model is the usual cross-sectional model,

$$\Pr(y_{it} = 1|\mathbf{x}_{it}) = \Lambda(\mathbf{x}_{it}'\boldsymbol{\beta}) \tag{22.5}$$

where $\Lambda(z) = e^z/(1 + e^z)$. A cluster-robust estimate for the VCE is then used to
correct for error correlation over time for a given individual.


The logit command with the vce(cluster id) option yields


. * Logit cross-section with panel-robust standard errors 
. logit dmdu lcoins ndisease female age lfam child, vce(cluster id) nolog 


Logistic regression                             Number of obs =  20,186
                                                Wald chi2(6)  =  488.18
                                                Prob > chi2   =  0.0000
Log pseudolikelihood = -11973.392               Pseudo R2     =  0.0450

                               (Std. err. adjusted for 5,908 clusters in id)

         |               Robust
    dmdu | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
  lcoins |  -.1572107   .0109064   -14.41   0.000    -.1785869   -.1358345
ndisease |    .050301   .0039657    12.68   0.000     .0425285    .0580735
  female |   .3091573   .0445772     6.94   0.000     .2217876     .396527
     age |   .0042689   .0022307     1.91   0.056    -.0001032     .008641
    lfam |  -.2047573   .0470287    -4.35   0.000    -.2969317   -.1125828
   child |   .0921709   .0728107     1.27   0.206    -.0505355    .2348773
   _cons |   .6039411   .1107712     5.45   0.000     .3868335    .8210486


The first four regressors have the expected signs. The negative sign of lfam may be
due to family economies of scale in healthcare. The positive coefficient of child
may reflect a U-shaped pattern of doctor visits with age. The estimates imply that a
child of age 10, say, is as likely to see the doctor as a young adult of age 31 because
$0.092 + 0.0043 \times 10 \approx 0.0043 \times 31 = 0.1333$.


The estimated coefficients can be converted to MEs by using the margins,
dydx(*) command or, approximately, by multiplying by
$\bar{y}(1 - \bar{y}) = 0.69 \times 0.31 = 0.21$. For example, the probability of a doctor visit at
some stage during the year is 0.07 higher for a woman than for a man because
$0.31 \times 0.21 = 0.07$.
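

The two routes to MEs just described can be sketched as follows. This is a minimal illustration rather than output reproduced from the text: it assumes that the mus218rhie dataset is in the working directory and refits the pooled logit before computing AMEs and the rule-of-thumb approximation for female.

```stata
* Sketch only: assumes mus218rhie.dta is available in the working directory
use mus218rhie, clear
quietly logit dmdu lcoins ndisease female age lfam child, vce(cluster id)

* Average marginal effects of all regressors on Pr(dmdu = 1)
margins, dydx(*)

* Rule-of-thumb approximation: coefficient times ybar*(1 - ybar)
quietly summarize dmdu if e(sample)
display "Approximate ME of female = " _b[female]*r(mean)*(1 - r(mean))
```

The approximation uses the sample mean of dmdu in place of each observation's fitted probability, so it will generally differ somewhat from the AME reported by margins.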


In output not given, the default standard errors are approximately two-thirds
those given here, so the use of cluster-robust standard errors is necessary.


22.4.3 The xtlogit command 


The pooled logit command assumes independence over $i$ and $t$, leading to potential
efficiency loss, and ignores the possibility of fixed effects that would lead to 
inconsistent parameter estimates. 


These panel complications are accommodated by the xtlogit command, which
has the syntax


xtlogit depvar [indepvars] [if] [in] [weight] [, options]


The options are for PA (pa), RE (re), and FE (fe) estimators. Panel-robust standard
errors for PA and RE estimators can be calculated by using the vce(robust) option.
The FE estimator requires the assumption of within-panel independence, so the
vce(robust) option is not available. Model-specific options are discussed below in
the relevant model section, and postestimation prediction and MEs are presented in
section 22.4.11.


22.4.4 The xtgee command 


The pa option for the xtlogit command is also available for some other nonlinear
panel commands, such as xtpoisson. It is a special case of the xtgee command for
generalized estimating equations, a cluster generalization of GLMs. This command
has the syntax


xtgee depvar [indepvars] [if] [in] [weight] [, options]


The family() and link() options define the specific model. For example, the linear
model is family(gaussian) link(identity); the logit model is family(binomial)
link(logit). Other family() options are poisson, nbinomial, gamma, and
igaussian (inverse Gaussian).


The corr() option defines the pattern of time-series correlation assumed for
observations on the $i$th individual. These patterns include exchangeable for
equicorrelation, independent for no correlation, and various time-series models that
have been detailed in section 8.4.3.


In the examples below, we obtain the PA estimator by using commands such as 
xtlogit with the pa option. If instead the corresponding xtgee command is used, 
then the estat wcorrelation postestimation command produces the estimated 
matrix of the within-group correlations. 
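

As a sketch of the xtgee route just described, the following fits the exchangeable-correlation PA logit and then displays the estimated within-group correlation matrix. It assumes the health insurance data are already in memory and that xtset id year has been run; it is an illustration of the syntax, not output taken from the text.

```stata
* Sketch only: assumes the data are in memory and xtset id year has been run
xtgee dmdu lcoins ndisease female age lfam child, ///
    family(binomial) link(logit) corr(exchangeable) vce(robust) nolog

* Estimated matrix of within-group correlations under the chosen corr() model
estat wcorrelation
```

Replacing corr(exchangeable) with another pattern, such as corr(independent), changes only the working correlation model and leaves the family() and link() specification unchanged.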


22.4.5 PA logit estimator 


The PA estimator of the parameters of (22.5) can be obtained by using the xtlogit
command with the pa option. Different arguments for the corr() option, presented
in section 8.4.3 and in [XT] xtgee, correspond to different models for the correlation

$$\rho_{ts} = \mathrm{Cor}[\{y_{it} - \Lambda(\mathbf{x}_{it}'\boldsymbol{\beta})\}\{y_{is} - \Lambda(\mathbf{x}_{is}'\boldsymbol{\beta})\}], \quad s \neq t$$

The corr(independent) option gives the pooled logit estimator.


The exchangeable model assumes that correlations are the same regardless of
how many years apart the observations are, so $\rho_{ts} = \rho$. For our data, this model may
be adequate because, from section 22.4.1, the correlations of dmdu varied little with
the lag length. Even with equicorrelation, the covariances can vary across
individuals and across year pairs because, given
$\mathrm{Var}(y_{it}|\mathbf{x}_{it}) = \Lambda_{it}(1 - \Lambda_{it})$, the
implied covariance is $\rho\sqrt{\Lambda_{it}(1 - \Lambda_{it}) \times \Lambda_{is}(1 - \Lambda_{is})}$.


Estimation with the xtlogit, pa corr(exch) command yields 


. * Pooled logit cross-section with exchangeable errors and panel-robust VCE
. xtlogit dmdu lcoins ndisease female age lfam child, pa corr(exch)
>     vce(robust) nolog

GEE population-averaged model                   Number of obs    =  20,186
Group variable: id                              Number of groups =   5,908
Family: Binomial                                Obs per group:
Link:   Logit                                                min =       1
Correlation: exchangeable                                    avg =     3.4
                                                             max =       5
                                                Wald chi2(6)     =  521.45
Scale parameter = 1                             Prob > chi2      =  0.0000

                                  (Std. err. adjusted for clustering on id)

         |               Robust
    dmdu | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
  lcoins |  -.1603179   .0107779   -14.87   0.000    -.1814422   -.1391935
ndisease |   .0515445   .0038528    13.38   0.000     .0439931    .0590958
  female |   .2977003   .0438316     6.79   0.000      .211792    .3836086
     age |   .0045675   .0021001     2.17   0.030     .0004514    .0086836
    lfam |  -.2044045   .0455004    -4.49   0.000    -.2935837   -.1152254
   child |   .1184697   .0674367     1.76   0.079    -.0137039    .2506432
   _cons |   .5776986    .106591     5.42   0.000      .368784    .7866132


The pooled logit and PA logit parameter estimates are very similar. The
cluster-robust standard errors are slightly lower for the PA estimates, indicating a slight
efficiency gain. The parameter estimates can be interpreted in exactly the same way
as those from a cross-sectional logit model.


The within correlations are stored in the e(R) matrix. We have


. * Within correlation of the exchangeable errors
. matrix list e(R)

symmetric e(R)[5,5]
            c1          c2          c3          c4          c5
r1           1
r2   .34033336           1
r3   .34033336   .34033336           1
r4   .34033336   .34033336   .34033336           1
r5   .34033336   .34033336   .34033336   .34033336           1


So $\rho_{ts} = \rho = 0.34$.


22.4.6 RE logit estimator


The logit individual-effects model specifies that

$$\Pr(y_{it} = 1|\mathbf{x}_{it},\boldsymbol{\beta},\alpha_i) = \Lambda(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) \tag{22.6}$$

where $\alpha_i$ may be a fixed effect or a random effect.


The logit RE model specifies that $\alpha_i \sim N(0,\sigma_\alpha^2)$. Then the joint density for the $i$th
observation, after integrating out $\alpha_i$, is

$$f(y_{i1},\ldots,y_{iT_i}) = \int \left[\prod_{t=1}^{T_i} \Lambda(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})^{y_{it}} \{1 - \Lambda(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\}^{1-y_{it}}\right] g(\alpha_i|\sigma_\alpha^2)\, d\alpha_i \tag{22.7}$$

where $g(\alpha_i|\sigma_\alpha^2)$ is the $N(0,\sigma_\alpha^2)$ density. After $\alpha_i$ is integrated out,
$\Pr(y_{it} = 1|\mathbf{x}_{it},\boldsymbol{\beta}) \neq \Lambda(\mathbf{x}_{it}'\boldsymbol{\beta})$, so the RE model parameters are not comparable with
those from pooled logit and PA logit.


There is no analytical solution to the univariate integral (22.7), so numerical
methods are used. The default method is adaptive 12-point Gauss-Hermite
quadrature. The intmethod() option allows other quadrature methods to be used,
and the intpoints() option allows the use of a different number of quadrature
points. The quadchk command checks whether a good approximation has been
found by using a different number of quadrature points and comparing solutions;
see [XT] xtlogit and [XT] quadchk for details.
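

A minimal sketch of these quadrature checks follows. It assumes the data are in memory and have been xtset; the choice of 24 integration points is purely illustrative, and the model is refit without vce(robust) before quadchk is called to keep the check simple.

```stata
* Sketch only: assumes the data are in memory and xtset id year has been run
* Refit the RE logit with 24 rather than the default 12 quadrature points
xtlogit dmdu lcoins ndisease female age lfam child, re intpoints(24) nolog

* Refit with default settings, then let quadchk refit the model using
* other numbers of quadrature points and compare the results
quietly xtlogit dmdu lcoins ndisease female age lfam child, re
quadchk, nooutput
```

Coefficients and log likelihoods that change little across the different numbers of quadrature points suggest that the numerical approximation to the integral is adequate.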


The RE estimator is implemented by using the xtlogit command with the re
option. We have


. * Logit RE estimator 
. xtlogit dmdu lcoins ndisease female age lfam child, re nolog vce(robust) 


Calculating robust standard errors ...

Random-effects logistic regression              Number of obs    =  20,186
Group variable: id                              Number of groups =   5,908

Random effects u_i ~ Gaussian                   Obs per group:
                                                             min =       1
                                                             avg =     3.4
                                                             max =       5

Integration method: mvaghermite                 Integration pts. =      12

                                                Wald chi2(6)     =  526.83
Log pseudolikelihood = -10878.687               Prob > chi2      =  0.0000

                               (Std. err. adjusted for 5,908 clusters in id)

         |               Robust
    dmdu | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
  lcoins |  -.2403864    .016318   -14.73   0.000     -.272369   -.2084038
ndisease |    .078151   .0057388    13.62   0.000     .0669032    .0893988
  female |   .4631005   .0668151     6.93   0.000     .3321454    .5940555
     age |   .0073441   .0031756     2.31   0.021     .0011199    .0135682
    lfam |  -.3021841   .0675885    -4.47   0.000    -.4346551    -.169713
   child |   .1935357    .101308     1.91   0.056    -.0050243    .3920956
   _cons |   .8629898   .1598343     5.40   0.000     .5497203    1.176259
/lnsig2u |   1.225652   .0495481                       1.12854    1.322765
 sigma_u |    1.84564    .045724                      1.758164    1.937469
     rho |   .5087003  .01233833                      .4844281    .5329316


The RE model fits much better than the cross-sectional logit model, with log
likelihood increasing from -11973.4 to -10878.7. The coefficient estimates are
roughly 50% larger in absolute value than those of the PA model. The standard errors
are also roughly 50% larger, so the $t$ statistics are little changed. Clearly, the RE
model has a different conditional mean than the PA model, and the parameters are
not directly comparable.


Computation and comparison of MEs is presented in section 22.4.11. Without
obtaining MEs, note that from (22.7), $\Pr(y_{it} = 1|\mathbf{x}_{it},\boldsymbol{\beta},\alpha_i)$ is of single-index form.
As discussed in section 22.2.5, if one coefficient is twice as big as another, then the
MEs are twice as large.


The output also includes sigma_u, the estimate of the standard deviation $\sigma_\alpha$ of
the random effects, so it is estimated that $\alpha_i \sim N(0, 1.846^2)$. The logit RE model
can be motivated as coming from a latent-variable model, with $y_{it} = 1$ if
$y_{it}^* = \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i + \varepsilon_{it} > 0$, where $\varepsilon_{it}$ is logistically distributed with a variance of
$\sigma_\varepsilon^2 = \pi^2/3$. By a calculation similar to that in section 8.3.10, the intraclass error
correlation in the latent-variable model is $\rho = \sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2)$. Here
$\rho = 1.846^2/(1.846^2 + \pi^2/3) = 0.509$, the quantity reported as rho.
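

The intraclass correlation calculation can be checked directly in Stata. The sketch below simply plugs the sigma_u value reported in the RE logit output into the formula for $\rho$, using Stata's built-in constant _pi; the scalar name sigma_u is our own choice for illustration.

```stata
* Check of rho using the sigma_u estimate reported in the RE logit output
scalar sigma_u = 1.84564

* rho = sigma_alpha^2 / (sigma_alpha^2 + pi^2/3)
display "rho = " sigma_u^2/(sigma_u^2 + _pi^2/3)
```

The displayed value should agree, up to rounding, with the rho of about 0.509 reported by xtlogit.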


22.4.7 FE logit estimator 


In the FE model, the $\alpha_i$ may be correlated with the covariates in the model.
Parameter estimation is difficult, and many of the approaches in the linear case fail.
In particular, the least-squares dummy-variable (DV) estimator of section 8.5.4
yielded a consistent estimate of $\boldsymbol{\beta}$, but a similar DV estimator for the logit model
leads to inconsistent estimation of $\boldsymbol{\beta}$, unless $T \to \infty$.


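One method of consistent estimation eliminates the $\alpha_i$ from the estimation
equation. This method is the conditional maximum-likelihood (ML) estimator, which
is based on a log density for the $i$th individual that conditions on $\sum_{t=1}^{T_i} y_{it}$, the total
number of outcomes equal to 1 for a given individual over time.


We demonstrate this in the simplest case of two time periods. Condition on
$y_{i1} + y_{i2} = 1$, so that $y_{it} = 1$ in exactly one of the two periods. Then, in general,

$$\Pr(y_{i1} = 0, y_{i2} = 1 | y_{i1} + y_{i2} = 1) = \frac{\Pr(y_{i1} = 0, y_{i2} = 1)}{\Pr(y_{i1} = 0, y_{i2} = 1) + \Pr(y_{i1} = 1, y_{i2} = 0)} \tag{22.8}$$

Now $\Pr(y_{i1} = 0, y_{i2} = 1) = \Pr(y_{i1} = 0) \times \Pr(y_{i2} = 1)$, assuming that $y_{i1}$ and $y_{i2}$
are independent given $\alpha_i$ and $\mathbf{x}_{it}$. For the logit model (22.6), we obtain

$$\Pr(y_{i1} = 0, y_{i2} = 1) = \frac{1}{1 + \exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta})} \times \frac{\exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})}{1 + \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})}$$

Similarly,

$$\Pr(y_{i1} = 1, y_{i2} = 0) = \frac{\exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta})}{1 + \exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta})} \times \frac{1}{1 + \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})}$$

When we substitute these two expressions into (22.8), the denominators cancel, and
we obtain

$$\begin{aligned}
\Pr(y_{i1} = 0, y_{i2} = 1 | y_{i1} + y_{i2} = 1)
&= \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})/\{\exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta}) + \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})\} \\
&= \exp(\mathbf{x}_{i2}'\boldsymbol{\beta})/\{\exp(\mathbf{x}_{i1}'\boldsymbol{\beta}) + \exp(\mathbf{x}_{i2}'\boldsymbol{\beta})\} \\
&= \exp\{(\mathbf{x}_{i2} - \mathbf{x}_{i1})'\boldsymbol{\beta}\}/[1 + \exp\{(\mathbf{x}_{i2} - \mathbf{x}_{i1})'\boldsymbol{\beta}\}]
\end{aligned} \tag{22.9}$$

There are several results. First, conditioning eliminates the problematic fixed
effects $\alpha_i$. Second, the resulting conditional model is a logit model with the
regressor $\mathbf{x}_{i2} - \mathbf{x}_{i1}$. Third, coefficients of time-invariant regressors are not
identified, because then $\mathbf{x}_{i2} - \mathbf{x}_{i1} = \mathbf{0}$.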


More generally, with up to $T$ outcomes, we can eliminate $\alpha_i$ by conditioning on
$\sum_t y_{it} = 1$, on $\sum_t y_{it} = 2$, ..., and on $\sum_t y_{it} = T - 1$. This leads to the loss of
those observations where $y_{it}$ is 0 for all $t$ or $y_{it}$ is 1 for all $t$. The resulting
conditional model is more generally a multinomial logit model. For details, see, for
example, Cameron and Trivedi (2005, 796-797) or [R] clogit.
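

Because the conditional ML estimator is the conditional logit estimator with groups defined by the individual, it can also be obtained with the clogit command. The following is a sketch only, assuming the data are in memory; it includes just the time-varying regressors because, by the third result above, coefficients of time-invariant regressors are not identified.

```stata
* Sketch only: conditional (FE) logit via clogit, grouping on the panel identifier
* Individuals with no variation in dmdu over time do not contribute to estimation
clogit dmdu age lfam child, group(id) nolog
```

The group() option plays the role that the panel variable declared by xtset plays for xtlogit with the fe option.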


The FE estimator is obtained by using the xtlogit command with the fe option.
We have


. * Logit FE estimator
. xtlogit dmdu lcoins ndisease female age lfam child, fe nolog
note: multiple positive outcomes within groups encountered.
note: 3,459 groups (11,161 obs) omitted because of all positive or
      all negative outcomes.
note: lcoins omitted because of no within-group variance.
note: ndisease omitted because of no within-group variance.
note: female omitted because of no within-group variance.

Conditional fixed-effects logistic regression   Number of obs    =   9,025
Group variable: id                              Number of groups =   2,449

                                                Obs per group:
                                                             min =       2
                                                             avg =     3.7
                                                             max =       5

                                                LR chi2(3)       =   10.74
Log likelihood = -3395.5996                     Prob > chi2      =  0.0132

    dmdu | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
  lcoins |          0  (omitted)
ndisease |          0  (omitted)
  female |          0  (omitted)
     age |  -.0341815   .0183827    -1.86   0.063     -.070211     .001848
    lfam |    .478755   .2597327     1.84   0.065    -.0303116    .9878217
   child |    .270458   .1684974     1.61   0.108    -.0597907    .6007068


As expected, coefficients of the time-invariant regressors are not identified, and
these variables are dropped. The 3,459 individuals with $\sum_t y_{it} = 0$ (all 0s) or
$\sum_t y_{it} = T_i$ (all 1s) are dropped because there is then no variation in $y_{it}$ over $t$,
leading to a loss of 11,161 of the original 20,186 observations. Standard errors are
substantially larger for FE estimation because of this loss of observations and
because only within variation of the regressors is used.


The coefficients are considerably different from those for the RE logit model, and 
in two cases, the sign changes. The interpretation of parameters is similar to that 
given at the end of section 22.4.6 for the RE model. 


22.4.8 Bias-corrected logit dummy-variable FE estimator 


The logit DV estimator of the FE model obtains logit estimates of both the slope
parameters $\boldsymbol{\beta}$ and the fixed effects $\alpha_i$, $i = 1,\ldots,N$. In a short panel, this leads to
inconsistent estimates of all parameters and subsequent MEs.


Hahn and Newey (2004) present general methods for bias reduction using
analytical formulas or using the jackknife. Fernandez-Val (2009) specialized these
methods to obtain bias-corrected estimates of both parameters and the consequent
AMEs for static and dynamic logit and probit models. For model parameters, the
order of the bias is reduced from $O(T^{-1})$ to $O(T^{-2})$ under the assumption that
$T/N^{1/3} \to \infty$. Fernandez-Val and Weidner (2016) considered the long panel case
with both individual fixed effects and time fixed effects, where both $T \to \infty$ and
$N \to \infty$, and obtained bias corrections assuming $T/N \to c$ for some constant $c$.


The logitfe and probitfe commands (Cruz-Gonzalez, Fernandez-Val, and 
Weidner 2017) provide jackknife and analytical bias-corrected estimates of 
parameters and MEs in FE logit and probit models in short and long panels. 


The following command provides short-panel analytical bias-corrected 
estimates of parameters and AMEs in the logit FE model. 


. global xlist lcoins ndisease female age lfam child 


. * Logit FE DV estimator with analytical bias correction 
. logitfe dmdu $xlist, teffects(no) analytical 


(output omitted ) 
. estimates store FEDVBC 


The parameter estimates are given in section 22.4.10, and AMEs are given in
section 22.4.11. The option nocorrection gives estimates without the bias
correction and is equivalent to logit dmdu i.id $xlist. The option jackknife
gives bias-corrected estimates obtained using the panel delete-one jackknife.


22.4.9 CRE logit estimator 


The logit CRE estimator is the logit RE estimator with the means over time of
those regressors that vary over both individual and time included as additional
regressors. We have


. * Logit correlated RE estimator
. bysort id: egen aveage = mean(age) 


. by id: egen avelfam = mean(lfam) 

. by id: egen avechild = mean(child) 

. xtlogit dmdu $xlist aveage avelfam avechild, re vce(robust) 
(output omitted ) 


. estimates store CRE 


The parameter estimates are given in the next subsection. 
22.4.10 Comparison of panel logit parameter estimates 


We combine the preceding estimators into a single table that makes comparison 
easier. The models are 


. * Panel logit estimator comparison
. global xlist lcoins ndisease female age lfam child 


. qui logit dmdu $xlist, vce(cluster id) 

. estimates store POOLED 

. qui xtlogit dmdu $xlist, pa corr(exch) vce(robust) 

. estimates store PA 

. qui xtlogit dmdu $xlist, re vce(robust) 

. estimates store RE 

. qui xtlogit dmdu $xlist, fe // vce(robust) not available 
. estimates store FE 

. qui logitfe dmdu $xlist, teffects(no) nocorrection 

. estimates store FEDV 

. qui logitfe dmdu $xlist, teffects(no) analytical 

. estimates store FEDVBC 

. qui xtlogit dmdu $xlist aveage avelfam avechild, re vce(robust) 


. estimates store CRE 


The slope parameter estimates are 


. * Panel logit estimator comparison results
. estimates table PA RE FE FEDV FEDVBC CRE, equations(1) se b(%7.4f)
>     stats(N ll) stfmt(%7.0f) varwidth(7)

Variable  |      PA         RE         FE        FEDV      FEDVBC
#1        |
   lcoins |  -0.1603    -0.2404   (omitted)  (omitted)  (omitted)
          |   0.0108     0.0163
 ndisease |   0.0515     0.0782   (omitted)  (omitted)  (omitted)
          |   0.0039     0.0057
   female |   0.2977     0.4631   (omitted)  (omitted)  (omitted)
          |   0.0438     0.0668
      age |   0.0046     0.0073    -0.0342    -0.0453    -0.0351
          |   0.0021     0.0032     0.0184     0.0212     0.0212
     lfam |  -0.2044    -0.3022     0.4788     0.6507     0.4809
          |   0.0455     0.0676     0.2597     0.3033     0.3033
    child |   0.1185     0.1935     0.2705     0.3616     0.2749
          |   0.0674     0.1013     0.1685     0.1955     0.1955
   aveage |
  avelfam |
 avechild |
    _cons |   0.5777     0.8630
          |   0.1066     0.1598
/lnsig2u  |              1.2257
          |              0.0495
Statistics|
        N |    20186      20186       9025       9025       9025
       ll |              -10879      -3396      -5515      -5515

Legend: b/se


Variable  |     CRE
#1        |
   lcoins |  -0.2412
          |   0.0163
 ndisease |   0.0784
          |   0.0058
   female |   0.4620
          |   0.0670
      age |  -0.0342
          |   0.0186
     lfam |   0.5008
          |   0.2743
    child |   0.2950
          |   0.1783
   aveage |   0.0404
          |   0.0190
  avelfam |  -0.8380
          |   0.2834
 avechild |  -0.1511
          |   0.2174
    _cons |   0.9578
          |   0.1716
/lnsig2u  |   1.2289
          |   0.0496
Statistics|
        N |    20186
       ll |   -10871

Legend: b/se


The pooled logit estimates with cluster-robust standard errors are omitted from
the table but are very similar to the PA estimates. The RE logit estimates differ quite
substantially from the PA logit estimates, though, as already noted, the associated t
statistics are quite similar.


The remaining columns present various estimates of the FE model. The
consistent FE estimates are much less precise than the preceding estimates and are
available only for time-varying regressors. Note also that panel-robust standard
errors are unavailable for the logit FE estimator because its consistency requires that
observations be independent over time for a given individual once the fixed effect is
included. The logit DV estimates (column FEDV) differ substantially from the FE
estimates because of incidental parameters bias, which is large here with at most
five time periods. The bias-corrected DV estimates (column FEDVBC) are within 10%
of the FE estimates with standard errors that are around 20% larger. The CRE
estimates for the time-varying regressors age, lfam, and child are within 10% of
the FE estimates, as are the corresponding standard errors. In this example, the
various FE estimators, aside from the inconsistent logit FEDV estimator, give similar
results.


22.4.11 Panel logit prediction and MEs 


The predict postestimation command has several options that vary depending on
whether the xtlogit command was used with the pa, re, or fe option. The if
qualifier of the predict and margins commands can be used to obtain predictions
and MEs for specific years.


After the xtlogit, pa command, the default predict option is mu, which gives
the predicted probability given in (22.5). After the xtlogit, re command, the
default predict option is xb. This gives the marginal or unconditional probability
that integrates out $\alpha_i$, so
$\Pr(y_{it}=1|\mathbf{x}_{it},\boldsymbol{\beta}) = \int \Lambda(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\, h(\alpha_i|\sigma_\alpha^2)\, d\alpha_i$,
where $h(\alpha_i|\sigma_\alpha^2)$ is the $N(0,\sigma_\alpha^2)$ density. After the
xtlogit, fe command, the default predict option is pc1, which produces the
conditional probability that $y_{it}=1$ given that exactly one of
$y_{i1},\ldots,y_{iT_i}$ equals 1. This is used because this conditional
probability does not depend on $\alpha_i$; the formula in the special case $T_i = 2$
is given in (22.9). An alternative predict option is pu0, which gives the predicted
probability when $\alpha_i = 0$.


The following summarizes the default AMEs for the PA, RE, and FE logit models.
For PA and RE models, these are based on the predict defaults, while for the FE
model, the margins default is pu0, which gives the MEs when $\alpha_i = 0$ (pc1 is
not available).


. * AMEs for PA, RE, and FE estimates
. foreach model in pa re fe {
  2.   qui xtlogit dmdu $xlist, `model'
  3.   qui margins, dydx(*)
  4.   display "Model `model':" _col(11) "lcoins" _col(21) "ndisease"
>         _col(31) "female" _col(41) "age" _col(51) "lfam" _col(61) "child"
  5.   display _col(11) %7.4f r(b)[1,1] _col(21) %7.4f r(b)[1,2]
>         _col(31) %7.4f r(b)[1,3] _col(41) %7.4f r(b)[1,4]
>         _col(51) %7.4f r(b)[1,5] _col(61) %7.4f r(b)[1,6]
  6. }
Model pa: lcoins    ndisease  female    age       lfam      child
          -0.0326    0.0105    0.0606    0.0009   -0.0416    0.0241
Model re: lcoins    ndisease  female    age       lfam      child
          -0.0320    0.0104    0.0617    0.0010   -0.0403    0.0258
Model fe: lcoins    ndisease  female    age       lfam      child
           0.0000    0.0000    0.0000   -0.0075    0.1047    0.0591


The AMEs following PA and RE estimation are very similar, and those for the PA
estimator are much simpler to compute, as emphasized by Drukker (2008). The
AMEs following FE estimation differ substantially because they are for
$\Pr(y_{it}=1|\mathbf{x}_{it},\alpha_i=0)$ rather than for $\Pr(y_{it}=1|\mathbf{x}_{it})$.


For the FE model, the AMEs for $\Pr(y_{it}=1|\mathbf{x}_{it})$ can be obtained using the
logitfe command with bias correction. For brevity, we omit the intermediate
lengthy output that includes parameter estimates. We obtain


. * AMEs for bias-corrected DV FE estimator 
. logitfe dmdu $xlist, teffects(no) analytical 


(output omitted ) 
Average Partial Effects

        dmdu | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      lcoins |          0  (omitted)
    ndisease |          0  (omitted)
      female |          0  (omitted)
         age |  -.0055163    .004079    -1.35   0.176     -.013511    .0024784
        lfam |   .0756653   .0585377     1.29   0.196    -.0390665    .1903972
       child |   .0427447   .0180019     2.37   0.018     .0074615    .0780278


The AMEs differ from the preceding AMEs for the FE estimator, as expected. They 
also differ from those following PA and RE estimation, including two sign changes, 
though the sign changes are for AMEs that are statistically insignificant at 5%. 


A faster alternative is to fit the CRE model. We obtain 


. * AMEs for CRE estimator 
. qui xtlogit dmdu $xlist aveage avelfam avechild, re vce(robust) 


. margins, dydx(*) 

Average marginal effects                                Number of obs = 20,186
Model VCE: Robust

Expression: Pr(dmdu=1), predict(pr)
dy/dx wrt:  lcoins ndisease female age lfam child aveage avelfam avechild

             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      lcoins |  -.0320821   .0020815   -15.41   0.000    -.0361618   -.0280024
    ndisease |   .0104306   .0007412    14.07   0.000     .0089779    .0118834
      female |   .0614568    .008835     6.96   0.000     .0441404    .0787731
         age |  -.0045543   .0024691    -1.84   0.065    -.0093936    .0002849
        lfam |   .0666244   .0364848     1.83   0.068    -.0048844    .1381332
       child |   .0392392   .0237247     1.65   0.098    -.0072603    .0857387
      aveage |   .0053778     .00252     2.13   0.033     .0004387    .0103169
     avelfam |  -.1114862   .0376606    -2.96   0.003    -.1852996   -.0376727
    avechild |  -.0200974   .0289229    -0.69   0.487    -.0767852    .0365905


The AMEs for age, lfam, and child are, respectively, -0.0046, 0.0666, and 0.0392,
quite similar to -0.0055, 0.0757, and 0.0427 for the bias-corrected DV estimator of
the FE model.


22.4.12 Mixed-effects logit estimator 


The RE logit model specifies only the intercept to be normally distributed. Slope 
parameters may also be normally distributed. The melogit command estimates the 
parameters of this model, which is the logit extension of mixed for linear models, 
presented in section 6.7.3. 


For example, the following yields the same estimates as xtlogit with the re
option, aside from minor computational differences:


. * Following identical to xtlogit, re command 
. melogit dmdu lcoins ndisease female age lfam child || id: 


(output omitted ) 


Adding the lcoins and ndisease variables after id: allows the intercept and
slope parameters for lcoins and ndisease to be jointly normally distributed. Then a
trivariate integral is computed with the Gauss-Hermite quadrature, and estimation is
very computationally intensive; without restrictions on the variance matrix, the
model may not even be estimable.
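
The random-slopes specification just described might be requested as in the following sketch. It assumes the chapter's dataset is in memory; the covariance() and intpoints() choices are illustrative assumptions rather than settings used in the text, and reducing the number of integration points trades accuracy for speed.

```stata
* Sketch only: random intercept plus random slopes on lcoins and ndisease
* (assumes the chapter's data are in memory; intpoints(5) is an arbitrary choice)

* Restricted variance matrix: the three random effects are independent
melogit dmdu lcoins ndisease female age lfam child || id: lcoins ndisease, ///
    covariance(independent) intpoints(5)

* Unrestricted 3 x 3 variance matrix: slower and may fail to converge
melogit dmdu lcoins ndisease female age lfam child || id: lcoins ndisease, ///
    covariance(unstructured) intpoints(5)
```

Starting with the restricted variance matrix and relaxing it only if the fit converges is one cautious way to proceed.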


As in the linear case, the mixed logit model is used more for clustered data than 
for panel data and is mostly used in areas of applied statistics other than 
econometrics. 


22.4.13 Dynamic panel logit models 


Many current discrete economic outcomes are partly determined by past outcomes. 
Reasons underlying such dependence include the force of inertia, costs of 
adjustment, habit persistence, and true state dependence. For example, the 
probability that a currently unemployed person would remain unemployed in the 
next period may depend on previous unemployment duration. 


There are several ways of modeling dynamic dependence, two of which are
standard. The simplest and most direct is to capture dependence on past values of
exogenous variables by including them as regressors; for example, the function
$\Pr(y_{it}=1|\mathbf{x}_{it},\mathbf{x}_{it-1})$ implies that a change in $\mathbf{x}$ has both a
contemporaneous and a lagged effect. Then one can still use either an FE or RE
formulation to fit the model and interpret the results, as discussed in preceding
sections.


A second standard method is to use an autoregressive model with a lagged
dependent variable as a regressor to capture dynamic dependence. The lagged
dependent variable is treated as a shifter of the current transition probability after
controlling for other regressors and individual unobserved heterogeneity. If there are
compelling reasons to model dynamics by introducing $y_{it-1}$ as a regressor, then
estimation is straightforward under RE assumptions. The presence of a lagged
dependent variable implies that any shock affects the outcome in all future
periods. Hence, a distinction between contemporaneous MEs and dynamic
MEs arises in principle. There is scant literature that discusses this issue in the
context of discrete outcomes.
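
As a minimal sketch of the RE approach just described, the lagged outcome can be added with Stata's time-series lag operator. This assumes the data are xtset with id as the panel variable and year as the time variable; the specification is illustrative only and, as written, treats the first observed outcome of each individual as unrelated to the random effect, which is a strong assumption.

```stata
* Sketch only: RE logit with a lagged dependent variable
* (assumes the data have been declared with: xtset id year)
xtlogit dmdu L.dmdu lcoins ndisease female age lfam child, re vce(robust)
```

The first observation of each individual is dropped because its lag is missing, so the estimation sample is smaller than in the static models.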


An FE logit model with lagged dependent variable adds $y_{it-1}$ as a regressor in
(22.6). Conceptually, this model is attractive because it allows for distinction
between unobserved heterogeneity ($\alpha_i$) and true state dependence ($y_{it-1}$).
Estimation is challenging because the usual tricks for eliminating fixed effects (and
time-invariant regressors) do not work.


Initial work by Honoré and Kyriazidou (2000) considered a pure autoregressive
panel model. Bartolucci and Nigro (2010) considered a more general regression
model with a lagged dependent variable that is not equivalent to the logit model but
mimics it and hence may be treated as an alternative. This alternative model,
referred to as the conditional quadratic exponential specification, adds interaction
terms of form $y_{it-1} \times y_{it}$ (hence the term "quadratic"), which enables
elimination of the $\alpha_i$ by conditioning on $\sum_t y_{it}$. It uses only the subset of
observations in the sample for which at least one transition is recorded. The
community-contributed cquad package (Bartolucci 2015) implements several variants
of this estimator.


An alternative method is to obtain bias-corrected estimates from regular logit
regression of $y_{it}$ on individual-specific DVs and regressors that include the lagged
dependent variable. The logitfe command (Cruz-Gonzalez, Fernández-Val, and
Weidner 2017) also covers this case and can provide both parameter estimates and
AMEs. In the dynamic case, one needs to use lags(#) to specify the value of a
necessary trimming parameter.


22.4.14 Panel probit models 


Panel probit models can be fit using the xtprobit command, which is similar to the
xtlogit command and has similar options.


An important difference, however, is that the fe option is not available because,
unlike for the panel logit model, conditioning on $\sum_t y_{it}$ does not eliminate the
fixed effects. Recent work has pointed out the often-referenced result that consistent
estimation of the slope coefficients of the probit FE model in a short panel is
restricted to the case $T = 2$. It is possible that consistent estimation procedures will
be developed for $T > 2$.


For FE models, the easiest way to proceed is to obtain bias-corrected estimates of
a panel probit DV model using the probitfe command (Cruz-Gonzalez, Fernández-Val,
and Weidner 2017) or to fit a probit CRE model. Both estimators can also yield
AMEs.
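
The probit CRE model can be sketched by mirroring the CRE logit specification used earlier in the chapter. The sketch assumes that the individual-mean variables aveage, avelfam, and avechild created for the CRE logit model are still in memory; it is an illustration, not output reported in the text.

```stata
* Sketch only: CRE probit, mirroring the CRE logit specification
* (assumes aveage, avelfam, and avechild already exist in the dataset)
xtprobit dmdu lcoins ndisease female age lfam child ///
    aveage avelfam avechild, re vce(robust)

* AMEs for the time-varying regressors
margins, dydx(age lfam child)
```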


22.4.15 Panel ordered logit and probit models 


The outcome variable sometimes may be categorical and ordered. Measuring 
outcomes on an ordinal scale is common, especially for outcomes that are measured 
subjectively. For example, a patient may be asked to report pain intensity level on a 
10-point scale or personal health status according to five categories such as poor, 
fair, good, very good, and excellent. When such responses are obtained in repeated 
samples from the same individuals, ordered logit or probit models provide a natural 
framework for analyzing such multinomial outcomes; see section 18.9. 


Stata provides two commands, xtologit and xtoprobit, for RE estimation of 
ordered categorical panel data. The syntax of these commands is similar to that for 
the logit and probit panel models. 


We illustrate the xtologit command by analyzing recoded Rand count data on
annual doctor visits (mdu). The seven recoded categories are as follows: 1 (no visits),
2 (1 or 2 visits), 3 (3 or 4 visits), 4 (5 or 6 visits), 5 (7 or 8 visits), 6 (9 to 12
visits), and 7 (more than 12 visits). This categorization is somewhat arbitrary and only
illustrative.


The following descriptive summary displays some key features of the outcome 
variable. 


. * Tabulate cmdu generated by recoding count into seven ordered categories
. recode mdu (0 = 1) (1/2 = 2) (3/4 = 3) (5/6 = 4) (7/8 = 5) 
> (8/12 = 6) (13/999 = 7), gen(cmdu) 
(15507 differences between mdu and cmdu) 


. tabulate cmdu 


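  RECODE of |
mdu (Number |
face-to-fac |
       t md |
    visits) |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      6,308       31.25       31.25
          2 |      6,610       32.75       63.99
          3 |      3,229       16.00       79.99
          4 |      1,657        8.21       88.20
          5 |        939        4.65       92.85
          6 |        801        3.97       96.82
          7 |        642        3.18      100.00
------------+-----------------------------------
      Total |     20,186      100.00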


The first three categories account for nearly 80% of the probability mass. The
more-than-12-visits category is the smallest, accounting for just over 3% of the
probability mass.


The xtologit command is next executed with the cluster-robust option for
standard errors.


. * RE ordered logit: Estimates 
. xtologit cmdu lcoins ndisease female age lfam child, vce(cluster id) nolog 


Random-effects ordered logistic regression      Number of obs    =     20,186
Group variable: id                              Number of groups =      5,908

Random effects u_i ~ Gaussian                   Obs per group:
                                                             min =          1
                                                             avg =        3.4
                                                             max =          5

Integration method: mvaghermite                 Integration pts. =         12

                                                Wald chi2(6)     =     766.74
Log pseudolikelihood = -29148.65                Prob > chi2      =     0.0000

                                 (Std. err. adjusted for 5,908 clusters in id)
             |               Robust
        cmdu | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      lcoins |  -.2304359   .0144439   -15.95   0.000    -.2587454   -.2021265
    ndisease |   .0832606    .004843    17.19   0.000     .0737685    .0927527
      female |   .4172223   .0602473     6.93   0.000     .2991398    .5353048
         age |   .0064605   .0028157     2.29   0.022     .0009418    .0119793
        lfam |  -.3604544   .0588079    -6.13   0.000    -.4757157   -.2451931
       child |   .2004983   .0863054     2.32   0.020     .0313428    .3696537
-------------+----------------------------------------------------------------
       /cut1 |  -.9207318    .141173                     -1.197426   -.6440378
       /cut2 |   1.312918   .1416307                      1.035327    1.590509
       /cut3 |   2.612945   .1428847                      2.332896    2.892994
       /cut4 |   3.581754   .1444939                      3.298551    3.864957
       /cut5 |   4.407798   .1470729                      4.119541    4.696056
       /cut6 |   5.627819   .1531991                      5.327555    5.928084
-------------+----------------------------------------------------------------
   /sigma2_u |   3.666984   .1353704                      3.411034    3.942139


The individual-specific random component of the intercept follows the normal
distribution whose estimated variance is reported in the last row of the output. The
log likelihood has no closed-form expression, so numerical integration is used in
optimization. The regression coefficients have qualitatively the same signs as in
the RE panel count models reported in later sections of this chapter. We focus on the
key variable lcoins, the log of the coinsurance rate. The result indicates that a
higher coinsurance rate is associated with fewer visits on average. As in the
cross-sectional ordered logit case, the output includes the six estimated cutoffs.


The regression coefficients are less informative than the MEs. Next we use the
margins command to obtain the AMEs of lcoins.


. * RE ordered logit: AMEs of coinsurance rate on probability of category outcomes 
. margins, dydx(lcoins) 


Average marginal effects                                Number of obs = 20,186
Model VCE: Robust

dy/dx wrt:  lcoins

1._predict: Pr(1.cmdu), predict(pr outcome(1))
2._predict: Pr(2.cmdu), predict(pr outcome(2))
3._predict: Pr(3.cmdu), predict(pr outcome(3))
4._predict: Pr(4.cmdu), predict(pr outcome(4))
5._predict: Pr(5.cmdu), predict(pr outcome(5))
6._predict: Pr(6.cmdu), predict(pr outcome(6))
7._predict: Pr(7.cmdu), predict(pr outcome(7))

             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
lcoins       |
    _predict |
          1  |   .0301368   .0018442    16.34   0.000     .0265223    .0337513
          2  |   .0017421   .0003038     5.73   0.000     .0011467    .0023376
          3  |  -.0079852   .0005111   -15.62   0.000    -.0089869   -.0069834
          4  |  -.0072376   .0004694   -15.42   0.000    -.0081575   -.0063176
          5  |  -.0054922   .0003726   -14.74   0.000    -.0062226   -.0047619
          6  |  -.0057985   .0004114   -14.09   0.000    -.0066049   -.0049921
          7  |  -.0053655   .0004143   -12.95   0.000    -.0061775   -.0045534


Computing MEs is somewhat slower than model estimation; it involves 
numerical integration because, as explained in section 22.4.11, the ME depends upon 
the distribution of the individual-specific random effects. The estimated AME is 
obtained by averaging over the distribution of the random effects. 


The estimated AME, the change in the probability of the outcome being in a
particular category, varies by category. The effect is positive in the first two
categories with two or fewer visits; the effect is relatively small and negative, but
highly significant, in categories with three or more visits. This result illustrates that
in an ordered model, the sign of a regressor's coefficient does not necessarily reflect
the sign of the ME in all categories. As expected, the AMEs sum to zero across the
seven categories.
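
The adding-up property can be checked directly. The short derivation below is a sketch in which $x_{k,it}$ is notation assumed here for a single regressor; the numerical check uses the seven AMEs reported in the margins output.

```latex
% The seven category probabilities sum to one, so their derivatives sum to zero
\sum_{j=1}^{7} \Pr(y_{it}=j \mid \mathbf{x}_{it}) = 1
\quad\Longrightarrow\quad
\sum_{j=1}^{7} \frac{\partial \Pr(y_{it}=j \mid \mathbf{x}_{it})}{\partial x_{k,it}} = 0

% Check with the reported AMEs of lcoins
0.0301368 + 0.0017421 - 0.0079852 - 0.0072376 - 0.0054922 - 0.0057985 - 0.0053655
  = -0.0000001 \approx 0
```

The small remainder reflects rounding of the reported values.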


Finally, we use the predict command to compare the fitted distribution with the
actual frequency distribution of outcomes.


. * RE ordered logit: Estimate fitted average category probabilities
. predict pr*, pr
(using 12 quadrature points)

. summarize pr*

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         pr1 |     20,186    .3142163    .1071772   .0116773     .62827
         pr2 |     20,186    .3187694    .0263436   .0582258   .3359595
         pr3 |     20,186    .1592766    .0282325   .0702338   .1995985
         pr4 |     20,186    .0851783    .0264847   .0257301    .149432
         pr5 |     20,186    .0495467    .0212988   .0112889   .1275895
         pr6 |     20,186    .0424561    .0243255   .0073362   .1875392
         pr7 |     20,186    .0305566    .0272974   .0035658   .4295574


The result shows that the fitted distribution of probability mass across categories 
is quite similar to that in the data displayed at the start of this section. 


The analysis of the same data using xtoprobit is left to the interested reader. 
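
For readers who wish to try this, a minimal sketch of the corresponding RE ordered probit analysis follows. It reuses the cmdu variable and regressor list from the xtologit example; no output is shown because none is reported in the text.

```stata
* Sketch only: RE ordered probit counterpart of the xtologit example
xtoprobit cmdu lcoins ndisease female age lfam child, vce(cluster id) nolog

* AMEs of lcoins on the seven category probabilities
margins, dydx(lcoins)
```

The coefficients are on a different scale from the ordered logit coefficients, so comparisons across the two models are better made through the AMEs.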


22.5 Tobit and interval-data models 


We fit a panel tobit model for medical expenditures (med). Then the only
panel estimator available is the RE estimator, which introduces a normally
distributed individual-specific effect. Consistent estimation of an FE model is
possible only if $T \to \infty$.


22.5.1 Panel summary of the dependent variable 


For simplicity, we model expenditures in levels, though from section 19.4, 
the key assumption of normality for the tobit model is more reasonable for 
the natural logarithm of expenditures. 


The dependent variable, med, has within variation and between variation
of similar magnitude because the xtsum command yields


. * Tobit: Panel summary of dependent variable 
. xtsum med 


Variable         |      Mean   Std. dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
med      overall |  171.5892   698.2689          0   39182.02 |     N =   20186
         between |             503.2589          0   19615.14 |     n =    5908
         within  |              526.269  -19395.28    20347.2 | T-bar = 3.41672
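

22.5.2 RE tobit model


The RE panel tobit model specifies the latent variable $y_{it}^*$ to depend on
regressors, an idiosyncratic error, and an individual-specific error, so

$$y_{it}^* = \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i + \varepsilon_{it} \qquad (22.10)$$

where $\alpha_i \sim N(0, \sigma_\alpha^2)$ and $\varepsilon_{it} \sim N(0, \sigma_\varepsilon^2)$ and the
regressor vector $\mathbf{x}_{it}$ includes an intercept. For left-censoring at $L$, we
observe the $y_{it}$ variable, where

$$y_{it} = \begin{cases} y_{it}^* & \text{if } y_{it}^* > L \\ L & \text{if } y_{it}^* \le L \end{cases} \qquad (22.11)$$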

22.5.3 The xttobit command 


The xttobit command has a similar syntax to the cross-sectional tobit
command. The ll() option is used to define the lower limit for left-censoring,
and the ul() option is used to define the upper limit for right-censoring.
The limit can be a variable, not just a number, so more generally
we can have the limit $L_i$ rather than the limit $L$ in (22.11). Like the RE logit
model, estimation requires univariate numerical integration, using Gauss-Hermite
quadrature. The vce(bootstrap) option can provide panel-robust
standard errors.


For our data, we obtain 


. * Tobit RE estimator
. xttobit med lcoins ndisease female age lfam child, ll(0) nolog

Random-effects tobit regression                 Number of obs    =     20,186
                                                      Uncensored =     15,733
Limits: Lower = 0                                  Left-censored =      4,453
        Upper = +inf                              Right-censored =          0

Group variable: id                              Number of groups =      5,908
Random effects u_i ~ Gaussian                   Obs per group:
                                                             min =          1
                                                             avg =        3.4
                                                             max =          5

Integration method: mvaghermite                 Integration pts. =         12

                                                Wald chi2(6)     =     573.45
Log likelihood = -130030.45                     Prob > chi2      =     0.0000

         med | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      lcoins |  -31.10247   3.578498    -8.69   0.000     -38.1162   -24.08875
    ndisease |   13.49452   1.139156    11.85   0.000     11.26182    15.72722
      female |   60.10112   14.95966     4.02   0.000     30.78072    89.42152
         age |   4.075582   .7238253     5.63   0.000     2.656911    5.494254
        lfam |  -57.75023   14.68422    -3.93   0.000    -86.53077   -28.96968
       child |  -52.02314   24.21619    -2.15   0.032    -99.48599   -4.560284
       _cons |  -98.27203   36.05977    -2.73   0.006    -168.9479   -27.59618
-------------+----------------------------------------------------------------
    /sigma_u |   371.3134    8.64634    42.94   0.000     354.3668    388.2599
    /sigma_e |   715.1779   4.704581   152.02   0.000     705.9571    724.3987
-------------+----------------------------------------------------------------
         rho |   .2123246   .0086583                      .1957541    .2296872

LR test of sigma_u=0: chibar2(01) = 778.18             Prob >= chibar2 = 0.000


About 22% of the observations are censored (4,453 of 20,186). All regressor
coefficients are statistically significant and have the expected sign. The
estimated standard deviation of the random effects $\alpha_i$ is 371.3 and is highly
statistically significant. The quantity labeled rho equals
$\sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2)$ and measures the fraction of the total
variance, $\sigma_\alpha^2 + \sigma_\varepsilon^2$, that is due to the
random effects. In an exercise, we compare these estimates with those from
the tobit command, which treats observations as independent over $i$ and $t$
(so $\alpha_i = 0$). The estimates are similar.
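
As a check on the reported value of rho, the calculation can be reproduced by hand from the two estimated standard deviations in the output, taking /sigma_u as the estimate of $\sigma_\alpha$ and /sigma_e as the estimate of $\sigma_\varepsilon$ (the hats below are notation assumed here).

```latex
\hat{\rho}
  = \frac{\hat{\sigma}_\alpha^2}{\hat{\sigma}_\alpha^2 + \hat{\sigma}_\varepsilon^2}
  = \frac{371.3134^2}{371.3134^2 + 715.1779^2}
  \approx \frac{137{,}873.6}{137{,}873.6 + 511{,}479.4}
  \approx 0.2123
```

So roughly one-fifth of the variance of the composite error is attributed to the individual-specific effect.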


22.5.4 The xtheckman command 


The xtheckman command extends the Heckman selection model heckman 
command to the panel case by adding individual-specific random errors to 
the outcome and selection equations. 


The cross-sectional selection model is presented in section 19.6. In the
panel case, we specify the latent variables for selection and outcome to be,
respectively,

$$y_{1it}^* = \mathbf{x}_{1it}'\boldsymbol{\beta}_1 + \alpha_{1i} + \varepsilon_{1it}$$

$$y_{2it}^* = \mathbf{x}_{2it}'\boldsymbol{\beta}_2 + \alpha_{2i} + \varepsilon_{2it}$$

The outcome $y_{2it}$ is observed only when $y_{1it}^* > 0$. The individual-specific
errors $\alpha_{1i}$ and $\alpha_{2i}$ are specified to be bivariate normally distributed,
and the idiosyncratic errors $\varepsilon_{1it}$ and $\varepsilon_{2it}$ are specified to be bivariate
normally distributed with the normalization that $\mathrm{Var}(\varepsilon_{1it}) = 1$. As in the
cross-sectional case, it is desirable for some variables in $\mathbf{x}_{1it}$ to differ from
those in $\mathbf{x}_{2it}$ to reduce the reliance on strong distributional assumptions.


ML estimates can be obtained using the xtheckman command, which has
a similar syntax to the cross-sectional heckman command. Estimation uses a
computational algorithm related to Roodman (2011). There is no closed-form
solution for the log likelihood, so estimation is by Gaussian quadrature.
The intpoints(#) option allows many more integration points than the
default of seven points. A higher number leads to greater precision and can
lead to greater likelihood of convergence but adds to computational time.
Convergence can be challenging if the identification of some of the
parameters is poor or due to failure of the parametric assumptions.


As an illustration, we model outpatient medical expenditures (outpdol);
convergence problems arose using variable med, which is much more highly
skewed because it additionally includes inpatient expenditures. To aid
convergence, we make the strong (and questionable) exclusion restrictions
that variables female, lfam, child, hlthf (health status fair), hlthp (health
status poor), and linc (log family income) do not appear in the outcome
model. We set intpoints(10) and, to speed computation, use a 10%
subsample. We obtain


. * Heckman sample-selection panel estimator
. gen doutpdol = outpdol > 0

. xtheckman outpdol lcoins ndisease age if id > 629388, intpoints(10) nolog
> select(doutpdol = lcoins ndisease age female lfam child hlthf hlthp linc)

Random-effects regression with selection        Number of obs    =      2,016
                                                        Selected =      1,286
                                                     Nonselected =        730

Group variable: id                              Number of groups =        638
                                                Obs per group:
                                                             min =          1
                                                             avg =        3.2
                                                             max =          5

Integration method: mvaghermite                 Integration pts. =         10

                                                Wald chi2(3)     =      58.44
Log likelihood = -8425.1472                     Prob > chi2      =     0.0000

                 | Coefficient  Std. err.      z    P>|z|   [95% conf. interval]
-----------------+--------------------------------------------------------------
outpdol          |
          lcoins |   .5330196   1.358528     0.39   0.695   -2.129646   3.195685
        ndisease |     1.9796   .4178257     4.74   0.000    1.160676   2.798523
             age |   .6824011   .1496004     4.56   0.000    .3891897   .9756125
           _cons |   29.81417   7.534019     3.96   0.000    15.04776   44.58057
-----------------+--------------------------------------------------------------
doutpdol         |
          lcoins |  -.2035518   .0284604    -7.15   0.000   -.2593331  -.1477704
        ndisease |   .0598368    .010309     5.80   0.000    .0396315    .080042
             age |  -.0067661   .0057241    -1.18   0.237   -.0179851    .004453
          female |    .088402   .1128548     0.78   0.433   -.1327894   .3095934
            lfam |  -.4223696   .1147663    -3.68   0.000   -.6473075  -.1974318
           child |  -.5411109    .177272    -3.05   0.002   -.8885576  -.1936642
           hlthf |   .0395699   .1946469     0.20   0.839   -.3419311   .4210708
           hlthp |   .1131886   .3852884     0.29   0.769   -.6419627   .8683399
            linc |   .1146879   .0343245     3.34   0.001    .0474131   .1819627
           _cons |   .4863538    .410284     1.19   0.236   -.3177881   1.290496
-----------------+--------------------------------------------------------------
  var(e.outpdol) |    4286.03   271.7522                      3785.17   4853.163
-----------------+--------------------------------------------------------------
  corr(e.dout~l, |
      e.outpdol) |  -.6172399   .1096029    -5.63   0.000   -.7885422  -.3570515
-----------------+--------------------------------------------------------------
  var(out~l[id]) |     1549.2   207.8337                     1191.007    2015.12
  var(dou~l[id]) |   .9505599   .1499978                     .6976866   1.295086
-----------------+--------------------------------------------------------------
           corr( |
   doutpdol[id], |
    outpdol[id]) |   .4910298    .134623     3.65   0.000    .1874851   .7089668


The two correlation coefficients are quite precisely estimated, and they
imply the presence of endogenous selection. The individual-specific errors
$(\alpha_{1i}, \alpha_{2i})$ are positively correlated with correlation 0.49. The
idiosyncratic errors $(\varepsilon_{1it}, \varepsilon_{2it})$ are negatively correlated.


Additional output generated from this model can assist in further
interpretation. We use the margins command to generate predicted mean
outpatient expenditures, conditional on outpatient expenditures being
positive, at two different ages, averaging over the sample values of the
other predictors.


. * Predictive margins for outpdol given outpdol > 0 at different ages
. margins, at(age=(20(40)60)) predict(ycond)

Predictive margins                                       Number of obs = 2,016
Model VCE: OIM

Expression: mean of outpdol, predict(ycond)
1._at: age = 20
2._at: age = 60

             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   56.58816    2.77992    20.36   0.000     51.13961     62.0367
          2  |   82.24204   6.621706    12.42   0.000     69.26373    95.22034


22.5.5 The xtintreg and xtfrontier commands 


The xtintreg command estimates the parameters of interval-data models
where continuous data are reported only in ranges. For example, annual
medical expenditure data may be reported only as $0, between $0 and $100,
between $100 and $1,000, and more than $1,000. The unobserved
continuous variable, $y_{it}^*$, is modeled as in (22.10), and the observed variable,
$y_{it}$, arises as $y_{it}^*$ falls into the appropriate range.


Stochastic production frontier models introduce into the production
function a strictly negative error term that pushes production below the
efficient level. In the simplest panel model, this error term is time invariant
and has a truncated normal distribution, so the model has some
commonalities with the panel tobit model. The xtfrontier command is
used to estimate the parameters of these models.


All four commands—xttobit, xtheckman, xtintreg, and xtfrontier— 
rely heavily on the assumption of homoskedastic normally distributed errors 
for consistency and, like their cross-sectional counterparts, are more fragile 
to distributional misspecification than, for example, linear models and logit 
models. 


22.6 Count-data models 


We fit count models for the number of doctor visits (mdu). Many of the
relevant issues have already been raised for the xtlogit command. One
difference is that analytical solutions are possible for count RE models, by
appropriate choice of (nonnormal) distribution for the random effects. A
second difference is that Poisson panel estimators have the same robustness
properties as Poisson cross-sectional estimators. They are consistent even if
the data are not Poisson distributed, provided the conditional mean is
correctly specified. At the same time, count data are often overdispersed, and
the need to use heteroskedasticity-robust standard errors in the cross-sectional
case carries over to a need to use panel-robust standard errors in
the panel case.


22.6.1 The xtpoisson command 


The xtpoisson command has the syntax

xtpoisson depvar [indepvars] [if] [in] [weight] [, options]

The options include PA (pa), RE (re and normal), and FE (fe) models.


Then PA, RE, and FE estimators are available for the Poisson model, with 
the xtpoisson command. Panel estimators are also available for the negative 
binomial model, with xtnbreg. 


22.6.2 Panel summary of the dependent variable 


The dependent variable, mdu, is considerably overdispersed because, from
section 22.3.1, the sample variance of $4.50^2 = 20.25$ is 7 times the sample
mean of 2.86. This makes it very likely that default standard errors for both
cross-sectional and panel Poisson estimators will considerably understate the
true standard errors.


The mdu variable has a within variation of magnitude similar to the 
between variation. 


. * Poisson: Panel summary of dependent variable 
. xtsum mdu 


Variable         |      Mean   Std. dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
mdu      overall |  2.860696   4.504765          0         77 |     N =   20186
         between |             3.785971          0   63.33333 |     n =    5908
         within  |             2.575881  -34.47264    40.0607 | T-bar = 3.41672


To provide more detail on the variation in mdu over time, we look at 
transition probabilities, after first aggregating all instances of four or more 
doctor visits into a single category. We have 


. * Year-to-year transitions in doctor visits 
. generate mdushort = mdu 


. replace mdushort = 4 if mdu >= 4 
(4,039 real changes made) 


. xttrans mdushort 


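           |                         mdushort
  mdushort |        0         1         2         3         4 |    Total
-----------+--------------------------------------------------+---------
         0 |    58.87     19.61      9.21      4.88      7.42 |   100.00
         1 |    33.16     24.95     17.58     10.14     14.16 |   100.00
         2 |    23.55     24.26     17.90     12.10     22.19 |   100.00
         3 |    17.80     20.74     18.55     12.14     30.77 |   100.00
         4 |     8.79     11.72     12.32     11.93     55.23 |   100.00
-----------+--------------------------------------------------+---------
     Total |    31.81     19.27     13.73      9.46     25.73 |   100.00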


There is considerable persistence: over half of people with zero doctor visits 
one year also have zero visits the next year, and over half of people with four 
or more visits one year also have four or more visits the next year. 


We compute the correlations over time in the dependent variable. 


. * Correlations over time in the dependent variable
. pwcorr mdu L1.mdu L2.mdu L3.mdu L4.mdu

             |      mdu    L.mdu   L2.mdu   L3.mdu   L4.mdu
-------------+---------------------------------------------
         mdu |   1.0000
       L.mdu |   0.6184   1.0000
      L2.mdu |   0.4744   0.6029   1.0000
      L3.mdu |   0.3714   0.4602   0.5995   1.0000
      L4.mdu |   0.3820   0.3702   0.5054   0.6100   1.0000


These correlations generally decline as the lag length increases. Thus, the PA
estimator used below will relax the assumption of equicorrelation used in the logit
example to allow more flexible correlation.


22.6.3 Pooled Poisson estimator 


The pooled Poisson estimator assumes that $y_{it}$ is Poisson distributed with a
mean of

$$E(y_{it}|\mathbf{x}_{it}) = \exp(\mathbf{x}_{it}'\boldsymbol{\beta}) \qquad (22.12)$$


as in the cross-sectional case. Consistency of this estimator requires that 
(22.12) be correctly specified but does not require that the data actually be 
Poisson distributed. If the data are not Poisson distributed, however, then it 
is essential that robust standard errors be used. 


The pooled Poisson estimator can be obtained by using the poisson
command, with cluster-robust standard errors that take care of both
overdispersion and serial correlation. We have


. * Pooled Poisson estimator with cluster-robust standard errors
. poisson mdu lcoins ndisease female age lfam child, vce(cluster id)

Iteration 0:   log pseudolikelihood = -62580.248
Iteration 1:   log pseudolikelihood = -62579.401
Iteration 2:   log pseudolikelihood = -62579.401

Poisson regression                                      Number of obs = 20,186
                                                        Wald chi2(6)  = 476.93
                                                        Prob > chi2   = 0.0000
Log pseudolikelihood = -62579.401                       Pseudo R2     = 0.0609

                                 (Std. err. adjusted for 5,908 clusters in id)
             |               Robust
         mdu | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      lcoins |  -.0808023   .0080013   -10.10   0.000    -.0964846   -.0651199
    ndisease |   .0339334   .0026024    13.04   0.000     .0288328     .039034
      female |   .1717862   .0342551     5.01   0.000     .1046473    .2389251
         age |   .0040585   .0016891     2.40   0.016      .000748    .0073691
        lfam |  -.1481981   .0323434    -4.58   0.000      -.21159   -.0848062
       child |   .1030453   .0506901     2.03   0.042     .0036944    .2023961
       _cons |    .748789   .0785738     9.53   0.000     .5947872    .9027907


The importance of using cluster-robust standard errors cannot be
overemphasized. For these data, the correct cluster-robust standard errors
are 50% higher than the heteroskedasticity-robust standard errors and 300%
higher than the default standard errors; see the end-of-chapter exercises.
Here failure to control for autocorrelation and failure to control for
overdispersion both lead to considerable understatement of the true standard
errors.
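As a rough sketch of the comparison described above, the three sets of standard errors can be placed side by side. The stored-estimate names PDEF, PHET, and PCLU are arbitrary labels chosen here for illustration; the data and regressors are those already in use.

```stata
* Sketch: default, heteroskedasticity-robust, and cluster-robust standard errors
* (PDEF, PHET, and PCLU are illustrative names, not from the text)
quietly poisson mdu lcoins ndisease female age lfam child
estimates store PDEF
quietly poisson mdu lcoins ndisease female age lfam child, vce(robust)
estimates store PHET
quietly poisson mdu lcoins ndisease female age lfam child, vce(cluster id)
estimates store PCLU
* Coefficients are identical across columns; only the standard errors change
estimates table PDEF PHET PCLU, b(%9.4f) se(%9.4f) stats(N)
```

Because the point estimates do not depend on the VCE option, any difference across the three columns is confined to the second line of each entry.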


22.6.4 PA Poisson estimator 


The PA Poisson estimator is a variation of the pooled Poisson estimator that
relaxes the assumption of independence of $y_{it}$ to allow different models for
the correlation


$\rho_{ts} = \mathrm{Cor}[\{y_{it} - \exp(\mathbf{x}_{it}'\boldsymbol{\beta})\}\{y_{is} - \exp(\mathbf{x}_{is}'\boldsymbol{\beta})\}]$


This estimator is obtained by using the xtpoisson command with the pa
option. Different correlation models are specified by using the corr()
option; see section 8.4.3. Consistency of this estimator requires only that
(22.12) be correct. But if the data are non-Poisson and are overdispersed,
then the vce(robust) option should be used because otherwise default
standard errors will understate the true standard errors.


We use the corr(unstructured) option so that $\rho_{ts}$ can vary freely over $t$
and $s$. We obtain


. * Poisson PA estimator with unstructured error correlation and robust VCE
. xtpoisson mdu lcoins ndisease female age lfam child, pa corr(unstr) vce(robust)

Iteration 1: tolerance = .01585489
Iteration 2: tolerance = .00034066
Iteration 3: tolerance = 2.334e-06
Iteration 4: tolerance = 1.939e-08

GEE population-averaged model                        Number of obs    = 20,186
Group and time vars: id year                         Number of groups =  5,908
Family: Poisson                                      Obs per group:
Link:   Log                                                       min =      1
Correlation: unstructured                                         avg =    3.4
                                                                  max =      5
                                                     Wald chi2(6)     = 508.61
Scale parameter = 1                                  Prob > chi2      = 0.0000

                                     (Std. err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |               Robust
         mdu | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      lcoins |  -.0804454   .0077782   -10.34   0.000    -.0956904   -.0652004
    ndisease |   .0346067   .0024238    14.28   0.000     .0298561    .0393573
      female |   .1585075   .0334407     4.74   0.000     .0929649    .2240502
         age |   .0030901   .0015356     2.01   0.044     .0000803    .0060999
        lfam |  -.1406549   .0293672    -4.79   0.000    -.1982135   -.0830962
       child |   .1013677     .04301     2.36   0.018     .0170696    .1856658
       _cons |   .7764626   .0717221    10.83   0.000     .6358897    .9170354
------------------------------------------------------------------------------


The coefficient estimates are quite similar to those from pooled Poisson. The 
standard errors are as much as 10% lower, reflecting efficiency gain due to 
better modeling of the correlations. 


The estimated autocorrelation matrix is stored in e(R). We have


. * Correlations over time
. matrix list e(R)

symmetric e(R)[5,5]
           c1         c2         c3         c4         c5
r1          1
r2  .53143297          1
r3  .40817495  .58547795          1
r4  .32357326  .35321716  .54321752          1
r5  .34152288  .29803555  .43767583  .61948751          1


The correlations do not drop greatly as the lag length increases, so the 
simpler equicorrelation may be a reasonable approximation. 


A more detailed comparison of estimators and methods to estimate the
VCE (see the end-of-chapter exercises) shows that failure to use the
vce(robust) option leads to erroneous standard errors that are one-third of
the robust standard errors, and similar estimates are obtained by using the
corr(exchangeable) or corr(ar2) option.


22.6.5 RE Poisson estimators 


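The Poisson individual-effects model assumes that $y_{it}$ is Poisson distributed
with a mean of


$E(y_{it}|\alpha_i, \mathbf{x}_{it}) = \exp(\gamma_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \alpha_i \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$ (22.13)


where $\gamma_i = \ln \alpha_i$, and here $\mathbf{x}_{it}$ includes an intercept. The conditional mean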
can be viewed either as one with effects that are additive before 
exponentiation or as one with multiplicative effects. 


The standard Poisson RE estimator assumes that $\alpha_i$ is gamma distributed
with a mean of 1 and a variance of $\eta$. This assumption has the attraction that
there is a closed-form expression for the integral (22.4), so the estimator is
easy to compute. Furthermore, then $E(y_{it}|\mathbf{x}_{it}) = \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$ so that
predictions and MEs are easily obtained and interpreted. This is the
conditional mean given in (22.12) for the PA and pooled models, so for the
special case of the Poisson, and unlike the logit model, the PA, pooled, and
RE estimators of $\boldsymbol{\beta}$ have the same probability limit. Finally, the first-order
conditions for the Poisson RE estimator, $\widehat{\boldsymbol{\beta}}$, can be shown to be


(22.14)


where $\lambda_{it} = \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$ and $\bar{\lambda}_i = T^{-1}\sum_t \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$, so the estimator is
consistent if $E(y_{it}|\alpha_i, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}) = \alpha_i \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$ because then the left-hand
side of (22.14) has an expected value of 0.
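The moment results used in this discussion follow from iterated expectations. As a short sketch, assume (as the RE model does) that $\alpha_i$ is independent of the regressors with mean 1 and variance $\eta$, that the counts are independent over $t$ given $\alpha_i$, and write $\lambda_{it} = \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$. Then

```latex
\begin{align*}
E(y_{it}|\mathbf{x}_{it})
  &= E_{\alpha}\{E(y_{it}|\alpha_i,\mathbf{x}_{it})\}
   = E(\alpha_i)\,\lambda_{it} = \lambda_{it}, \\
\mathrm{Var}(y_{it}|\mathbf{x}_{it})
  &= E_{\alpha}(\alpha_i\lambda_{it}) + \mathrm{Var}_{\alpha}(\alpha_i\lambda_{it})
   = \lambda_{it} + \eta\,\lambda_{it}^{2}, \\
\mathrm{Cov}(y_{it},y_{is}|\mathbf{x}_{i1},\ldots,\mathbf{x}_{iT})
  &= \mathrm{Cov}_{\alpha}(\alpha_i\lambda_{it},\,\alpha_i\lambda_{is})
   = \eta\,\lambda_{it}\lambda_{is}, \qquad t \neq s.
\end{align*}
```

The first line gives the same conditional mean as the pooled and PA models. The second and third lines show that the single parameter $\eta$ governs both the overdispersion and the within-individual covariance.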


The RE estimator is obtained by using the xtpoisson command with the
re option. We use the vce(robust) option to obtain panel-robust standard
errors. We have


. * Poisson RE estimator with cluster-robust standard errors
. xtpoisson mdu lcoins ndisease female age lfam child, re vce(robust) nolog


Random-effects Poisson regression                    Number of obs    = 20,186
Group variable: id                                   Number of groups =  5,908
Random effects u_i ~ Gamma                           Obs per group:
                                                                  min =      1
                                                                  avg =    3.4
                                                                  max =      5
                                                     Wald chi2(6)     = 5407.91
Log pseudolikelihood = -43240.556                    Prob > chi2      = 0.0000

                                     (Std. err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |               Robust
         mdu | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      lcoins |  -.0878258   .0079239   -11.08   0.000    -.1033563   -.0722952
    ndisease |   .0387629   .0024087    16.09   0.000     .0340421    .0434838
      female |   .1667192   .0345869     4.82   0.000       .09893    .2345083
         age |   .0019159   .0016533     1.16   0.247    -.0013244    .0051563
        lfam |  -.1351786   .0360629    -3.75   0.000    -.2058606   -.0644966
       child |   .1082678   .0533869     2.03   0.043     .0036314    .2129043
       _cons |   .7574177   .0831229     9.11   0.000     .5944999    .9203355
-------------+----------------------------------------------------------------
    /lnalpha |   .0251256   .0905423                     -.1523339    .2025852
-------------+----------------------------------------------------------------
       alpha |   1.025444    .092846                      .8587015    1.224564
------------------------------------------------------------------------------
LR test of alpha=0: chibar2(01) = 3.9e+04              Prob >= chibar2 = 0.000


Compared with the PA estimates, the RE coefficients are within 10%, and the
RE cluster-robust standard errors are about 10% higher. The cluster-robust
standard errors for the RE estimates are 20% to 50% higher than the default
standard errors, not reported, so cluster-robust standard errors are needed.
The problem is that the Poisson RE model is not sufficiently flexible because
the single additional parameter, $\eta$, needs to simultaneously account for both
overdispersion and correlation. Cluster-robust standard errors can correct for
this, or the richer negative binomial RE model may be used.


An alternative Poisson RE estimator assumes that $\gamma_i = \ln \alpha_i$ is normally
distributed with a mean of 0 and a variance of $\sigma^2$, similar to the xtlogit and
xtprobit commands. Here estimation is much slower because Gauss-Hermite
quadrature is used to perform numerical univariate integration. And
similarly to the logit RE estimator, prediction and computation of MEs is
difficult. This alternative Poisson RE estimator can be computed by using
xtpoisson with the normal option. Estimates from this method are presented
in section 22.6.7.


The RE model permits only the intercept to be random. We can also allow 
slope coefficients to be random. This is the mixed-effects Poisson estimator 
implemented with mepoisson. The method is similar to that for melogit, 
presented in section 22.4.12. The method is computationally intensive. 
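A minimal sketch of the mixed-effects approach follows. The choice of child as the regressor with a random slope, and the names ME_INT and ME_SLOPE, are assumptions made here purely for illustration. The random-intercept-only fit should be close to the xtpoisson, re normal fit, though the two commands use different quadrature defaults.

```stata
* Sketch: mixed-effects Poisson with a random intercept, then a random slope
* (ME_INT and ME_SLOPE are illustrative names; the random slope on child is
*  an arbitrary choice for illustration, not taken from the text)
mepoisson mdu lcoins ndisease female age lfam child || id:
estimates store ME_INT
mepoisson mdu lcoins ndisease female age lfam child || id: child
estimates store ME_SLOPE
* Compare coefficients and log likelihoods of the two fits
estimates table ME_INT ME_SLOPE, b(%9.4f) se(%9.4f) stats(N ll)
```

With over 20,000 observations, the second fit in particular may take some time to converge.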


22.6.6 FE Poisson estimator 


The FE model is the Poisson individual-effects model (22.13), where $\alpha_i$ is
now possibly correlated with $\mathbf{x}_{it}$, and in short panels, we need to eliminate
$\alpha_i$ before estimating $\boldsymbol{\beta}$.


These effects can be eliminated by using the conditional ML estimator
based on a log density for the $i$th individual that conditions on $\sum_{t=1}^{T} y_{it}$,
similar to the treatment of fixed effects in the logit model. Some algebra
leads to the Poisson FE estimator with first-order conditions


$\sum_{i=1}^{N} \sum_{t=1}^{T} \mathbf{x}_{it} \left( y_{it} - \frac{\lambda_{it}}{\bar{\lambda}_i} \bar{y}_i \right) = \mathbf{0}$ (22.15)


where $\lambda_{it} = \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$ and $\bar{\lambda}_i = T^{-1}\sum_t \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$. The Poisson FE
estimator is therefore consistent if $E(y_{it}|\alpha_i, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}) = \alpha_i \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$
because then the left-hand side of (22.15) has the expected value of zero.


The Poisson FE estimator can be obtained by using the xtpoisson
command with the fe option. To obtain cluster-robust standard errors, we
can use the vce(robust) option. We have


. * Poisson FE estimator with cluster-robust standard errors
. xtpoisson mdu lcoins ndisease female age lfam child, fe vce(robust)


note: 265 groups (265 obs) dropped because of only one obs per group
note: 666 groups (2130 obs) dropped because of all zero outcomes
note: lcoins dropped because it is constant within group
note: ndisease dropped because it is constant within group
note: female dropped because it is constant within group

Iteration 0:  log pseudolikelihood = -24182.852
Iteration 1:  log pseudolikelihood = -24173.211
Iteration 2:  log pseudolikelihood = -24173.211

Conditional fixed-effects Poisson regression         Number of obs    = 17,791
Group variable: id                                   Number of groups =  4,977
                                                     Obs per group:
                                                                  min =      2
                                                                  avg =    3.6
                                                                  max =      5
                                                     Wald chi2(3)     =   4.58
Log pseudolikelihood = -24173.211                    Prob > chi2      = 0.2051

                                     (Std. err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |               Robust
         mdu | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0112009   .0091493    -1.22   0.221    -.0291331    .0067314
        lfam |   .0877134   .1160837     0.76   0.450    -.1398064    .3152332
       child |   .1059867   .0786326     1.35   0.178    -.0481304    .2601037
------------------------------------------------------------------------------


Only the coefficients of time-varying regressors are identified, similar to
other FE model estimators. The Poisson FE estimator requires that there be at
least two periods of data, leading to a loss of 265 observations, and that the
count for an individual be nonzero in at least one period ($\sum_{t=1}^{T} y_{it} > 0$),
leading to a loss of 666 individuals because mdu equals 0 in all periods for
666 people. The cluster-robust standard errors are roughly two times those
of the default standard errors; see the end-of-chapter exercises. In theory, the
individual effects, $\alpha_i$, could account for overdispersion, but as is common,
they do not completely do so. The standard errors are also roughly twice as
large as the PA and RE standard errors, reflecting a loss of precision due to
using only within variation.


For the FE model, results should be interpreted on the basis of
$E(y_{it}|\alpha_i, \mathbf{x}_{it}) = \alpha_i \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$. The predict command with the nu0 option
gives predictions when $\gamma_i = 0$ so $\alpha_i = 1$, and the margins, dydx()
command with the predict(nu0) option gives the corresponding MEs. If we
do not want to consider only the case of $\alpha_i = 1$, then the model implies that
$\partial E(y_{it}|\alpha_i, \mathbf{x}_{it}) / \partial x_{j,it} = \beta_j \times E(y_{it}|\alpha_i, \mathbf{x}_{it})$, so $\beta_j$ can still be interpreted
as a semielasticity.
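To illustrate the semielasticity reading, the sketch below refits the FE model with only the time-varying regressors and converts one coefficient into a proportionate effect. The label child_effect is an illustrative name, and the interpretation as a 0-to-1 change assumes that child is a binary indicator.

```stata
* Sketch: semielasticity reading of an FE Poisson coefficient
quietly xtpoisson mdu age lfam child, fe vce(robust)
* Proportionate change in the conditional mean when child goes from 0 to 1
* (child_effect is an illustrative label; child is assumed to be binary)
nlcom (child_effect: exp(_b[child]) - 1)
* The irr option reports exp(b) for every regressor directly
xtpoisson mdu age lfam child, fe vce(robust) irr nolog
```

With the coefficient of 0.1060 reported above, exp(0.1060) - 1 is about 0.11, so the conditional mean is roughly 11% higher whatever the value of the individual effect, though the estimate is imprecise here.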


Given the estimating equations (22.15), the Poisson FE
estimator can be applied to any model with multiplicative effects and an
exponential conditional mean, essentially whenever the dependent variable
has a positive conditional mean. Then the Poisson FE estimator uses the
quasi-difference, $y_{it} - (\lambda_{it}/\bar{\lambda}_i)\bar{y}_i$, whereas the linear model uses the mean
difference, $y_{it} - \bar{y}_i$.


In the linear model, one can instead use the first difference, $y_{it} - y_{i,t-1}$,
to eliminate the fixed effects, and this has the additional advantage of
enabling estimation of FE dynamic linear models using the Arellano-Bond
estimator. Similarly, here one can instead use the alternative quasi-difference,
$(\lambda_{i,t-1}/\lambda_{it})y_{it} - y_{i,t-1}$, to eliminate the fixed effects and use
this as the basis for estimation of dynamic panel count models; for details
see, for example, Cameron and Trivedi (2013, chap. 9).


The correlated RE model, presented in section 22.2.4, provides an 
alternative way to control for FE. Estimates for this model are presented 
below. 


22.6.7 Panel Poisson estimators comparison 


We compute several panel Poisson estimators, with panel-robust VCE used 
for all estimators. 


. * Compute various Poisson panel estimators
. global xlist age lfam child lcoins ndisease female
. qui xtpoisson mdu $xlist, pa corr(unstr) vce(robust)
. estimates store PPA_ROB
. qui xtpoisson mdu $xlist, re vce(robust)
. estimates store PRE
. qui xtpoisson mdu $xlist, re normal vce(robust)
. estimates store PRE_NORM
. qui xtpoisson mdu $xlist, fe vce(robust)
. estimates store PFE
. qui xtpoisson mdu $xlist aveage avelfam avechild, re normal vce(robust)
. estimates store CRE


The estimates are 


. * Report various Poisson panel estimators - results
. estimates table PPA_ROB PRE PRE_NORM PFE CRE, equations(1) b(%8.4f)
> se stats(N ll) stfmt(%8.0f)

---------------------------------------------------------------------
    Variable |    PPA_ROB        PRE   PRE_NORM        PFE        CRE
-------------+-------------------------------------------------------
#1           |
         age |     0.0031     0.0019     0.0027    -0.0112    -0.0112
             |     0.0015     0.0017     0.0017     0.0091     0.0092
        lfam |    -0.1407    -0.1352    -0.1443     0.0877     0.0872
             |     0.0294     0.0361     0.0359     0.1161     0.1158
       child |     0.1014     0.1083     0.0737     0.1060     0.1059
             |     0.0430     0.0534     0.0534     0.0786     0.0786
      lcoins |    -0.0804    -0.0878    -0.1145               -0.1152
             |     0.0078     0.0079     0.0072                0.0072
    ndisease |     0.0346     0.0388     0.0409                0.0405
             |     0.0024     0.0024     0.0023                0.0023
      female |     0.1585     0.1667     0.2084                0.2054
             |     0.0334     0.0346     0.0310                0.0310
      aveage |                                                 0.0134
             |                                                 0.0093
     avelfam |                                                -0.2848
             |                                                 0.1202
    avechild |                                                -0.0545
             |                                                 0.0958
       _cons |     0.7765     0.7574     0.2873                0.3807
             |     0.0717     0.0831     0.0829                0.0778
-------------+-------------------------------------------------------
    /lnalpha |                0.0251
             |                0.0905
    /lnsig2u |                           0.0550                0.0550
             |                           0.0271                0.0270
-------------+-------------------------------------------------------
Statistics   |
           N |      20186      20186      20186      17791      20186
          ll |                -43241     -43227     -24173     -43210
---------------------------------------------------------------------
                                                         Legend: b/se


The PA and RE parameter estimates are quite similar; the alternative RE 
estimates based on normally distributed random effects are roughly 
comparable, whereas the FE and CRE estimates for the time-varying 
regressors are quite different. 


22.6.8 Panel Poisson prediction and MEs


The predict postestimation command has several options that vary
depending on whether the xtpoisson command was used with the pa, re, or
fe option.


After the xtpoisson, pa command, the default predict option is mu,
which gives the predicted conditional mean. After the xtpoisson, re
command, the default prediction option xb gives $\mathbf{x}_{it}'\boldsymbol{\beta}$; option
expression(exp(predict(xb))) used with margins gives $\exp(\mathbf{x}_{it}'\boldsymbol{\beta})$. After
the xtpoisson, re normal command, the default option n estimates
$E(y_{it}|\mathbf{x}_{it}) = \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$, which requires numerically integrating out $\alpha_i$.
After the xtpoisson, fe command, the default prediction option xb gives
$\mathbf{x}_{it}'\boldsymbol{\beta}$, while the nu0 option gives $\exp(\mathbf{x}_{it}'\boldsymbol{\beta})$ by setting $\alpha_i = 1$ or
equivalently $\gamma_i = \ln \alpha_i = 0$.
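As a brief sketch of how these options are used in practice, the following obtains predictions after the PA and FE fits and compares their sample averages with that of the outcome. The variable names mu_pa and nu0_fe are illustrative and are assumed not to exist already in the dataset.

```stata
* Sketch: predictions after PA and FE fits (mu_pa and nu0_fe are illustrative names)
global xlist age lfam child lcoins ndisease female
quietly xtpoisson mdu $xlist, pa corr(unstr) vce(robust)
predict double mu_pa, mu
quietly xtpoisson mdu $xlist, fe vce(robust)
predict double nu0_fe, nu0
* Compare sample means of the outcome and the two predictions
summarize mdu mu_pa nu0_fe
```

Because the FE fit has no intercept and nu0 sets the individual effect to 1, the level of nu0_fe need not be close to the sample mean of mdu, whereas the PA prediction is on the scale of the data.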


The default options for the margins command can differ in some cases 
from those for the predict command. The following commands explicitly 
state what option is being used in obtaining the AMEs for various panel 
Poisson models. 


. * Compute AMEs for PA, RE, FE, and CRE models
. qui xtpoisson mdu $xlist, pa corr(unstr) vce(robust)
. qui margins, dydx(*) predict(mu)
. matrix PA = r(b)
. qui xtpoisson mdu $xlist, re vce(robust)
. qui margins, dydx(*) expression(exp(predict(xb)))
. matrix RE = r(b)
. qui xtpoisson mdu $xlist, re normal vce(robust)
. qui margins, dydx(*) predict(n)
. matrix REnorm = r(b)
. qui xtpoisson mdu $xlist, fe vce(robust)
. qui margins, dydx(*) predict(nu0)
. matrix FE_O = r(b)
. qui xtpoisson mdu $xlist aveage avelfam avechild, re normal vce(robust)
. qui margins, dydx(*) predict(n)
. matrix CREall = r(b)
. matrix CRE = CREall[1..1, 1..6]
. qui margins, dydx(*) predict(nu0)
. matrix CRE_Oall = r(b)
. matrix CRE_O = CRE_Oall[1..1, 1..6]


The AMEs are as follows. 


. * Report AMEs for PA, RE, FE, and CRE models
. matrix rowjoinbyname ME = PA RE REnorm FE_O CRE CRE_O
. matrix rownames ME = PA RE REnorm FE_O CRE CRE_O
. matrix list ME, format(%8.4f)

ME[6,6]
             age      lfam     child    lcoins  ndisease    female
    PA    0.0089   -0.4067    0.2931   -0.2326    0.1001    0.4584
    RE    0.0056   -0.3924    0.3143   -0.2549    0.1125    0.4839
REnorm    0.0082   -0.4442    0.2269   -0.3525    0.1258    0.6415
  FE_O   -0.0100    0.0784    0.0948         .         .         .
   CRE   -0.0346    0.2692    0.3267   -0.3556    0.1250    0.6338
 CRE_O   -0.0204    0.1587    0.1926   -0.2097    0.0737    0.3737


The AMEs for PA and RE estimators are roughly comparable. The AMEs for the
FE estimator are quite different. The CRE AMEs have the same sign as the FE
AMEs but differ quite substantially in magnitude. The second set of CRE AMEs
still differ from the FE AMEs even though both evaluate at $\gamma_i = \ln \alpha_i = 0$.


22.6.9 Negative binomial estimators 


The preceding analysis for the Poisson can be replicated for the negative
binomial. The negative binomial has the attraction that, unlike Poisson, the
estimator is designed to explicitly handle overdispersion, and count data are
usually overdispersed. This may lead to improved efficiency in estimation
and a default estimate of the VCE that should be much closer to the
cluster-robust estimate of the VCE, unlike for Poisson panel commands. At the same
time, the Poisson panel estimators rely on weaker distributional assumptions
(essentially, correct specification of the mean), and it may be more robust
to use the Poisson panel estimators with cluster-robust standard errors.


For the pooled negative binomial, the issues are similar to those for
pooled Poisson. For the pooled negative binomial, we use the nbreg
command with the vce(cluster id) option. For the PA negative binomial,
we can use the xtnbreg command with the pa and vce(robust) options.


For the panel negative binomial RE, we use xtnbreg with the re option.
The negative binomial RE model introduces two parameters in addition to $\boldsymbol{\beta}$
that accommodate both overdispersion and within correlation. An estimator
called the negative binomial FE estimator is no longer viewed as being a true
FE estimator because, for example, it is possible to estimate the coefficients
of time-invariant regressors in addition to time-varying regressors. A more
complete presentation is given in, for example, Cameron and Trivedi (2013,
chap. 9) and in [XT] xtnbreg.


We apply the Poisson PA and negative binomial PA and RE estimators to 
the doctor-visits data. We have 


. * Comparison of negative binomial panel estimators
. qui xtpoisson mdu lcoins ndisease female age lfam child, pa
> corr(exch) vce(robust)
. estimates store PPA_ROB
. qui xtnbreg mdu lcoins ndisease female age lfam child, pa
> corr(exch) vce(robust)
. estimates store NBPA_ROB
. qui xtnbreg mdu lcoins ndisease female age lfam child, re
. estimates store NBRE
. estimates table PPA_ROB NBPA_ROB NBRE, eq(1) b(%8.4f) se
> stats(N ll) stfmt(%8.0f)

-----------------------------------------------
    Variable |    PPA_ROB   NBPA_ROB       NBRE
-------------+---------------------------------
#1           |
      lcoins |    -0.0815    -0.0865    -0.1073
             |     0.0079     0.0078     0.0062
    ndisease |     0.0347     0.0376     0.0334
             |     0.0024     0.0023     0.0020
      female |     0.1609     0.1649     0.2039
             |     0.0338     0.0343     0.0263
         age |     0.0032     0.0026     0.0023
             |     0.0016     0.0016     0.0012
        lfam |    -0.1487    -0.1633    -0.1434
             |     0.0299     0.0291     0.0251
       child |     0.1121     0.1154     0.1145
             |     0.0444     0.0452     0.0385
       _cons |     0.7755     0.7809     0.8821
             |     0.0724     0.0730     0.0663
-------------+---------------------------------
       /ln_r |                           1.1280
             |                           0.0269
       /ln_s |                           0.7259
             |                           0.0313
-------------+---------------------------------
Statistics   |
           N |      20186      20186      20186
          ll |                           -40661
-----------------------------------------------
                                   Legend: b/se


The Poisson and negative binomial PA estimates and their standard errors are
similar. The RE estimates differ more and are closer to the Poisson RE
estimates given in section 22.6.5.


22.7 Panel quantile regression 


We covered linear conditional QR analysis of cross-sectional data in
chapter 15. Pooled conditional QR is straightforward. Extension to panel data
with individual-specific effects remains in developmental stages. A standard
method for controlling for unobserved individual-specific heterogeneity is by
introducing an additive random effect with known distribution and then
integrating this out. This approach will not work, because QR is a
distribution-free estimator. Instead, we present FE estimation.


22.7.1 Pooled QR estimator 


Pooling or population averaging introduces clustering as an additional new
element to be accounted for in obtaining robust variance estimates.
Section 15.3.6 presented the community-contributed qreg2 command, a
wrapper for qreg that generates standard errors and t statistics that are
heteroskedastic-robust (the default) or cluster-robust (option cluster()).
Another recently proposed method of robust variance estimation is the wild
gradient bootstrap by Hagemann (2017), but we are not aware of any
supporting Stata software.


We begin with the standard QR analysis, treating the sample as a pooled
sample of cross-sectional observations and obtaining cluster-robust standard
errors. Application is to the panel dataset on log wage used in chapter 8 with
N = 595 and T = 7. For all calculations, we chose q = .25 and q = .75.


. * Pooled data quantile regression ignoring individual effects 
. qui use mus208psid, clear 


. qui qreg2 lwage exp exp2 wks ed, quant(.25) cluster(id) 
. estimates store Pool25 
. qui qreg2 lwage exp exp2 wks ed, quant(.75) cluster(id) 


. estimates store Pool75 


The estimates will be compared below with FE estimates. From output not
given, the cluster-robust standard errors are 50% to 100% larger than
heteroskedastic-robust standard errors.


22.7.2 Panel QR with fixed effects 


Several different approaches to conditional QR with FE models have been
proposed; Canay (2011) provides a brief summary and discusses the
limitations of these approaches. He proposes a simple estimator in the
special case that $T \to \infty$, in addition to our standard assumption that
$N \to \infty$. While requiring $T \to \infty$ is a limitation, he presents Monte Carlo
evidence that the method works well for T = 20 and possibly lower.


The method assumes a linear regression with an additively separable
individual effect ($\alpha_i$) and an additive independent and identically distributed
random error $\varepsilon_{it}$. We estimate the linear regression of $y_{it}$ on vector $\mathbf{x}_{it}$ and
derive the regression residuals, which are treated as an estimate of $(\alpha_i + \varepsilon_{it})$.
The individual-specific average of the residuals is a consistent estimate of
the fixed effects, denoted $\widehat{\alpha}_i$, provided $T \to \infty$. This average is subtracted
from the dependent variable, and the QR analysis is applied to the
transformed variable.


The two-step estimation method is simple to implement. More difficult is 
the computation of standard errors because the sandwich term in the 
variance matrix is the sum of four terms. Canay (2011) conjectures that one 
can apply the bootstrap pairs method to the two-step estimator (see 
section 12.4.5) and provides supporting Monte Carlo evidence. 


For illustration, we apply the method to the log wage data, even though
T = 7 is quite small. We first create the transformed dependent variable.


. * FE QR: (1) Filter out the individual FE from the y variable
. qui regress lwage exp exp2 wks ed
. predict residuals_i, resid
. sort id
. by id: egen alphahat_i=mean(residuals_i)
. summarize alphahat_i

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
  alphahat_i |      4,165    1.94e-10    .3257519  -.9955915   .9009727

. gen lwagehat=lwage-alphahat_i


We then obtain the second-stage estimates. The associated standard 
errors are obtained by a simpler bootstrap that bootstraps only the second 
stage of the estimation procedure. This does not adjust for the imprecision 
due to first-stage estimation of the fixed effects; the correct bootstrap should 
bootstrap both stages. 


. * FE QR: (2) QR of new y with bootstrap covariance estimates
. set seed 10101
. qui bsqreg lwagehat exp exp2 wks ed, quant(.25) reps(400)
. estimates store FE25boot
. qui bsqreg lwagehat exp exp2 wks ed, quant(.75) reps(400)
. estimates store FE75boot
. xtset id

Panel variable: id (balanced)

. qui bootstrap, reps(400) seed(10101) cluster(id):
> qreg lwagehat exp exp2 wks ed, quant(.25)
. estimates store FE25clu
. qui bootstrap, reps(400) seed(10101) cluster(id):
> qreg lwagehat exp exp2 wks ed, quant(.75)
. estimates store FE75clu
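The commands above resample only the second stage. A bootstrap of both stages, which the discussion identifies as the correct procedure, can be sketched by wrapping the two steps in a small program and resampling individuals. The program name canay2step and the stored name FE25both are hypothetical, and this is an illustrative outline under the assumptions of this example rather than Canay's own code.

```stata
* Sketch: cluster bootstrap of both stages of the two-step FE QR estimator
* (canay2step is a hypothetical program name, not a Stata command)
capture program drop canay2step
program canay2step, eclass
    version 17
    syntax [, Quantile(real 0.5)]
    tempvar res ahat ytil
    * Stage 1: OLS residuals averaged within individual estimate the effects
    quietly regress lwage exp exp2 wks ed
    quietly predict double `res', resid
    * Resampled copies of an individual share the same id and the same data,
    * so the within-id mean is unaffected by duplication
    quietly bysort id: egen double `ahat' = mean(`res')
    quietly generate double `ytil' = lwage - `ahat'
    * Stage 2: QR on the transformed dependent variable
    qreg `ytil' exp exp2 wks ed, quantile(`quantile')
end
* Resample individuals and rerun both stages in every replication
bootstrap _b, reps(400) seed(10101) cluster(id): canay2step, quantile(.25)
estimates store FE25both
```

The point estimates should match those from the second-stage-only commands; only the standard errors are expected to change, because each replication now reestimates the individual effects.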


Finally, all results are collected and tabulated for comparison. 


. * Compare OLS, FE, Pooled_QR(0.25, 0.75), and Panel_QR(0.25, 0.75)
. estimates table Pool25 FE25boot FE25clu Pool75 FE75boot FE75clu, b(%8.4f) se


Variable Pool25 FE25boot FE25clu Pool75 FE75boot FE75clu 
exp 0.0488 0.0498 0.0498 0.0406 0.0551 0.0551 
0.0067 0.0016 0.0015 0.0059 0.0018 0.0019 

exp2 -0.0008 -0.0008 -0.0008 -0.0006 -0.0008 -0.0008 
0.0002 0.0000 0.0000 0.0001 0.0000 0.0000 

wks 0.0049 0.0031 0.0031 0.0068 0.0027 0.0027 
0.0023 0.0009 0.0009 0.0032 0.0008 0.0009 

ed 0.0822 0.0796 0.0796 0.0737 0.0807 0.0807 
0.0055 0.0017 0.0015 0.0063 0.0016 0.0015 

_cons 4.5858 4.7977 4.7977 5.1834 4.9962 4.9962 
0.1521 0.0578 0.0550 0.1932 0.0483 0.0488 


Legend: b/se 


Broadly summarized, the results indicate that FE adjustment did not change the
pooled estimation results by much, aside from the coefficient of ed. The
standard errors from panel FE estimations are considerably smaller than those
from pooled estimation. A similar large reduction in the standard errors was
found in section 8.8.3 for linear regression using the same data, so this is not
solely due to our failure here to allow for first-stage estimation of the fixed
effects. The clustered and nonclustered bootstraps yielded similar standard
errors. Panel conditional QR models with fixed effects cause a problem
because inclusion of individual fixed effects changes the interpretation of the
treatment-effect coefficient. The community-contributed qregpd command,
developed by Baker (2016), is one option.


22.8 Endogenous regressors in nonlinear panel models 


The combination of nonlinearity and endogenous regressors in panel 
models poses a difficult specification and estimation problem for fully 
parametric models. Thus, methods typically place some restrictions on the 
model. For example, the model structure may be restricted to permit only 
recursive dependence (see section 7.2.3); only RE models might be 
considered; only restricted forms of unobserved heterogeneity may be 
admitted; and so forth. 


RE models that allow for endogeneity and sample selection can be
estimated using the extended regression model commands xteregress,
xteprobit, xteoprobit, and xteintreg; see section 23.7. This approach is
restricted to recursive models with normal errors.


Semiparametric methods are easier to use. Specifically, in
section 20.7.3, we presented IV methods for handling endogenous regressors
in a count-data model with cross-sectional data. The same approach can be 
used also for panel data. That is, we treat panels as pooled cross-sections 
and apply nonlinear generalized method of moments (GMM) (or instrumental 
variables) based on a moment condition. Given valid instruments, the 
estimator is consistent, but such a solution is generally not efficient, 
because it often ignores salient features of the data. The [R] gmm manual 
entry provides code for Poisson panel regression with endogenous 
regressors; see also Cameron and Trivedi (2013, chap. 9). This nonlinear 
GMM approach can be applied to other cases, such as panel logit or probit 
with endogenous regressors. 
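The pooled GMM idea can be sketched with the ivpoisson gmm command on simulated data. Every variable below (id, a, z, v, x, y) and every parameter value is invented here for illustration and has no connection to the datasets used in this chapter; the multiplicative option is chosen because the simulated unobservables enter the conditional mean multiplicatively.

```stata
* Sketch: pooled GMM for a count outcome with an endogenous regressor
* All variables are simulated for illustration; none come from the text
clear
set seed 10101
set obs 1000
generate id = _n
generate double a = rnormal(0, 0.5)          // individual effect
expand 3
generate double z = rnormal()                // instrument
generate double v = rnormal()                // source of endogeneity
generate double x = 0.5*z + v + rnormal()    // endogenous regressor
generate y = rpoisson(exp(0.5 + 0.5*x + 0.3*v + a))
* Multiplicative-error moment condition with panel-robust standard errors
ivpoisson gmm y (x = z), multiplicative vce(cluster id)
```

The estimator treats the panel as pooled cross-sections, so the individual effect is handled only through the cluster-robust standard errors, which is the source of the inefficiency noted above.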


Alternative model-specific methods accommodate complications such 
as the outcome variable is a count, possibly with excess zeros; an 
endogenous regressor may be binary or count with excess zeros; counted 
outcomes may be conditionally correlated or clustered; or unobserved 
heterogeneity is not explicitly acknowledged. Such data features are central 
to panel models, and hence ignoring them is unsatisfactory. Furthermore, it 
is particularly important to robustify standard error estimates. 


22.9 Additional resources 


The Stata panel commands cover the most commonly used panel methods, 
especially for short panels. This topic is exceptionally vast, and there are 
many other methods that provide less-used alternatives to the methods 
covered in Stata as well as methods to handle complications not covered in 
Stata, especially the joint occurrence of several complications such as a 
dynamic FE logit model. Many of these methods are covered in the panel- 
Lee (2002); see also Rabe-Hesketh and Skrondal (2022) for the mixed- 
model approach and Cameron and Trivedi (2013, chap. 9) for count-data 
models. Cameron and Trivedi (2005) and Wooldridge (2010) also cover 
some of these methods. 


22.10 Exercises 


1. Consider the panel logit estimation of section 22.4. Compare the
following three sets of estimated standard errors for the pooled logit
estimator: default, heteroskedasticity robust, and cluster-robust. How
important is it to control for heteroskedasticity and clustering? Show
that the pa option of the xtlogit command yields the same estimates
as the xtgee command with the family(binomial), link(logit), and
corr(exchangeable) options. Compare the PA estimators with the
corr(exchangeable), corr(ar2), and corr(unstructured) options,
in each case using the vce(robust) option.

2. Consider the panel logit estimation of section 22.4. Drop observations
with id > 125200. Estimate the parameters of the FE logit model by
using xtlogit as in section 22.4. Then, estimate the parameters of the
same model by using logit with DVs for each individual (so use
xi: logit with regressors, including i.id). This method is known to
give inconsistent parameter estimates in a short panel. Compare the
estimates with those from command xtlogit. Are the same parameters
identified?

3. For the parameters of the panel logit models in section 22.4, estimate
by using xtlogit with the pa, re, and fe options. Compute the
following predictions: for pa, use predict with the mu option; for re,
use predict with the pu0 option; for fe, use predict with the pu0
option. For these predictions and for the original dependent variable,
dmdu, compare the sample average value and the sample correlations.
Then, use the margins, dydx() command with these predict options,
and compare the resulting MEs for the lcoins variable.

4. Generate a panel dataset using the commands below. Explain why the
sample is generated by the probit RE model with normally distributed
errors. Obtain the probit RE estimates. Are the coefficient estimates
what you expect? Explain. Next, obtain the probit PA estimates. Do you
obtain similar parameter estimates? Explain. Next, obtain the predicted
probabilities from the two models, and OLS regress one prediction on
the other. Comment. Then, obtain the MEs at x1 = 1 for the two
models and compare. Comment. Provide a summary comparing the PA
model with the RE model. Finally, repeat the analysis but instead generate
e = (rchi2(4)-4)/sqrt(2*4). Comment on any major changes.


* Generate panel data sample for probit regression
clear all
set obs 10000
set seed 10101
generate id = _n
generate double alpha = rnormal(0,1)
generate double z1 = rnormal(0,1)
generate double z2 = z1 + rnormal(0,1)
expand 3
sort id
by id: gen t = _n
generate double x1 = z1 + rnormal(0,1)
generate double x2 = z2 + rnormal(0,1)
generate double xb = 1 + 1*x1 + 1*x2
generate double e = rnormal(0,1)
generate double u = e + alpha
generate double y = xb + u > 0


5. For the panel tobit model in section 22.5, compare the results from
xttobit with those from tobit. Which do you prefer? Why?

6. Consider the panel Poisson estimation of section 22.6. Compare the
following three sets of estimated standard errors for the pooled Poisson
estimator: default, heteroskedasticity robust, and cluster-robust. How
important is it to control for heteroskedasticity and clustering?
Compare the PA estimators with the corr(exchangeable), corr(ar2),
and corr(unstructured) options, in each case using both the default
estimate of the VCE and the vce(robust) option.

7. Consider the panel count estimation of section 22.6. To reduce
computation time, use the drop if id > 127209 command to use
10% of the original sample. Compare the standard errors obtained by
using default standard errors with those obtained by using the
vce(bootstrap) option for the following estimators: Poisson RE,
Poisson FE, negative binomial RE, and negative binomial FE. How
important is it to use panel-robust standard errors for these estimators?

8. Repeat the analysis of section 22.7.2 with the following changes: 1) set
q = 0.4; 2) use the default (not bootstrap) standard errors. Compare
the results for qreg, qreg2, and Canay's FE estimator.


Chapter 23 
Parametric models for heterogeneity and endogeneity 


23.1 Introduction 


Two topics—unobserved heterogeneity and endogeneity—have featured 
throughout the book, both individually and in combination. Their treatment 
was for the most part in the context of subject matter dealing with specific 
models, such as binary outcome, tobit, and event count models. 


In this chapter, we deal with these topics from a command-based 
perspective. The unifying features are the underlying assumptions and the 
computational methodology used to handle these complications, which are 
common across very wide classes of models. 


We cover the fmm prefix, various me commands, the sem and gsem 
commands, and the commands for extended regression models (ERMs). 
These commands implement methods that are generally fully parametric 
and in many cases are based on joint normally distributed model errors. 


The fmm (finite mixture model) prefix and me (mixed-effects) commands 
provide widely used methodologies for models with unobserved 
heterogeneity. The prefix fmm was introduced and applied in 
sections 14.2 and 20.5 and uses a discrete distribution for unobserved 
heterogeneity. The me commands are for various nonlinear models that use 
the continuous normal distribution to model unobserved heterogeneity; the 
related command mixed was presented for linear models in section 6.7. 


The SEM (structural equation model) commands cover a wide range of 
systems of equations that are linear in fully observed variables and latent 
variables. Examples include seemingly unrelated regressions (SUR), 
simultaneous equations, factor analysis, and linear mixed models. These 
models can include endogenous regressors. 


The commands for ERMs cover continuous, binary, ordered discrete, and 
interval outcomes. Options allow for endogeneity and binary sample 
selection or tobit sample selection, topics jointly considered in section 17.9. 
Additional options of the ERM commands implement some of the treatment- 
effects methods presented in chapters 24 and 25. 


In the remainder of this chapter, we shall deal sequentially with the 
aforementioned commands, giving relatively more space to applications and 
features of models that have not been covered earlier in the book. Given the 
similarity of the syntax for different models, and given also the similarity of 
the motivation underlying the methodology, we keep the discussion brief, 
especially in those cases where a more detailed coverage has been given 
previously for selected models, such as for linear regression and count 
regression. 


23.2 Finite mixtures and unobserved heterogeneity 


Unobserved heterogeneity is pervasive in microeconometric models. It calls for 
special attention in panel-data regressions, regression with within-cluster 
correlation, survival data models, and treatment evaluation. 


A common representation of unobserved heterogeneity is a random-effects 
model in which the model error includes a panel-specific or cluster-specific 
additive independent and identically distributed (i.i.d.) random variable that is 
uncorrelated with other random variables. Both parametric [for example, 
maximum likelihood (ML)] and semiparametric methods (for example, 
generalized least squares) for handling these cases have appeared in various 
sections of this book. 


A richer representation of unobserved heterogeneity postulates randomness 
in parameters. This representation is neither additive nor separable. The 
heterogeneity may be represented by either a continuous or a discrete 
distribution. The mixed, sem, gsem, and ERM commands usually characterize 
randomness by a continuous distribution, often the normal distribution; an 
exception is the lclass() option of the gsem command, which uses a discrete 
distribution. The fmm prefix for finite mixture models always characterizes 
parameter heterogeneity using a discrete distribution. 


23.2.1 Finite mixtures model 


A parametric finite mixture model (FMM) is constructed by taking a convex 
combination of a finite number of distributions (components) with different 
parameters. A mixture distribution does not belong to the same class as the 
component distribution; for example, a mixture of normals is not normal. In 
practice, one does not know either the mixing parameters or which component 
distribution any particular observation comes from. Hence, the components are 
often referred to as latent classes, and the FMM is also known as a latent class 
model. 


There are several motivations for using the FMM framework. It has many 
more parameters relative to the benchmark model with a single distribution and 
hence has greater flexibility. It can be interpreted as a way of generating good 
approximations to unknown distributions. In specific cases, it can support 
insightful interpretations of differences in subpopulations. 


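A parametric finite mixture regression model with C components is written 
as 

$$
f(y_i \mid \mathbf{x}_i, \mathbf{z}_i) = \sum_{j=1}^{C} \pi_{ij}\, f_j(y_i \mid \mathbf{x}_{ij}, \boldsymbol{\theta}_j), \qquad i = 1, \ldots, N; \; j = 1, \ldots, C \tag{23.1}
$$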


where for the jth component the usual regressors are denoted as $\mathbf{x}_{ij}$, $\boldsymbol{\theta}_j$ refers to 
the unknown j-class parameters, and the class probability $\pi_{ij}$ is a function of 
observable variables $\mathbf{z}_{ij}$ that determines the weight of the jth component. The 
number of mixture components, C, is usually unknown and is treated as an 
additional parameter. 


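The component probabilities should satisfy $0 < \pi_{ij} < 1$ and $\sum_{j=1}^{C} \pi_{ij} = 1$. 
To ensure that these constraints are satisfied, one should further parameterize 
the $\pi$ functions as a multinomial logit model, with 

$$
\pi_{ij}(\mathbf{z}_{ij}) = \frac{\exp(\mathbf{z}_{ij}'\boldsymbol{\gamma}_j)}{1 + \sum_{k=2}^{C} \exp(\mathbf{z}_{ik}'\boldsymbol{\gamma}_k)}
$$

where $\boldsymbol{\gamma}_1 = \mathbf{0}$ for the base class. This simplifies to a logit model in the 
common case that C = 2. Often, a simpler FMM with constant class probabilities 
is estimated; then $\pi_{ij}(\mathbf{z}_{ij}) = \pi_j$. 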
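
As a minimal numerical sketch of the multinomial logit parameterization, the following lines compute class probabilities for C = 3 when each index contains only a constant. The values g2 and g3 are hypothetical and chosen purely for illustration; class 1 is treated as the base class with its index normalized to zero, which is an assumption that matches the "(base outcome)" labeling in fmm output.

```stata
* Sketch: multinomial logit class probabilities, C = 3, constant-only indexes
* g2 and g3 are hypothetical values, not estimates from this chapter
scalar g2 = -0.5
scalar g3 = -1.2
scalar denom = 1 + exp(g2) + exp(g3)
scalar p1 = 1/denom
scalar p2 = exp(g2)/denom
scalar p3 = exp(g3)/denom
* Each probability lies strictly between 0 and 1
display "p1 = " p1 "   p2 = " p2 "   p3 = " p3
* The probabilities sum to 1 by construction
display "sum = " p1 + p2 + p3
* With C = 2, the formula reduces to the logit function; the difference is zero
display exp(g2)/(1 + exp(g2)) - invlogit(g2)
```

Whatever real values the indexes take, the exponential transformation keeps every probability positive, and the common denominator forces the probabilities to sum to one.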


In this specification, variation in $\mathbf{x}_{ij}$ has an impact on the outcome that 
varies depending upon the latent class. Moreover, component-specific 
restrictions can be imposed. For example, exclusion restrictions on either or 
both x variables or z variables may vary across the C components. 


FMMs account for between-class heterogeneity in responses. A more general 
model additionally allows for unobserved heterogeneity within each component. 
Then the distribution of the component-specific outcome $y_i$ depends both on 
exogenous variables $\mathbf{x}_{ij}$ and on some individual-specific continuously 
distributed unobserved factor, say, $\varepsilon_i$. For example, we may specify 

$$
f(y_i \mid \mathbf{x}_i, \mathbf{z}_i, \varepsilon_i) = \sum_{j=1}^{C} \pi_{ij}\, f_j(y_i \mid \mathbf{x}_{ij}, \varepsilon_i, \boldsymbol{\theta}_j)
$$


In this richer specification, there are now two types of unobserved 
heterogeneity, one discrete and the other continuous. The standard assumption is 
that the two types are independent of each other and independent of the 
regressors; hence, the model is a random-effect-type specification. An example 
is the mixture of negative binomials studied in section 20.5.5; formally, this is a 
discrete two-component mixture of negative binomials, each of which can be 
interpreted as a (continuous) Poisson-gamma mixture. 


Adding an additional component can considerably increase the number of 
parameters estimated. Models with a number of components that lead to the 
lowest value of the Akaike or Bayes information criterion (AIC and BIC), defined 
in section 13.8.2, are preferred. BIC has a bigger penalty for additional 
parameters. These criteria are used rather than a formal statistical test of 
whether to add an additional component. 


23.2.2 The fmm prefix 


The syntax of the fmm prefix is detailed in section 14.2.5. 


For convenience, we repeat the generic syntax, 
fmm #: model depvar [indepvars] [, options] 


where # refers to the number of components in the specification and model is an 
estimation command that will vary with the specified distribution. 


The syntax for the more general hybrid version of the model that allows 
parameterization of the latent class probabilities is as follows, 


fmm #: model depvar [indepvars] [, lcprob(varlist) options] 


where varlist refers to the variables that are used to parameterize the latent class 
probabilities. 


An important option is the vce() option. FMMs are estimated by ML under 
the assumption of completely correct model specification. For some of these 
models, estimators are inconsistent if the i.i.d. assumption fails, even if other 
aspects of the model are correctly specified. We nonetheless report robust 
standard errors in such cases because these still give a better estimate of 
estimator precision; see section 13.3.1. Because FMMs are more flexible models 
and incorporate heteroskedasticity, we expect them to reduce any difference 
between default and heteroskedastic-robust standard errors. 


A detailed FMM application for the linear regression model under normality 
is presented in section 14.3, and a count application is given in section 20.5. The 
command can be applied to many other commonly used parametric model 
commands, such as fmm: logit for a logit model. The command covers 
censored, truncated, and interval linear regression (tobit, truncreg, intreg); 
instrumental variables (ivregress); binary outcomes (logit, probit, cloglog); 
multinomial outcomes (mlogit, ologit, oprobit); counts (poisson, nbreg, 
tpoisson); durations (streg); beta regression (betareg); and generalized linear 
models (glm). 


There is a unified and robust computational algorithm for fitting these 
models by an ML algorithm that is a two-part hybrid consisting of the 
expectation-maximization algorithm in the first part and a gradient-based 
Newton-Raphson-type algorithm in the second. Convergence of the algorithm 
is not guaranteed in general and is likely to be case dependent. Models with 
more components, and hence more parameters, are more challenging. Models 
with many components, including one or more with a small subpopulation, will 
be harder to identify. 


23.3 Empirical examples of FMMs 


The use of the finite mixture framework helps modeling and interpretation of 
heterogeneous responses in several ways. First, we can test whether the 
latent class framework leads to an improvement in the fit of the model. 
Second, we can make comparisons across latent classes between the 
outcomes’ responses, as reflected in coefficients, marginal effects (ME), and 
so forth, resulting from changes in the regressors. Third, if the latent classes 
can be identified by their observable characteristics, then interpretation of 
the results potentially can be more informative in the sense that each latent 
class refers to a specific subpopulation. 


We present a series of nonlinear regression examples that may be viewed 
as extensions of the FMM examples already used in sections 14.2-14.3 
and 20.5. The first example is presented in greater detail than the remaining 
examples. 


23.3.1 Example 1: Gamma regression for expenditures 


The distribution of medical expenditures is well known to be thick tailed and 
right skewed and to have significant nonnormal kurtosis, features that pose 
difficulties in both modeling and forecasting. One well-established empirical 
approach is to apply a log transformation to positive expenditures and then 
assume a normal distribution. Another approach is to not apply the 
transformation and instead assume that the conditional distribution is 
gamma. By better accounting for large expenditures, the gamma model may 
provide a better fit to the data than the log-normal, and the gamma model 
has the advantage of directly modeling the level of expenditure. 


We use the sample data on healthcare expenditures of the Medicare 
elderly first used in section 3.2 and estimate a two-component finite mixture 
of gamma regression models for the level of medical expenditures. Gamma 
regression requires positive values for the dependent variable, so we drop the 
3% of the sample with 0 expenditures. 


. * Gamma: Read chapter 3 example data and truncate expenditure at zero 
. qui use mus203mepsmedexp 


. keep if totexp > 0 
(109 observations deleted) 


. describe totexp totchr age female educyr private hvgg 


Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
totexp          double  %12.0g                Total medical expenditure
totchr          double  %12.0g                # of chronic problems
age             double  %12.0g                Age
female          double  %12.0g     female     Female
educyr          double  %12.0g                Years of education
private         double  %12.0g     private    Private supplementary insurance
hvgg            float   %9.0g      hvgg       Health status is excellent, good, or
                                                very good


The gamma distribution is a member of the generalized linear model 
(GLM) family (see section 13.3.7) and can be estimated using the glm 
command with options family(gamma) and link(log); the latter option 
specifies an exponential conditional mean function. We obtain 


. * Gamma: Standard gamma regression (one component) as benchmark 
. glm totexp totchr age female educyr private hvgg, family(gamma) link(log) 
> vce(robust) nolog 


Generalized linear models                         Number of obs   =      2,955
Optimization     : ML                             Residual df     =      2,948
                                                  Scale parameter =   2.660609
Deviance         =  4236.116216                   (1/df) Deviance =   1.436946
Pearson          =  7843.476403                   (1/df) Pearson  =   2.660609

Variance function: V(u) = u^2                     [Gamma]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   19.55801
Log pseudolikelihood = -28889.96311               BIC             =   -19322.1

             |               Robust
      totexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
      totchr |    .3418083   .0223432    15.30   0.000      .2980165    .3856001
         age |    .0056696   .0049044     1.16   0.248     -.0039428     .015282
      female |   -.1124211   .0602113    -1.87   0.062     -.2304332    .0055909
      educyr |    .0272675   .0099605     2.74   0.006      .0077452    .0467897
     private |    .1062875   .0638005     1.67   0.096     -.0187591    .2313342
        hvgg |   -.3319719   .0643275    -5.16   0.000     -.4580515   -.2058923
       _cons |    7.617334   .3996511    19.06   0.000      6.834033    8.400636


Two-component model parameter estimates 


The two-component gamma mixture model estimates are 


. * Gamma: Two-component finite mixture gamma regression 

. fmm 2, nolog vce(robust): glm totexp totchr age female educyr private hvgg, 
> family(gamma) link(log) 

Finite mixture model                                    Number of obs = 2,955
Log pseudolikelihood = -28554.52

             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
1.Class      |  (base outcome)
-------------+------------------------------------------------------------------
2.Class      |
       _cons |   -1.105388   .1766837    -6.26   0.000     -1.451682   -.7590941

Class: 1
Response: totexp
Model: glm, family(gamma)

             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
totexp       |
      totchr |    .4209618   .0212731    19.79   0.000      .3792672    .4626564
         age |    .0212328   .0041791     5.08   0.000      .0130418    .0294237
      female |   -.0531238   .0518319    -1.02   0.305     -.1547125    .0484649
      educyr |    .0399077   .0080321     4.97   0.000      .0241651    .0556503
     private |    .2649818   .0561469     4.72   0.000       .154936    .3750277
        hvgg |   -.0969274   .0570488    -1.70   0.089     -.2087409    .0148862
       _cons |    5.229722    .347071    15.07   0.000      4.549475    5.909968
-------------+------------------------------------------------------------------
/totexp      |
        logs |   -.1713974   .0221162                     -.2147444   -.1280504

Class: 2
Response: totexp
Model: glm, family(gamma)

             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
totexp       |
      totchr |    .2683625   .0415283     6.46   0.000      .1869684    .3497565
         age |   -.0077785   .0081555    -0.95   0.340      -.023763     .008206
      female |   -.1488827   .0904229    -1.65   0.100     -.3261083     .028343
      educyr |    .0136653   .0162006     0.84   0.399     -.0180873    .0454179
     private |    .0684348   .0970268     0.71   0.481     -.1217343    .2586039
        hvgg |   -.3889646   .0983737    -3.95   0.000     -.5817735   -.1961556
       _cons |    9.873871   .7002255    14.10   0.000      8.501454    11.24629
-------------+------------------------------------------------------------------
/totexp      |
        logs |    -.019757   .0294952                     -.0775667    .0380526


Adding a second component greatly improved the fit of the model: the log 
likelihood increased from -28,890 to -28,555, so the BIC, obtained 
using the estat ic postestimation command, dropped very substantially 
from 57,836 to 57,245. 
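
A sketch of how this comparison might be produced: refit each model quietly and call estat ic after each fit. We assume that the data in memory are those from mus203mepsmedexp restricted to totexp > 0, as in the listing above.

```stata
* Sketch: information criteria for the one- and two-component gamma models
* Assumes mus203mepsmedexp is in memory with zero expenditures dropped
quietly glm totexp totchr age female educyr private hvgg, ///
    family(gamma) link(log) vce(robust)
estat ic

quietly fmm 2, vce(robust): glm totexp totchr age female educyr private hvgg, ///
    family(gamma) link(log)
estat ic
```

The BIC column of each estat ic table is the figure quoted in the text. Because fmm 2 is fit last, it remains the active estimation result for the postestimation commands that follow.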


The output displays parameter estimates for the two latent classes. These 
estimates differ considerably, which indicates heterogeneity in responses 
between the two groups. For example, the ratio of coefficients of totchr in 
the two components is about 1.57. 


Mean probabilities and mean expenditures in each class 


The estat lcprob postestimation command generates the estimated 
population proportions of the two latent classes. These are approximately 
0.75 and 0.25, respectively. The estat lcmean command generates the class 
mean expenditures. 


. * Gamma: Latent class marginal probabilities and latent class distribution means 
. estat lcprob 


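Latent class marginal probabilities                     Number of obs = 2,955

             |            Delta-method
             |      Margin   std. err.     [95% conf. interval]
-------------+-------------------------------------------------
       Class |
           1 |    .7512683   .0330159       .681157    .8102571
           2 |    .2487317   .0330159      .1897429     .318843

. estat lcmean

Latent class marginal means                             Number of obs = 2,955

             |            Delta-method
             |      Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
1            |
      totexp |    4033.326   204.0074    19.77   0.000      3633.479    4433.174
-------------+------------------------------------------------------------------
2            |
      totexp |    17254.26   1383.756    12.47   0.000      14542.15    19966.37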


The larger class has a lower mean medical expenditure than the smaller 
class, about $4,033 versus $17,254. So about 75% of the observations are 
drawn from a low-mean distribution and the rest come from a high-mean 
distribution. 
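
The class proportions can also be recovered by hand from the 2.Class constant in the fmm output. With constant class probabilities and two classes, the multinomial logit reduces to a logit, so the class 2 proportion is the inverse logit of that constant. A minimal check, using the estimate -1.105388 reported above:

```stata
* Class 2 proportion implied by the 2.Class constant reported above
display invlogit(-1.105388)
* Class 1 is the base outcome, so its proportion is the complement
display 1 - invlogit(-1.105388)
```

These should agree, up to rounding, with the margins of about 0.25 and 0.75 reported by estat lcprob.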


MEs 


Applying the default form of the margins, dydx() command provides a 
weighted average of the latent class-specific average marginal effect (AME), 
using the class probability as weight. 


. * Gamma: ME of change in totchr in one- and two-component models 
. margins, dydx(totchr) noatlegend // AME in two-component model 


Average marginal effects                                Number of obs = 2,955
Model VCE: Robust

Expression: Predicted mean (Total medical expenditure), using class
            probabilities, predict(mu outcome(totexp))
dy/dx wrt:  totchr

             |            Delta-method
             |       dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
      totchr |    2427.287   206.3907    11.76   0.000      2022.769    2831.805


. qui glm totexp totchr age female educyr private hvgg, family(gamma)
> link(log) vce(robust)


. margins, dydx(totchr) noatlegend // AME in one-component model


Average marginal effects                                Number of obs = 2,955
Model VCE: Robust

Expression: Predicted mean totexp, predict()
dy/dx wrt:  totchr

             |            Delta-method
             |       dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
      totchr |    2530.212   197.6586    12.80   0.000      2142.809    2917.616


The AME of an additional chronic condition (totchr) is $2,427 in the two- 
component model, similar to $2,530 from a standard (one-component) 
gamma regression. 


This example illustrates that if the target parameter is the weighted 
average of AMEs, then a one-component model could be sufficient. This 
conclusion was also reached in the case of finite mixture of linear regression; 
see sections 14.3.5 and 14.3.9. On the other hand, if the objective is to study 
in detail the differences between the two latent subpopulations, then the FMM 
framework is informative. 


Fitted means for each observation in each latent class 


The predict command generates for each individual the predicted mean of 
expenditures within each latent class. The average of these predicted means 
equals the mean expenditure generated by the preceding estat lcmean 
command. 


We label the individual predictions mu1 and mu2. By generating 
histograms of mu1 and mu2, we can check the overlap between the two 
distributions. A small overlap means that the distributions are well separated 
and identification of latent classes is more robust. 


. * Gamma: Generate and graph latent class predicted expenditures 
. qui fmm 2, vce(robust): glm totexp totchr age female educyr private hvgg, 
> family(gamma) link(log) 


. estimates store FMM2 


. predict mu* 
(option mu assumed) 


. summarize mu1 mu2

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mu1 |      2,955    4033.326    2977.282   710.4969   29270.69
         mu2 |      2,955    17254.26    8389.956   6578.288   91909.41


. qui histogram mu1


. qui histogram mu2


Figure 23.1 presents the resulting histograms of the fitted means mu1 and 
mu2. The overall separation is quite large, with a relatively small overlap 
between the right tail of the first distribution and the left tail of the second. 

Figure 23.1. Fitted means from FMM2 gamma regression 


MEs in each latent class 


Given the predicted mean in each latent class, we can compute separately the 
AME in each class. 


. * Gamma: Separate AME of totchr for each component 
. margins, dydx(totchr) predict(mu class(1)) predict(mu class(2)) noatlegend 


Average marginal effects                                Number of obs = 2,955
Model VCE: Robust
dy/dx wrt:  totchr

1._predict: Predicted mean (Total medical expenditure in class 1.Class),
            predict(mu class(1))
2._predict: Predicted mean (Total medical expenditure in class 2.Class),
            predict(mu class(2))

             |            Delta-method
             |       dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
totchr       |
    _predict |
          1  |    1697.876   126.9435    13.38   0.000      1449.072    1946.681
          2  |    4630.396   717.6815     6.45   0.000      3223.766    6037.025


The AMEs of $1,698 and $4,630 vary greatly across the components. Note 
that 0.7512683 × 1697.876 + 0.2487317 × 4630.396 = 2427.287, so the 
overall AME is the component-probability-weighted average of the component 
AMEs. 


Fitted posterior class probabilities 


After fitting an FMM, one can estimate the posterior probability that 
observation i belongs to latent class j, 

$$
\pi_{ij}^{\text{post}} = \frac{\pi_{ij}\, f_j(y_i \mid \mathbf{x}_{ij}, \boldsymbol{\theta}_j)}{\sum_{k=1}^{C} \pi_{ik}\, f_k(y_i \mid \mathbf{x}_{ik}, \boldsymbol{\theta}_k)}
$$

This quantity is the relative contribution of the jth component to the density 
$f(y_i \mid \mathbf{x}_i, \mathbf{z}_i)$ defined in (23.1). The postestimation command predict 
postprob*, classpost generates an estimate of the posterior probability of 
membership of every observation to each of the latent classes. 


A common rule is to assign each observation to the highest probability 
class; every observation must belong to one of the latent classes. The mean 
of the posterior probabilities in each latent class is the sample estimate of the 
parameter $\pi_j$, the population proportion of that class. The estat lcprob 
command delivers these mean values and associated confidence intervals. 
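
The assignment rule can be sketched as follows, assuming the two-component gamma FMM is the active estimation result. The names pp1, pp2, and modalclass are illustrative and do not appear elsewhere in the chapter.

```stata
* Sketch: modal class assignment after fmm 2
* pp* and modalclass are hypothetical variable names
predict pp*, classposteriorpr
* Assign each observation to the class with the larger posterior probability
generate byte modalclass = cond(pp1 >= pp2, 1, 2) if e(sample)
tabulate modalclass
* Means of the posterior probabilities estimate the class proportions
summarize pp1 pp2
```

The shares reported by tabulate need not equal the means of pp1 and pp2, because the modal rule discards how close each observation is to the classification boundary.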


It is informative to display the distribution of posterior probabilities 
given by the histograms. If the overlap in the distribution of posterior 
probabilities is small, then the mixture components are said to be well 
separated. 


. * Gamma: Generate and graph latent class posterior probabilities 
. predict postprob*, classpost 


. summarize postprob*

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
   postprob1 |      2,955    .7512679    .2753035   9.27e-24   .9684005
   postprob2 |      2,955    .2487321    .2753035   .0315995          1


. qui histogram postprob1
. qui histogram postprob2


Figure 23.2 shows good separation. 

Figure 23.2. Posterior probabilities from FMM2 gamma regression 


Model comparison 


The two-component mixture fits the data significantly better than the one- 
component model. Would adding a third component further improve the fit? 
More generally, what is the stopping rule regarding increasing C given that 
the fit of the model can always be improved, even if very slightly at times, 
by increasing C? 


While adding components may further improve the fit, there is also a 
potential for overfitting. Penalized likelihood criteria that incorporate the 
tradeoff between fit of the model and parsimony (number of parameters in 
the model) form the basis of model selection; see section 13.8.2. 


Aside from fit, in applied work, interpretability of the results is also 
important, and in that regard, qualitative features of the fit models may also 
play a role in model comparison and selection. 


We illustrate this approach in the context of model comparison of FMM2 
and FMM3 gamma mixture specifications. Such an exercise is usually 
worthwhile because the “correct” number of components is unknown. 
However, we show only parts of the full output. 


. * Gamma: Three-component finite mixture gamma regression 
. qui fmm 3, vce(robust): glm totexp totchr age female educyr private hvgg, 
> family(gamma) 

. estimates store FMM3 


. estat lcprob 


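Latent class marginal probabilities                     Number of obs = 2,955

             |            Delta-method
             |      Margin   std. err.     [95% conf. interval]
-------------+-------------------------------------------------
       Class |
           1 |    .5082894   .0552012      .4013778    .6144482
           2 |    .2627906   .0529629      .1725967     .378553
           3 |      .22892   .0315554      .1729591     .296496

. estat lcmean

Latent class marginal means                             Number of obs = 2,955

             |            Delta-method
             |      Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
1            |
      totexp |    4117.617   328.4428    12.54   0.000      3473.881    4761.353
-------------+------------------------------------------------------------------
2            |
      totexp |    4782.136   1046.427     4.57   0.000      2731.177    6833.096
-------------+------------------------------------------------------------------
3            |
      totexp |    18013.16   1508.672    11.94   0.000      15056.21     20970.1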


For the FMM3 mixture, the mixing proportions are (0.51, 0.26, 0.23). The 
corresponding latent class means are ($4,118, $4,782, $18,013), so the 
smallest (third) latent class also has the largest mean expenditure. The 
confidence intervals generated by applying the estat lcmean command 
show that the first two components have means that overlap considerably, 
while the third component is well separated from the other two components. 
The two smallest classes have significant overlap. 


The estimates stats postestimation command gives the information 
criteria. 


. * Gamma: Compare two-and three-component finite mixture gamma models 
. estimates stats FMM2 FMM3 


Akaike’s information criterion and Bayesian information criterion

       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
        FMM2 |      2,955          .  -28554.52      17   57143.04   57244.89
        FMM3 |      2,955          .  -28514.22      26   57080.43   57236.21

Note: BIC uses N = number of observations. See [R] BIC note. 


FMM3 is slightly superior to FMM2 because it has a BIC of 57,236 compared 
with 57,245 for the FMM2 model. 
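
The criteria can be reproduced by hand from the table, using the standard definitions AIC = -2 ln L + 2k and BIC = -2 ln L + k ln N, where k is the df column. The sketch below uses the rounded log likelihoods as printed, so results can differ from the table in the last digit.

```stata
* AIC = -2*ll + 2*k and BIC = -2*ll + k*ln(N), with ll, k, N from the table
display "FMM2 AIC = " %9.2f -2*(-28554.52) + 2*17
display "FMM2 BIC = " %9.2f -2*(-28554.52) + 17*ln(2955)
display "FMM3 AIC = " %9.2f -2*(-28514.22) + 2*26
display "FMM3 BIC = " %9.2f -2*(-28514.22) + 26*ln(2955)
```

The third component adds nine parameters. At N = 2,955, each parameter costs about 8 points of BIC but only 2 points of AIC, which is why the BIC gap between the models is much narrower than the AIC gap.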


While FMM3 is a slight improvement over FMM2, the gain in terms of 
interpretability of results is limited given the large overlap in the distribution 
of predictions for components 1 and 2. Under this consideration, there is a 
case for preferring FMM2 over FMM3. 


Allowing mixing proportions to vary 


In the previous example, the mixture proportions were kept fixed. In some 
cases, it may be appropriate to parameterize the class probability as a 
function of observable variables. The resulting specification is variously 
known as a varying probability FMM or a “mixture-of-experts” model. 


For illustration, we modify the FMM2 specification such that the class 
probabilities are functions of female and age and the component regressions 
are specified with regressors totchr, age, female, educyr, private, and 
hvgg. That is, female and age affect expenditures both directly and 
indirectly. 


. * Gamma: Two-component model with class probabilities varying with regressors 
. fmm 2, nolog lcprob(female age) vce(robust): glm totexp totchr age female 
> educyr private hvgg, family(gamma) link(log) 


Finite mixture model                                    Number of obs = 2,955
Log pseudolikelihood = -28551.101

             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
1.Class      |  (base outcome)
-------------+------------------------------------------------------------------
2.Class      |
      female |   -.1604032    .278804    -0.58   0.565      -.706849    .3860427
         age |     .047179   .0225356     2.09   0.036        .00301     .091348
       _cons |   -4.395367   1.563402    -2.81   0.005     -7.459579   -1.331155

Class: 1
Response: totexp
Model: glm, family(gamma)

             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
totexp       |
      totchr |    .4219811    .022485    18.77   0.000      .3779114    .4660507
         age |     .012682   .0063975     1.98   0.047      .0001432    .0252209
      female |   -.0079641   .0818668    -0.10   0.923     -.1684202    .1524919
      educyr |    .0387974   .0085443     4.54   0.000      .0220509    .0555439
     private |    .2706864    .058182     4.65   0.000      .1566517    .3847211
        hvgg |   -.0761219   .0600894    -1.27   0.205     -.1938949    .0416511
       _cons |    5.803233   .4794058    12.11   0.000      4.863615    6.742851
-------------+------------------------------------------------------------------
/totexp      |
        logs |   -.1812314   .0241721                     -.2286078    -.133855

Class: 2
Response: totexp
Model: glm, family(gamma)

             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
totexp       |
      totchr |    .2733363   .0403532     6.77   0.000      .1942455    .3524271
         age |   -.0238067   .0120089    -1.98   0.047     -.0473437   -.0002697
      female |   -.0886044    .141627    -0.63   0.532     -.3661881    .1889794
      educyr |     .013494   .0166899     0.81   0.419     -.0192176    .0462055
     private |    .0822525   .0935969     0.88   0.380     -.1011942    .2656991
        hvgg |   -.3682712   .0952091    -3.87   0.000     -.5548776   -.1816647
       _cons |    10.96605   .9102416    12.05   0.000      9.182007    12.75009
-------------+------------------------------------------------------------------
/totexp      |
        logs |   -.0177617   .0307591                     -.0780484    .0425249

. estat lcmean

Latent class marginal means                             Number of obs = 2,955

             |            Delta-method
             |      Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
1            |
      totexp |    3888.241   219.6352    17.70   0.000      3457.764    4318.718
-------------+------------------------------------------------------------------
2            |
      totexp |    16781.73   1446.632    11.60   0.000      13946.38    19617.07


For these data, there is only a small increase in the log likelihood, from 
-28,555 to -28,551. The increase in fit is not enough to compensate for the 
addition of two parameters, and the BIC increases from 57,245 to 57,254. The 
output shows that the mixture probability has weak but statistically 
significant (at 5%) positive dependence on age, and it is not expected that 
the AME estimates would change significantly. 
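
To see what the estimated age dependence implies, the class 2 probability can be evaluated at chosen covariate values by applying the inverse logit to the 2.Class index. The sketch below uses the coefficients reported above; the covariate values (a woman aged 75 or 85) are illustrative choices, not values singled out in the text.

```stata
* Class 2 probability for a woman aged 75 (illustrative covariate values)
display invlogit(-4.395367 - .1604032*1 + .047179*75)
* Same calculation for a woman aged 85
display invlogit(-4.395367 - .1604032*1 + .047179*85)
```

Because the age coefficient is positive, the second value exceeds the first: older individuals are more likely to belong to the high-expenditure class.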


23.3.2 Example 2: Logit and probit regression 


In binary outcome models, unobserved heterogeneity is often modeled as an 
additive random effect. A simple example is a logit model with random 
intercept; the random component could reflect heterogeneity in preferences. 
This specification can be further extended by allowing additive random 
components in slope coefficients; it is feasible to estimate such a 
specification using the melogit command. The present FMM, in which 
unobserved heterogeneity has a discrete representation, is analogous to a 
model with a finite number of “types” of individuals in the population, with 
each “type” having its own regression function. 


For binary outcome data, rather than think of the differences in latent 
classes in terms of the realized outcome that takes only the values 0 and 1, 
we think of differences in propensity scores, where a propensity score is an 
estimate of the conditional probability of observing y = 1, given other 
regressors. 


For illustration, we reconsider the binary logit example of section 17.4. 
A two-component FMM is estimated. The outcome is the health insurance 
status of an individual. Having private insurance is recorded as ins = 1. The 
regressors are health status, household income, years of education, and 
marital status. 


. * Logit: Read binary outcome chapter example data 
. qui use mus217hrs, clear 


. describe ins hstatusg hhincome educyear married 


Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
ins             float   %9.0g                 Person has private insurance
hstatusg        float   %9.0g                 Health status good
hhincome        float   %9.0g                 Total household income
educyear        double  %12.0g                Years of education
married         double  %12.0g                Married

The model estimates follow. 


. * Logit: Two-component finite mixture logit regression and predictions 
. fmm 2, nolog vce(robust): logit ins hstatusg hhincome educyear married 


Finite mixture model                                    Number of obs = 3,206
Log pseudolikelihood = -1991.8357

             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
1.Class      |  (base outcome)
-------------+------------------------------------------------------------------
2.Class      |
       _cons |   -1.112667    .599617    -1.86   0.064     -2.287895    .0625605

Class: 1
Response: ins
Model: logit

             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
ins          |
    hstatusg |    .1480639   .2305457     0.64   0.521     -.3037974    .5999251
    hhincome |    .0045661   .0016044     2.85   0.004      .0014215    .0077107
    educyear |    .1301162   .0322808     4.03   0.000       .066847    .1933853
     married |    .0206516   .6714559     0.03   0.975     -1.295378    1.336681
       _cons |   -3.022338   .4784035    -6.32   0.000     -3.959992   -2.084684

Class: 2
Response: ins
Model: logit

             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+------------------------------------------------------------------
ins          |
    hstatusg |     2.94736   3.691413     0.80   0.425     -4.287676     10.1824
    hhincome |   -.0088856   .0083216    -1.07   0.286     -.0251957    .0074245
    educyear |    .6708137   .3743691     1.79   0.073     -.0629364    1.404564
     married |    6.088392     4.5532     1.34   0.181     -2.835717     15.0125
       _cons |   -10.34426   5.939966    -1.74   0.082     -21.98638    1.297856


. predict classpr* 
(option mu assumed) 


. twoway (histogram classpr1, width(.05)) (histogram classpr2, 
> fcolor(white) width(.05)), legend(off) 
> xtitle("Predicted marginal probabilities for the two classes") scale(1.2) 


The coefficient estimates differ greatly across the two components. At the
same time, the improvement in fit is modest. From output not given, the
two-component model has AIC of 4005.7 and BIC of 4072.5 compared with
corresponding values of 4025.7 and 4056.1 for the logit model. So the AIC
criterion favors the mixture model, while the BIC criterion, which puts a
relatively higher weight on parsimony, favors the simple logit model.
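
The information criteria quoted for the two-component model can be checked by hand from the reported log pseudolikelihood. The following Python sketch assumes the usual definitions AIC = -2 ln L + 2k and BIC = -2 ln L + k ln N; the parameter count k = 11 is our own count from the fmm output (five logit coefficients in each class plus one class-probability constant), and the helper name aic_bic is hypothetical.

```python
import math

def aic_bic(loglik, k, n):
    """Return (AIC, BIC) for a model with k free parameters and n observations."""
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * math.log(n)
    return aic, bic

# Log pseudolikelihood and N are read from the fmm output above.
# k = 11 is an assumption: 5 coefficients per class plus 1 class constant.
aic_fmm2, bic_fmm2 = aic_bic(loglik=-1991.8357, k=11, n=3206)
print(round(aic_fmm2, 1), round(bic_fmm2, 1))
```

The BIC penalty per parameter is ln(3206), roughly 8.07, against 2 for AIC, which is why the six extra parameters of the mixture are penalized much more heavily under BIC.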


Figure 23.3 is a histogram of fitted probabilities that shows that the first
component distribution has a lower mean and smaller dispersion relative to
the second.


Figure 23.3. Fitted probabilities from FMM2 logit regression


The latent class means and posterior probabilities are summarized using 
the estat postestimation commands: 


. * Logit: Latent class marginal probabilities and latent class distribution means 
. estat lcmean 


Latent class marginal means                             Number of obs = 3,206

                           Delta-method
                  Margin     std. err.     [95% conf. interval]
1
         ins    .2540194     .055943      .1603109    .3778554
2
         ins    .7979645    .0889439      .5725601    .9209218


. estat lcprob, classposteriorpr

Latent class marginal posterior probabilities           Number of obs = 3,206

                           Delta-method
                  Margin     std. err.     [95% conf. interval]
Class
           1     .752626    .1111923      .4855336    .9074774
           2     .247374    .1111923      .0925226    .5144664


The average fitted values of the probability of having private insurance are 
low (0.254) for the first latent class and high (0.798) for the second latent 
class. The mixture weights (or the average of the posterior probabilities for 
observations in each class) are 0.75 and 0.25 for class 1 and 2, respectively. 
These weights are used to estimate the weighted sum of the AMEs. 
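
The reported mixture weights can be traced back to the class-membership constant in the fmm output. The sketch below assumes that the class probabilities follow a constant-only multinomial logit with class 1 as the base outcome, so that the class 2 probability is exp(c)/{1 + exp(c)}, where c is the _cons reported under 2.Class. The function name is hypothetical.

```python
import math

def class_probs_from_constant(c2):
    """Two-class probabilities from the class 2 constant (class 1 is the base)."""
    p2 = math.exp(c2) / (1.0 + math.exp(c2))
    return 1.0 - p2, p2

# -1.112667 is the _cons reported under 2.Class in the fmm logit output.
p1, p2 = class_probs_from_constant(-1.112667)
print(round(p1, 4), round(p2, 4))
```

With a constant-only class equation, these values should agree with the class 1 and class 2 margins in the estat lcprob table.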


The computations reported above can also be reproduced for probit and 
complementary log-log regression by replacing logit in the command 
statement with probit or cloglog. 


We next compare the AME of income on insurance status using the one- 
component and two-component models. 


. * Logit: ME of change in hhincome in one- and two-component models
. qui fmm 2, nolog vce(robust): logit ins hstatusg hhincome educyear married

. margins, dydx(hhincome)

Average marginal effects                                Number of obs = 3,206
Model VCE: Robust

Expression: Predicted mean (Person has private insurance), using class
            probabilities, predict(mu outcome(ins))
dy/dx wrt:  hhincome

                          Delta-method
                  dy/dx    std. err.       z    P>|z|     [95% conf. interval]
    hhincome   .0004929    .0002403     2.05   0.040      .000022    .0009638


. qui logit ins hstatusg hhincome educyear married, nolog

. margins, dydx(hhincome)

Average marginal effects                                Number of obs = 3,206
Model VCE: OIM

Expression: Pr(ins), predict()
dy/dx wrt:  hhincome

                          Delta-method
                  dy/dx    std. err.       z    P>|z|     [95% conf. interval]
    hhincome   .0004832     .000165     2.93   0.003     .0001599    .0008065


The difference is small (0.000493 versus 0.000483), which again confirms 
that for estimation of the weighted AME, there is no significant gain from 
using the FMM framework. The main advantage of the FMM specification is 
that it provides information about the heterogeneous response to changes. 


The parameter estimates of the model with all four regressors appearing
in each component do not appear to be robust. For example, the coefficients of
hstatusg and married have very large standard errors. In such cases, one
may wish to formally test the equality of coefficients across components or
choose a different set of regressors for each component of the mixture. Such a
model could potentially be an improvement on a one-component model.


We next illustrate how these steps are implemented. First, we generate 
Stata’s internal name list (coefficient legend). 


. * Logit: Obtain coefficient legend necessary for tests
. qui fmm 2, vce(robust): logit ins hstatusg hhincome educyear married

. fmm, coeflegend

Finite mixture model                                    Number of obs = 3,206
Log pseudolikelihood = -1991.8357

              Coefficient   Legend
1.Class       (base outcome)
2.Class
       _cons    -1.112667   _b[2.Class:_cons]

Class:    1
Response: ins
Model:    logit

              Coefficient   Legend
ins
    hstatusg     .1480639   _b[ins:1.Class#c.hstatusg]
    hhincome     .0045661   _b[ins:1.Class#c.hhincome]
    educyear     .1301162   _b[ins:1.Class#c.educyear]
     married     .0206516   _b[ins:1.Class#c.married]
       _cons    -3.022338   _b[ins:1.Class]

Class:    2
Response: ins
Model:    logit

              Coefficient   Legend
ins
    hstatusg      2.94736   _b[ins:2.Class#c.hstatusg]
    hhincome    -.0088856   _b[ins:2.Class#c.hhincome]
    educyear     .6708137   _b[ins:2.Class#c.educyear]
     married     6.088392   _b[ins:2.Class#c.married]
       _cons    -10.34426   _b[ins:2.Class]


Second, given these complex names, we apply the test command to test 
equality of coefficients. 


. * Logit: Tests of coefficient equality and of regressor relevance
. test (_b[ins:1.Class#c.hstatusg] = _b[ins:2.Class#c.hstatusg])
>      (_b[ins:1.Class#c.hhincome] = _b[ins:2.Class#c.hhincome])
>      (_b[ins:1.Class#c.educyear] = _b[ins:2.Class#c.educyear])
>      (_b[ins:1.Class#c.married] = _b[ins:2.Class#c.married]), mtest

 ( 1)  [ins]1bn.Class#c.hstatusg - [ins]2.Class#c.hstatusg = 0
 ( 2)  [ins]1bn.Class#c.hhincome - [ins]2.Class#c.hhincome = 0
 ( 3)  [ins]1bn.Class#c.educyear - [ins]2.Class#c.educyear = 0
 ( 4)  [ins]1bn.Class#c.married - [ins]2.Class#c.married = 0

               chi2     df   p > chi2
  (1)          0.54      1     0.4623*
  (2)          2.95      1     0.0861*
  (3)          2.31      1     0.1283*
  (4)          2.23      1     0.1357*
  All          7.40      4     0.1163

     * Unadjusted p-values


A joint test does not reject the null hypothesis of equality of all coefficients
at level 0.05. With individual tests, the equality hypothesis is not rejected for
any variable at level 0.05. If this is viewed as a multiple testing situation (see
section 11.6) and test command option mtest(sidak) is used to compute
the Sidak correction, then again no individual equality hypothesis is rejected.
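
The effect of the Sidak correction on the four individual tests can be sketched directly from the unadjusted p-values listed above. The Python snippet below assumes the standard Sidak formula for m tests, adjusted p = 1 - (1 - p)^m, capped at 1; the function name is hypothetical, and the snippet is an arithmetic illustration rather than a reproduction of Stata output.

```python
def sidak_adjust(pvalues):
    """Sidak-adjusted p-values for a family of len(pvalues) tests."""
    m = len(pvalues)
    return [min(1.0, 1.0 - (1.0 - p) ** m) for p in pvalues]

# Unadjusted p-values for hstatusg, hhincome, educyear, and married.
unadjusted = [0.4623, 0.0861, 0.1283, 0.1357]
adjusted = sidak_adjust(unadjusted)
for name, p_raw, p_adj in zip(["hstatusg", "hhincome", "educyear", "married"],
                              unadjusted, adjusted):
    print(f"{name:10s} {p_raw:.4f} {p_adj:.4f}")
```

Because each adjusted p-value is at least as large as its unadjusted counterpart, a hypothesis that is not rejected without adjustment cannot be rejected after it.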


A next step might be to respecify the model so class 1 regressors are
hhincome educyear and class 2 regressors are hstatusg educyear married.


23.3.3 Example 3: Multinomial logit regression 


The multinomial logit model (MNL) was considered in section 18.4. Together
with the multinomial probit (MNP), this specification is standard for modeling
unordered choices when there are more than two alternatives. MNL is
computationally easier to estimate than MNP but suffers from the
independence of irrelevant alternatives property that places nontrivial
restrictions on the choice set; see section 18.6.1. One relatively
straightforward way of overcoming this limitation is to replace MNL by a finite
mixture that is still computationally easier to estimate than MNP.


For illustration, we use the same choice-of-fishing-mode data as used in 
section 18.3. The fishing site is one of four possible sites. For simplicity, 
only one regressor, income, is used. 


. * MNL: Read multinomial outcome chapter example data 
. qui use mus218hk, clear 


. describe mode income 


Variable      Storage   Display    Value
    name         type    format    label      Variable label
mode            float   %9.0g      modetype   Fishing mode
income          float   %9.0g                 Monthly income in thousands $


. summarize mode income

    Variable         Obs        Mean    Std. dev.        Min        Max
        mode       1,182    3.005076    .9936162           1          4
      income       1,182    4.099337    2.461964    .4166667       12.5


We fit a two-component MNL FMM. 


. * MNL: Two-component finite mixture logit regression
. fmm 2, nolog vce(robust): mlogit mode income

Finite mixture model                                    Number of obs = 1,182
Log pseudolikelihood = -1466.0383

                             Robust
              Coefficient   std. err.       z    P>|z|     [95% conf. interval]
1.Class       (base outcome)
2.Class
       _cons     .6965883   .1996388     3.49   0.000     .3053035    1.087873

Class:    1
Response: mode
Model:    mlogit

                             Robust
              Coefficient   std. err.       z    P>|z|     [95% conf. interval]
1.mode        (base outcome)
2.mode
      income     .4677133   .5583008     0.84   0.402    -.6265362    1.561963
       _cons     .5910911   .8454539     0.70   0.484    -1.065968     2.24815
3.mode
      income     3.737117    .941986     3.97   0.000     1.890858    5.583375
       _cons    -23.02747   7.881857    -2.92   0.003    -38.47563   -7.579318
4.mode
      income     1.926167   .6855103     2.81   0.005     .5825912    3.269742
       _cons    -1.985153   1.795477    -1.11   0.269    -5.504224    1.533917

Class:    2
Response: mode
Model:    mlogit

                             Robust
              Coefficient   std. err.       z    P>|z|     [95% conf. interval]
1.mode        (base outcome)
2.mode
      income    -.0306097   .0889195    -0.34   0.731    -.2048887    .1436692
       _cons     .0997866   .5711038     0.17   0.861    -1.019556     1.21913
3.mode
      income     .0140262   .0523073     0.27   0.789    -.0884941    .1165466
       _cons     1.177968   .3004752     3.92   0.000     .5890475    1.766889
4.mode
      income     -.441771   .1714284    -2.58   0.010    -.7777645   -.1057775
       _cons     1.737714    .579962     3.00   0.003     .6010095    2.874419


The coefficients in the mode equations appear to differ markedly between the
two classes, though no formal test of equality is reported here.


The next set of results obtains the latent class probabilities. 


. * MNL: Latent class marginal probabilities and distribution means
. estat lcmean

Latent class marginal means                             Number of obs = 1,182

                           Delta-method
                  Margin     std. err.     [95% conf. interval]
1
mode
       Beach    .0397643     .025144      .0112634    .1308403
        Pier    .1597699    .0578606      .0755287    .3067894
     Private    .0234452    .0042244      .0164475    .0333192
     Charter    .7770206    .0620006      .6334436    .8754201
2
mode
       Beach    .1497711    .0193181      .1157062    .1916909
        Pier    .1450983    .0309011      .0943342    .2166453
     Private    .5176325     .034598      .4499131    .5847104
     Charter    .1874981    .0553217      .1017353    .3198181


. estat lcprob, classposteriorpr

Latent class marginal posterior probabilities           Number of obs = 1,182

                           Delta-method
                  Margin     std. err.     [95% conf. interval]
Class
           1    .3325691    .0410668       .257461    .4172751
           2    .6674309    .0410668      .5827249     .742539


The latent class probabilities are estimated to be 0.33 and 0.67, respectively. 


The margins command yields the AME of income on each mode of 
fishing. 


. * MNL: ME of regressor changes in two-component model
. margins, dydx(*)

Average marginal effects                                Number of obs = 1,182
Model VCE: Robust

dy/dx wrt:  income

1._predict: Predicted mean (1.mode), using class probabilities, predict(mu
            outcome(mode 1))
2._predict: Predicted mean (2.mode), using class probabilities, predict(mu
            outcome(mode 2))
3._predict: Predicted mean (3.mode), using class probabilities, predict(mu
            outcome(mode 3))
4._predict: Predicted mean (4.mode), using class probabilities, predict(mu
            outcome(mode 4))

                          Delta-method
                  dy/dx    std. err.       z    P>|z|     [95% conf. interval]
income
    _predict
           1    -.005324    .0056746    -0.94   0.348     -.016446    .0057981
           2   -.0304964    .0062628    -4.87   0.000    -.0427712   -.0182216
           3    .0325652    .0071824     4.53   0.000      .018488    .0466424
           4    .0032552    .0082157     0.40   0.692    -.0128474    .0193577


Only fishing modes 2 and 3 provide statistically significant (at 5%) estimates
of the AME of income. An increase in income reduces the probability that
mode 2 would be chosen and increases the probability that mode 3 would be
chosen. Qualitatively, this result is similar to what was obtained for the
(one-component) MNL model; see section 18.4.2. The lack of improved precision
in this example is not surprising. From output not included, the
two-component model had lower AIC than the one-component model (2,958.1
versus 2,966.3) but higher BIC (3,024.1 versus 2,996.8).
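
A useful sanity check on the AMEs is that they sum to zero across the four fishing modes: the predicted probabilities sum to one for every observation, so a change in income can only reallocate probability between modes. The Python sketch below uses the four dy/dx values from the margins output; the variable names are ours.

```python
# AMEs of income on Pr(mode = m), m = 1,...,4, from the margins output above.
ame_income = {1: -0.005324, 2: -0.0304964, 3: 0.0325652, 4: 0.0032552}

total = sum(ame_income.values())
print(f"sum of AMEs = {total:.7f}")
```

Up to rounding in the displayed output, the gains for modes 3 and 4 offset the losses for modes 1 and 2.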


23.3.4 Example 4: Tobit regression 


Tobit regression is an established tool for handling censored data. The tobit
command yields estimates of $\boldsymbol{\beta}$ and $\sigma^2$ in the model
$y^* = \mathbf{x}'\boldsymbol{\beta} + \varepsilon$, where
$\varepsilon \sim N(0, \sigma^2)$ and $y = y^*$ if $y^* > 0$ and $y = 0$ otherwise.


However, tobit regression is based on the strong assumptions of 
normality and homoskedasticity—assumptions that frequently fail diagnostic 
tests. We can relax the normality assumption by replacing the standard tobit 
model with a mixture of tobits. This also goes some way toward relaxing the 
homoskedasticity assumption. 


We illustrate a mixture of tobits using the same data as in section 19.3. 
The dependent variable is the level of health expenditures, with 16% of 
observations censored at 0. Estimation of a two-component model yields 


. * Tobit: Two-component finite mixture tobit
. qui use mus219mepsambexp, clear

. global xlist age female educ blhisp totchr ins

. fmm 2, vce(robust) nolog: tobit ambexp $xlist, ll(0)

Finite mixture model                                    Number of obs = 3,328
Log pseudolikelihood = -24776.681

                             Robust
              Coefficient   std. err.       z    P>|z|     [95% conf. interval]
1.Class       (base outcome)
2.Class
       _cons    -1.103018   .1924958    -5.73   0.000    -1.480303   -.7257335

Class:    1
Response: ambexp                                 Censoring of obs:
Model:    tobit                                      Uncensored     = 2,802
                                                     Left-censored  =   526
                                                     Right-censored =     0

                             Robust
              Coefficient   std. err.       z    P>|z|     [95% conf. interval]
ambexp
         age     98.02374   22.58601     4.34   0.000     53.75596    142.2915
      female      353.825   46.33263     7.64   0.000     263.0147    444.6353
        educ      33.5267   6.584909     5.09   0.000     20.62052    46.43289
      blhisp    -182.4351   40.72239    -4.48   0.000    -262.2495   -102.6207
      totchr     443.3924   45.40638     9.76   0.000     354.3976    532.3873
         ins      34.8888   34.74336     1.00   0.315    -33.20693    102.9845
       _cons     -577.927   127.9969    -4.52   0.000    -828.7962   -327.0577
var(e.ambexp)    430559.5   81254.07                      297438.2    623260.7

Class:    2
Response: ambexp                                 Censoring of obs:
Model:    tobit                                      Uncensored     = 2,802
                                                     Left-censored  =   526
                                                     Right-censored =     0

                             Robust
              Coefficient   std. err.       z    P>|z|     [95% conf. interval]
ambexp
         age     531.4028    139.278     3.82   0.000     258.4229    804.3826
      female     1046.152   388.9827     2.69   0.007       283.76    1808.544
        educ      123.732   59.06497     2.09   0.036     7.966774    239.4972
      blhisp    -1087.516   439.5393    -2.47   0.013    -1948.997   -226.0345
      totchr     1947.121   237.1148     8.21   0.000     1482.385    2411.858
         ins    -779.8928   364.4406    -2.14   0.032    -1494.183   -65.60237
       _cons    -2223.764   1064.804    -2.09   0.037    -4310.741   -136.7874
var(e.ambexp)    1.89e+07    5793808                      1.03e+07    3.44e+07


. estimates store FMM2 


The parameter estimates differ greatly across the two components. From 
output not included, the first component has probability 0.751 and average 
predicted mean of y* of 619, while the second component has probability 
0.249 and much higher average predicted mean of 2,443. 


The margins command provides margins for the latent variable $y^*$ and
not the observed outcome $y$. Because $E(y^*|\mathbf{x}) = \mathbf{x}'\boldsymbol{\beta}$,
the resulting MEs from the two-component tobit model are simply the class
probability weighted averages of the slope coefficients already given.


We have 


. * Tobit: AME for two-component finite mixture tobit for E[y*|x]
. margins, dydx(*)

Average marginal effects                                Number of obs = 3,328
Model VCE: Robust

Expression: Predicted mean (Annual ambulatory expenditures), using class
            probabilities, predict(mu outcome(ambexp))
dy/dx wrt:  age female educ blhisp totchr ins

                          Delta-method
                  dy/dx    std. err.       z    P>|z|     [95% conf. interval]
         age    206.0109    35.53168     5.80   0.000       136.37    275.6517
      female    526.3354    97.20772     5.41   0.000     335.8118    716.8591
        educ    56.00358    14.35708     3.90   0.000     27.86422    84.14295
      blhisp   -407.9584    114.6833    -3.56   0.000    -632.7336   -183.1832
      totchr    818.0837    63.75876    12.83   0.000     693.1188    943.0486
         ins   -168.1342    82.63214    -2.03   0.042    -330.0902   -6.178186
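
The statement that these AMEs are class-probability weighted averages of the component slopes can be verified by hand. The Python sketch below recovers the class probabilities from the 2.Class constant, assuming a constant-only logit for class membership with class 1 as base, and then averages the class 1 and class 2 tobit coefficients. The dictionaries simply transcribe the output above (the class 2 ins coefficient is entered as -779.8928); all names are ours.

```python
import math

# Class probabilities from the 2.Class constant (assumed constant-only logit).
c2 = -1.103018
p2 = math.exp(c2) / (1.0 + math.exp(c2))
p1 = 1.0 - p2

# Slope coefficients transcribed from the two-component tobit output.
class1 = {"age": 98.02374, "female": 353.825, "educ": 33.5267,
          "blhisp": -182.4351, "totchr": 443.3924, "ins": 34.8888}
class2 = {"age": 531.4028, "female": 1046.152, "educ": 123.732,
          "blhisp": -1087.516, "totchr": 1947.121, "ins": -779.8928}

# AMEs reported by margins, for comparison.
reported = {"age": 206.0109, "female": 526.3354, "educ": 56.00358,
            "blhisp": -407.9584, "totchr": 818.0837, "ins": -168.1342}

weighted = {v: p1 * class1[v] + p2 * class2[v] for v in class1}
for v in class1:
    print(f"{v:8s} weighted = {weighted[v]:10.4f}   margins = {reported[v]:10.4f}")
```

Any remaining discrepancy is at the level of rounding in the displayed coefficients.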


Finally, we compare the goodness of fit of the two-component tobit 
model with the regular tobit model. 


. * Tobit: Compare FMM2 to regular tobit using AIC and BIC
. qui tobit ambexp $xlist, ll(0) vce(robust)

. estimates store FMM1

. estimates stats FMM1 FMM2

Akaike's information criterion and Bayesian information criterion

       Model         N   ll(null)   ll(model)     df        AIC        BIC
        FMM1     3,328  -26706.46   -26359.42      8   52734.85   52783.73
        FMM2     3,328          .   -24776.68     17   49587.36   49691.23

Note: BIC uses N = number of observations. See [R] BIC note.


The FMM2 tobit has greatly superior fit using the information criteria.


23.3.5 Example 5: Poisson regression 


In this section, we reconsider the doctor-visits example, which appeared in
section 20.5.4. There the application of the FMM to count regression was
explored in detail. The discussion of the current application is confined to
the more important features only. We compare the basic Poisson regression
with FMM2 Poisson and FMM3 Poisson in terms of goodness of fit and their
implications for heterogeneity in response to a representative regressor,
which we take to be totchr. We also calculate the latent class means using
the estat lcmean command and the mixture probability weights using the
estat lcprob command.


. * Poisson: Two-component finite mixture Poisson using count chapter data
. use mus220mepsdocvis, clear
(A.C.Cameron & P.K.Trivedi (2022): Microeconometrics Using Stata, 2e)

. global xlist private medicaid age age2 educyr actlim totchr

. qui fmm 2, vce(robust): poisson docvis $xlist

. estimates store fmm2pois

. estat lcmean

Latent class marginal means                             Number of obs = 3,677

                          Delta-method
                 Margin    std. err.       z    P>|z|     [95% conf. interval]
1
      docvis   5.050474    .2713086    18.62   0.000     4.518719    5.582229
2
      docvis   11.65096    .6255685    18.62   0.000     10.42487    12.87706


. estat lcprob

Latent class marginal probabilities                     Number of obs = 3,677

                           Delta-method
                  Margin     std. err.     [95% conf. interval]
Class
           1    .6452176    .0268118      .5911008    .6958574
           2    .3547824    .0268118      .3041426    .4088992


Comparing the basic Poisson and FMM2 Poisson from output given at the end
of this subsection, we see there is a major improvement in the log likelihood.
The two classes have average doctor visits of 5.05 and 11.65 for the first and 
second latent classes, respectively. The first component, the lower usage 
group, has an estimated probability weight of 0.65. The second component is 
a high-user group, and it has probability weight of 0.35. 


We next compute and report the AME of totchr on doctor visits from two 
models, FMM2 and one-component Poisson. 


. * Poisson: ME of regressor changes in two-component model
. margins, dydx(totchr)

Average marginal effects                                Number of obs = 3,677
Model VCE: Robust

Expression: Predicted mean (# doctor visits), using class probabilities,
            predict(mu outcome(docvis))
dy/dx wrt:  totchr

                          Delta-method
                  dy/dx    std. err.       z    P>|z|     [95% conf. interval]
      totchr   1.855912    .1470048    12.62   0.000     1.567788    2.144036


. qui poisson docvis $xlist, vce(robust)

. estimates store poisson

. margins, dydx(totchr)

Average marginal effects                                Number of obs = 3,677
Model VCE: Robust

Expression: Predicted number of events, predict()
dy/dx wrt:  totchr

                          Delta-method
                  dy/dx    std. err.       z    P>|z|     [95% conf. interval]
      totchr   1.694685    .0908883    18.65   0.000     1.516547    1.872823


The AME of totchr is estimated to be 1.856 visits, not too different 
from 1.695 for the simpler Poisson model. 


The detailed FMM2 Poisson coefficient estimates, not displayed above, 
show that the coefficient of totchr in the FMM2 regression is 0.329 for the 
first component and 0.173 for the second, so the multiplicative effect of an 
additional chronic condition differs across the two components. Thus, from 
output not listed, the ME of totchr evaluated at the sample means of the 
regressors is 4.84 for the FMM2 model and is 1.16 for the Poisson model. 
Hence, the mixture model is informative about heterogeneity in response 
between the two latent classes. 
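
The multiplicative interpretation of the two totchr coefficients can be made concrete. In a Poisson model with exponential conditional mean, a one-unit increase in a regressor multiplies the conditional mean by exp(b), so the proportionate change is exp(b) - 1. The Python sketch below applies this to the two component coefficients quoted above; the variable names are ours.

```python
import math

# totchr coefficients in the two FMM2 Poisson components, as quoted in the text.
coef = {"class 1": 0.329, "class 2": 0.173}

# Percentage change in the component mean for one more chronic condition.
pct_change = {c: 100.0 * (math.exp(b) - 1.0) for c, b in coef.items()}
for c, pct in pct_change.items():
    print(f"{c}: {pct:.1f}% increase in expected doctor visits")
```

The proportionate response is therefore roughly twice as large in the low-use class, even though the high-use class starts from a much higher mean.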


A finer distinction between the latent groups in the population can be
obtained by adding a further regression component. FMM3 Poisson regression
yields


. * Poisson: Three-component finite mixture Poisson regression 
. qui fmm 3, nolog vce(robust): poisson docvis $xlist 


. estat lcmean 


Latent class marginal means                             Number of obs = 3,677

                          Delta-method
                 Margin    std. err.       z    P>|z|     [95% conf. interval]
1
      docvis   3.443653    .1134148    30.36   0.000     3.221365    3.665942
2
      docvis   31.63433    2.238283    14.13   0.000     27.24737    36.02128
3
      docvis   11.63865    .3487926    33.37   0.000     10.95503    12.32227


. estat lcprob

Latent class marginal probabilities                     Number of obs = 3,677

                           Delta-method
                  Margin     std. err.     [95% conf. interval]
Class
           1    .6234456    .0177598      .5880541    .6575672
           2    .0391851    .0061092      .0288178    .0530781
           3    .3373693    .0168279       .305217    .3710995


. estimates store fmm3pois 


The first and largest class with probability 0.62 corresponds to a low-use 
group with an average of 3.44 doctor visits. The second and smallest class 
with probability 0.04 has an average of 31.63 doctor visits, and the third 
class with probability 0.34 has an average of 11.64 visits. 


A table of information criteria is generated using the estimates stats 
command. 


. * Poisson: Goodness-of-fit for two- and three-component models
. estimates stats poisson fmm2pois fmm3pois

Akaike's information criterion and Bayesian information criterion

       Model         N   ll(null)   ll(model)     df        AIC        BIC
     poisson     3,677  -17258.63   -15019.64      8   30055.28   30104.96
    fmm2pois     3,677          .   -12100.19     17   24234.37   24339.94
    fmm3pois     3,677          .   -10907.21     26   21866.42   22027.88

Note: BIC uses N = number of observations. See [R] BIC note.


The third added component improves the log likelihood enough to be 
preferred using the AIC and BIC criteria. Compared with the three-component 
model, the two-component fmm2pois model essentially merges the last two 
latent classes. 


23.3.6 Example 6: Mixture distribution with point mass 


In all the previously considered mixture distributions, the components were 
members of a common family of distributions. There is no restriction that 
this should always be the case. Valid mixture distributions can be 
constructed using members of different families of distributions. 


Zero-inflated Poisson and negative binomial distributions, previously 
covered in section 20.6, are leading examples of point-mass mixtures. These 
models are a mixture of a binary regression (often logit) for modeling the 
point mass at zero and a Poisson or negative binomial regression for counts. 


The following example for number of emergency room visits that are 
zero for most of the sample provides an fmm prefix equivalent to the zinb 
command for count data. The output is omitted because it is provided in 
section 20.6.5. 


. * fmm point-mass two-component command same as zero-inflated negative binomial 
. use mus220mepsemergroom, clear 


. fmm: (pointmass er, lcprob(age 1.actlim totchr)) (nbreg er age 1.actlim totchr)
  (output omitted)

. zinb er age 1.actlim totchr, inflate(age 1.actlim totchr)
  (output omitted)


23.3.7 Example 7: Mixture application using survival data 


We reconsider the model of the duration of the length of jobless spells, using 
data analyzed in section 21.2. Unobserved heterogeneity (“frailty”), usually 
treated as a continuous random variable, plays an important role in duration 
models. An FMM can be constructed using components with such a structure; 
a mixture of Weibull is an example. 


The dependent variable spell is the number of periods jobless 
(measured in two-week intervals) and is censored from above. The 
regression includes ui (unemployment insurance status) and logwage as 
regressors. Data definitions and summary follow. 


. * Weibull: Read in and describe unemployment duration data from survival chapter 
. qui use mus221imccall, clear 


. describe spell ui logwage 


Variable      Storage   Display    Value
    name         type    format    label      Variable label
spell           int     %8.0g                 Periods jobless: two-week intervals
ui              int     %8.0g                 Filed UI claim
logwage         float   %8.0g                 log weekly earnings


. summarize spell ui logwage

    Variable         Obs        Mean    Std. dev.        Min        Max
       spell       3,343    6.247981    5.611271           1         28
          ui       3,343    .5527969    .4972791           0          1
     logwage       3,343    5.692994    .5356591     2.70805   7.600402


Next, we fit the standard Weibull regression model. To save space, we do 
not report full output, but the AMEs are shown. 


. * Weibull: Standard Weibull regression (one component) as benchmark 
. stset spell, fail(censor1=1) 
Survival-time data settings

         Failure event: censor1==1
Observed time interval: (0, spell]
     Exit on or before: failure

      3,343  total observations
          0  exclusions

      3,343  observations remaining, representing
      1,073  failures in single-record/single-failure data
     20,887  total analysis time at risk and under observation
                                                At risk from t =         0
                                     Earliest observed entry t =         0
                                          Last observed exit t =        28

. qui streg i.ui logwage, nohr vce(robust) dist(weibull) nolog 
. margins, dydx(ui)

Average marginal effects                                Number of obs = 3,343
Model VCE: Robust

Expression: Predicted median _t, predict()
dy/dx wrt:  1.ui

                          Delta-method
                  dy/dx    std. err.       z    P>|z|     [95% conf. interval]
        1.ui   14.02813    1.066691    13.15   0.000     11.93746    16.11881


Note: dy/dx for factor levels is the discrete change from the base level. 


The AME of having unemployment insurance (ui) on unemployment duration 
is positive. 


Although the Weibull duration model is very widely used, in principle, a 
two-component mixture of Weibull should provide a better fit to the data. 
We use the fmm prefix to fit such a model. 


. * Weibull: Two-component finite mixture Weibull regression 
. fmm 2, nolog vce(robust): streg i.ui logwage, distribution(weibull) 


Finite mixture model                                    Number of obs = 3,343
Log pseudolikelihood = -3938.0613

                             Robust
              Coefficient   std. err.       z    P>|z|     [95% conf. interval]
1.Class       (base outcome)
2.Class
       _cons     1.368373   .1815214     7.54   0.000     1.012598    1.724149

Class:    1
Response: _t                                    No. of failures =     1,073
Model:    streg, dist(weibull)                  Time at risk    = 20,887.00

                             Robust
              Coefficient   std. err.       z    P>|z|     [95% conf. interval]
_t
        1.ui    -3.662912   .5784759    -6.33   0.000    -4.796704    -2.52912
     logwage     .4706624   .2990599     1.57   0.116    -.1154842    1.056809
       _cons    -3.711542   1.833207    -2.02   0.043    -7.304561   -.1185235
/_t
        ln_p     .9782369   .1106777                      .7613126    1.195161

Class:    2
Response: _t                                    No. of failures =     1,073
Model:    streg, dist(weibull)                  Time at risk    = 20,887.00

                             Robust
              Coefficient   std. err.       z    P>|z|     [95% conf. interval]
_t
        1.ui    -1.108708   .1095215   -10.12   0.000    -1.323366   -.8940497
     logwage      .485504   .0881343     5.51   0.000     .3127639    .6582441
       _cons    -6.509076    .583254   -11.16   0.000    -7.652233   -5.365919
/_t
        ln_p     .3034215   .0463767                      .2125248    .3943182
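
The ancillary parameter ln_p in each class is the log of the Weibull shape parameter p. As a small illustration, the Python sketch below converts the two reported values to the shape scale, assuming the usual Weibull hazard proportional to p t^(p-1), so that p > 1 implies a hazard that increases with elapsed duration. The variable names are ours.

```python
import math

# ln_p estimates for the two classes, read from the fmm streg output.
ln_p = {"class 1": 0.9782369, "class 2": 0.3034215}

# Weibull shape parameter p = exp(ln_p); p > 1 means an increasing hazard.
shape = {c: math.exp(v) for c, v in ln_p.items()}
for c, p in shape.items():
    direction = "increasing" if p > 1 else "nonincreasing"
    print(f"{c}: p = {p:.3f} ({direction} hazard)")
```

Both components therefore display positive duration dependence, with a much steeper hazard in the first class.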


The results are easier to interpret if we obtain latent class probabilities
and means.


. * Weibull: Latent class probabilities and means for two-component model
. estat lcprob

Latent class marginal probabilities                     Number of obs = 3,343

                           Delta-method
                  Margin     std. err.     [95% conf. interval]
Class
           1    .2028828    .0293559      .1513376    .2664717
           2    .7971172    .0293559      .7335283    .8486624


. estat lcmean

Latent class marginal means                             Number of obs = 3,343

                          Delta-method
                 Margin    std. err.       z    P>|z|     [95% conf. interval]
1
          _t   3.441098    .3002856    11.46   0.000     2.852549    4.029647
2
          _t    24.8878     1.58575    15.69   0.000     21.77979    27.99581


The first component has probability 0.20 and a short average duration of 3.4
periods, while the second component has probability 0.80 with a much
longer average duration of 24.9 periods.


Finally, we obtain the AMEs. 


. * Weibull: MEs for two-component model 

. margins, dydx(ui) 

Average marginal effects                                Number of obs = 3,343
Model VCE: Robust

Expression: Predicted mean (Analysis time when record ends), using class
            probabilities, predict(mu outcome(_t))
dy/dx wrt:  1.ui

                          Delta-method
                  dy/dx    std. err.       z    P>|z|     [95% conf. interval]
        1.ui   15.76266    1.939978     8.13   0.000     11.96037    19.56495


Note: dy/dx for factor levels is the discrete change from the base level. 


The AME of having unemployment insurance is 15.76 periods of
unemployment, compared with 14.03 periods using the standard Weibull
model. From output not given, the log likelihood rises from -4,008.8 to
-3,938.1, leading to considerably lower BIC.


23.3.8 Heterogeneity and quantile regression 


FMMs provide a parametric approach for modeling heterogeneous responses.
In section 15.3, we used the conditional quantile regression (CQR) method, a
semiparametric estimator, to analyze medical expenditures at different
quantiles of the expenditure distribution. We used postestimation tests of
equality of regression coefficients at different quantiles. CQR is a
well-established method for studying heterogeneous responses of continuous
outcomes. The method has been extended to count regressions; see
section 20.9.


23.4 Nonlinear mixed-effects models 


Mixed-effects models or multilevel models or hierarchical models permit more 
flexible models with several levels of nesting, such as students nested in classes 
nested in schools. Special cases of these models include models with intercepts 
and slopes that may be random. This can lead to more efficient estimation, and 
additionally, this approach is especially advantageous in settings where 
heterogeneity is of intrinsic interest. 


Linear mixed models can be fit using the mixed command, presented in 
section 6.7. Here we present various me commands that extend mixed effects to 
nonlinear regression models. 


The meglm command extends mixed-effects analysis to the class of GLMs
presented in section 13.3.7. Some GLMs are used so often that Stata additionally
provides distinct estimation commands, rather than requiring use of the meglm
command and specification of the relevant link function and family. These
distinct commands are melogit, meprobit, mecloglog, meologit, meoprobit,
mepoisson, and menbreg.


Additional nonlinear mixed commands are mestreg for parametric survival
models, menl for models with a nonlinear conditional mean and additive errors,
meintreg for interval-regression models, and metobit for tobit models.


Nonlinear mixed-effects models face two challenges that are not present in 
the linear case. First, with rare exception, any model misspecification leads to 
inconsistency of parameter estimates. By contrast, in the linear case, 
consistency requires only correct specification of the conditional mean, though 
in the linear case, standard errors still need to be adjusted if the distribution for 
the mixed effects is misspecified. Thus, it may be best in the nonlinear case to 
use rich models for the mixed effects rather than, for example, having only 
intercepts that are random. 


Second, the models are nonlinear models that can be computationally 
challenging to fit. 


23.4.1 Mixed-effects GLM and the meglm command 


From section 13.3.7, a GLM specifies that $y_i$ has conditional mean
$E(y_i|\mathbf{x}_i) = g^{-1}(\mathbf{x}_i'\boldsymbol{\beta})$, where $g(\cdot)$
is called the link function. For example, for Poisson regression,
$E(y_i|\mathbf{x}_i) = \exp(\mathbf{x}_i'\boldsymbol{\beta})$, and the link
function $g(\cdot)$ is $\ln(\cdot)$, the inverse of the exponential function.


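We consider a two-level mixed-effects GLM for simplicity with individual $i$
in group $j$. Then, we add the mixed-effects term $\mathbf{z}_{ij}'\mathbf{u}_j$,
where $\mathbf{z}_{ij}$ are observable variables and $\mathbf{u}_j$ is a vector
of i.i.d. normally distributed random effects. We have

$$E(y_{ij}|\mathbf{x}_{ij}, \mathbf{z}_{ij}, \mathbf{u}_j) = g^{-1}(\mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\mathbf{u}_j)$$

where $\mathbf{u}_j \sim N(\mathbf{0}, \boldsymbol{\Sigma}_u)$ and the specified
link function $g(\cdot)$ differs for different GLMs.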


Specific choices of $\mathbf{z}_{ij}$ lead to some standard models. A regular
GLM corresponds to $\mathbf{z}_{ij} = \mathbf{0}$. A random-effects model with
random intercept corresponds to $z_{ij} = 1$ because only the intercept is
random, and then $\mathbf{z}_{ij}'\mathbf{u}_j = u_j$. A model often called the
random-coefficients model sets $\mathbf{z}_{ij} = \mathbf{x}_{ij}$ so that, for
the regressor $\mathbf{x}_{ij}$, both the intercept and slope coefficients are
random.


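Denote the conditional density for an individual observation as
$f(y_{ij}|\eta_{ij})$, where $\eta_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\mathbf{u}_j$.
For example, for the Poisson model,
$f(y_{ij}|\eta_{ij}) = \exp(-\mu_{ij})\mu_{ij}^{y_{ij}}/y_{ij}!$ with
$\mu_{ij} = \exp(\eta_{ij})$. The log density for all $N_j$ individuals in group
$j$ is $\sum_{i=1}^{N_j} \ln f(y_{ij}|\eta_{ij})$. The problem is that this log
density includes the random effects $\mathbf{u}_j$ that need to be integrated
out. The log likelihood for cluster $j$ upon integrating out these
unobservables is then

$$L_j(\boldsymbol{\beta}, \boldsymbol{\Sigma}_u) = \ln \int \left[ \prod_{i=1}^{N_j} f\{y_{ij}|g^{-1}(\mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\mathbf{u}_j)\} \right] \times \phi(\mathbf{u}_j|\mathbf{0}, \boldsymbol{\Sigma}_u)\, d\mathbf{u}_j \qquad (23.2)$$

where $\phi(\mathbf{z}|\mathbf{0}, \boldsymbol{\Sigma})$ denotes the density of
the multivariate normal with mean $\mathbf{0}$ and variance $\boldsymbol{\Sigma}$.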


Consistent estimation of $\boldsymbol{\beta}$ will require correct specification
of all three of the density $f(\cdot)$, the link function $g(\cdot)$, and the
distribution of $\mathbf{u}_j$. By contrast, in the simpler GLM without mixed
effects, or in the linear mixed-effects model, consistent estimation of
$\boldsymbol{\beta}$ requires only correct specification of the conditional
mean, that is, of the link function. Unlike the linear case, the integral in
(23.2) is multidimensional with no closed-form solution. Numerical computation
can be challenging, especially if $\mathbf{u}_j$ is high dimensional.


The meglm command has format similar to the mixed command presented in
section 6.7.3. The specific GLM fit is defined using the family() and
link() options.


Also, meglm provides six numerical integration methods:


1. the mean-and-variance adaptive Gauss-Hermite quadrature (option
intmethod(mvaghermite), which is the default);

2. the mode-and-curvature adaptive Gauss-Hermite quadrature
(intmethod(mcaghermite));

3. the Pinheiro-Chao mode-curvature adaptive Gauss-Hermite quadrature
(intmethod(pcaghermite));

4. the nonadaptive Gauss-Hermite quadrature (intmethod(ghermite));

5. the Laplacian approximation (intmethod(laplace)); and

6. the Pinheiro-Chao Laplacian approximation (intmethod(pclaplace)).


See section 5.5.1 for a brief introduction. 


Methods 1 and 2 are preferred, while methods 3 to 6 may provide better
starting values for methods 1 and 2 if there are convergence problems.
Techniques pcaghermite and pclaplace can speed up computation but are
available only with family(binomial) and family(bernoulli) combined
with link(logit) and with family(poisson). Option intpoints(#) enables
changing the number of evaluation points in methods 1-4 from the default of 7.
Choosing a lower number speeds up model fitting and improves convergence
somewhat but at the expense of less accuracy.
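To make the role of the evaluation points concrete, the following is a minimal Python sketch of nonadaptive Gauss-Hermite quadrature applied to the cluster log likelihood of a random-intercept Poisson model. It is an illustration only, not Stata's implementation: the default mvaghermite method is adaptive, and the function names (hermite, gauss_hermite, normal_expectation, cluster_loglik) are hypothetical. The nodes are found by bisection on the Hermite polynomial rather than by a library call so that the sketch needs only the standard library.

```python
import math

def hermite(n, x):
    """Physicists' Hermite polynomial H_n(x) from the three-term recurrence."""
    if n == 0:
        return 1.0
    h_prev, h = 1.0, 2.0 * x
    for k in range(1, n):
        h_prev, h = h, 2.0 * x * h - 2.0 * k * h_prev
    return h

def gauss_hermite(n):
    """Nodes x_k and weights w_k so that sum(w_k f(x_k)) approximates
    the integral of exp(-x^2) f(x) over the real line."""
    bound = math.sqrt(2.0 * n + 1.0)   # all roots of H_n lie inside (-bound, bound)
    steps = 2001                       # odd, so x = 0 is never a grid point
    grid = [-bound + 2.0 * bound * k / steps for k in range(steps + 1)]
    nodes = []
    for a, b in zip(grid[:-1], grid[1:]):
        if hermite(n, a) * hermite(n, b) < 0.0:   # sign change brackets one root
            for _ in range(100):                  # plain bisection
                mid = 0.5 * (a + b)
                if hermite(n, a) * hermite(n, mid) <= 0.0:
                    b = mid
                else:
                    a = mid
            nodes.append(0.5 * (a + b))
    norm = 2.0 ** (n - 1) * math.factorial(n) * math.sqrt(math.pi) / n ** 2
    weights = [norm / hermite(n - 1, x) ** 2 for x in nodes]
    return nodes, weights

def normal_expectation(f, n=7):
    """Approximate E[f(u)] for u ~ N(0, 1) using n evaluation points."""
    nodes, weights = gauss_hermite(n)
    total = sum(w * f(math.sqrt(2.0) * x) for x, w in zip(nodes, weights))
    return total / math.sqrt(math.pi)

def poisson_pmf(y, mu):
    return math.exp(-mu) * mu ** y / math.factorial(y)

def cluster_loglik(ys, etas, sigma, n=7):
    """Log likelihood of one cluster: counts ys, fixed-part indices etas = x'beta,
    and a random intercept with standard deviation sigma (illustrative values)."""
    def conditional_lik(u):
        lik = 1.0
        for y, eta in zip(ys, etas):
            lik *= poisson_pmf(y, math.exp(eta + sigma * u))
        return lik
    return math.log(normal_expectation(conditional_lik, n))

# Made-up data for one small cluster; fewer points is faster but less accurate.
print(cluster_loglik([1, 0, 2], [0.1, -0.2, 0.3], sigma=0.6, n=7))
print(cluster_loglik([1, 0, 2], [0.1, -0.2, 0.3], sigma=0.6, n=3))
```

With a vector of random effects, the same idea requires a grid over every dimension, which is one reason computation becomes demanding as the number of random effects grows.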


23.4.2 Poisson and negative binomial mixed-effects example 


As a nonlinear two-level mixed-effects example, we consider Poisson
regression using a dataset with clustering introduced in section 6.4. The
dependent variable is the number of pharmacy visits (pharvis), and the
regressors are an indicator for illness (illness) and log total household
expenditure (lnhhexp).


. * Read in data and drop a few observations
. qui use mus206vlss, clear

. drop if missing(lnhhexp) | (lnhhexp > 2.579681 & lnhhexp < 2.579683)
(13 observations deleted)

. summarize pharvis lnhhexp illness commune

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     pharvis |     27,753    .5110439    1.312606          0         30
     lnhhexp |     27,753     2.60262    .6245493   .0467014   5.405502
     illness |     27,753    .6218427    .8995018          0          9
     commune |     27,753     101.514    56.27264          1        194


Individuals live in one of 194 communes, and we expect individual
responses to vary by commune. So we specify commune as a second level and
allow the intercept and the coefficient of illness to vary at this second level.
Then $y_{ij}$ has conditional mean


$$\exp\{(\beta_1 + u_{1j}) + (\beta_2 + u_{2j}) \times \text{illness}_{ij} + \beta_3 \times \text{lnhhexp}_{ij}\}$$


where $u_{1j}$ and $u_{2j}$ are zero-mean normally distributed random variables that are
independently drawn across the categories $j$, which here correspond to the
different communes.


The mixed-effects Poisson model can be fit using the meglm command with
options family(poisson) and link(log). Alternatively, and more simply, the
equivalent mepoisson command can be used. We obtain


. * Mixed model: Poisson, 2nd level commune, coeff of intercept and illness vary 


. mepoisson pharvis lnhhexp illness || commune: illness, vce(robust) 


Fitting fixed-effects model:

Iteration 0:  log likelihood = -71403.382
Iteration 1:  log likelihood =  -26812.91
Iteration 2:  log likelihood =  -26219.27
Iteration 3:  log likelihood = -26217.797
Iteration 4:  log likelihood = -26217.797

Refining starting values:

Grid node 0:  log likelihood = -23897.571

Fitting full model:

Iteration 0:  log pseudolikelihood = -23897.571  (not concave)
Iteration 1:  log pseudolikelihood = -23867.866  (not concave)
Iteration 2:  log pseudolikelihood = -23840.085  (not concave)
Iteration 3:  log pseudolikelihood = -23818.108
Iteration 4:  log pseudolikelihood = -23803.899
Iteration 5:  log pseudolikelihood = -23777.049
Iteration 6:  log pseudolikelihood =  -23759.62
Iteration 7:  log pseudolikelihood = -23759.194
Iteration 8:  log pseudolikelihood = -23759.192

Mixed-effects Poisson regression                Number of obs     =     27,753
Group variable: commune                         Number of groups  =        194

                                                Obs per group:
                                                              min =         51
                                                              avg =      143.1
                                                              max =        206

Integration method: mvaghermite                 Integration pts.  =          7

                                                Wald chi2(2)      =     261.87
Log pseudolikelihood = -23759.192               Prob > chi2       =     0.0000

                              (Std. err. adjusted for 194 clusters in commune)
------------------------------------------------------------------------------
             |               Robust
     pharvis | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     lnhhexp |  -.1287277   .0410759    -3.13   0.002     -.209235   -.0482204
     illness |   .9563129   .0613635    15.58   0.000     .8360427    1.076583
       _cons |  -1.305269   .1289779   -10.12   0.000    -1.558061   -1.052477
-------------+----------------------------------------------------------------
commune      |
 var(illness)|   .1797316   .0520961                      .1018356    .3172119
   var(_cons)|    .404467   .0681602                      .2906965     .562764
------------------------------------------------------------------------------


The first iterations that are described as fitting what the mixed-models literature
calls a fixed-effects model simply perform the standard Poisson regression.
Subsequent iterations are initially over a nonconcave portion of the objective
function. Here the defaults of mean-and-variance adaptive Gauss-Hermite
quadrature with seven evaluation points are used. The default for the mepoisson
command is to restrict the two random effects to be independent of each other;
the option covariance(unstructured) allows the random effects to be
correlated.


There is substantial variation across communes in the coefficient of illness
because the commune-specific random effect of illness has standard deviation
of $\sqrt{0.1797316} = 0.424$ compared with the fixed coefficient of illness that
has value of 0.956.


A Wald test of the joint statistical significance of the random coefficients, a
joint test of whether $\mathrm{Var}(u_{1j}) = 0$ and $\mathrm{Var}(u_{2j}) = 0$, yields


. * Wald test of the random coefficients
. test ([/]var(illness[commune])=0) ([/]var(_cons[commune])=0)

 ( 1)  [/]var(illness[commune]) = 0
 ( 2)  [/]var(_cons[commune]) = 0

           chi2(  2) =   73.25
         Prob > chi2 =    0.0000


Because of the one-sided nature of the test (the variances cannot be negative),
the true distribution of this test statistic is complicated. Using $\chi^2(2)$
critical values yields a reported p-value that is an upper bound for the true
p-value, so the test is conservative. Because p = 0.000, we definitely find
statistical significance at level 0.05.


We next fit a Poisson model and ME Poisson and ME negative binomial 
models with different estimates of standard errors. 


. * Mixed model: Compute various models and standard errors
. qui poisson pharvis lnhhexp illness, vce(cluster commune)
. estimates store p_clu
. qui mepoisson pharvis lnhhexp illness || commune: illness
. estimates store mep_def
. qui mepoisson pharvis lnhhexp illness || commune: illness, vce(cluster commune)
. estimates store mep_rob
. qui menbreg pharvis lnhhexp illness || commune: illness
. estimates store menb_def
. qui menbreg pharvis lnhhexp illness || commune: illness, vce(robust)
. estimates store menb_rob


The following table presents the results: 


. * Mixed model: Compare with standard Poisson and get robust standard errors
. estimates table p_clu mep_def mep_rob menb_def menb_rob,
>     eq(1) b(%8.4f) se stfmt(%8.1f) stats(ll N aic chi2_c df_c)

          Variable |       p_clu     mep_def     mep_rob    menb_def    menb_rob
-------------------+------------------------------------------------------------
#1                 |
           lnhhexp |      0.0175     -0.1287     -0.1287     -0.0875     -0.0875
                   |      0.0473      0.0188      0.0411      0.0277      0.0405
           illness |      0.6806      0.9563      0.9563      1.3292      1.3292
                   |      0.0287      0.0319      0.0614      0.0461      0.0575
             _cons |     -1.4121     -1.3053     -1.3053     -1.9047     -1.9047
                   |      0.1451      0.0681      0.1290      0.0878      0.1248
-------------------+------------------------------------------------------------
 var(il~ss[com~e]) |                  0.1797      0.1797      0.3216      0.3216
                   |                  0.0254      0.0521      0.0438      0.0538
 var(_cons[com~e]) |                  0.4045      0.4045      0.3542      0.3542
                   |                  0.0497      0.0682      0.0480      0.0476
          /lnalpha |                                          0.1792      0.1792
                   |                                          0.0276      0.0603
-------------------+------------------------------------------------------------
Statistics         |
                ll |    -26217.8    -23759.2    -23759.2    -21116.2    -21116.2
                 N |       27753       27753       27753       27753       27753
               aic |     52441.6     47528.4     47528.4     42244.5     42244.5
            chi2_c |                  4917.2      4917.2      1753.1      1753.1
              df_c |                     2.0         2.0         2.0         2.0
--------------------------------------------------------------------------------
                                                                    Legend: b/se


Comparing the first and second columns, we see considerable difference in
the estimates of the main model slope parameters. Unlike linear models,
introducing random effects in a nonlinear model changes the functional form for
$E(y_{ij}|x_{ij})$.


The vce(robust) option of the mepoisson command obtains cluster-robust
standard errors that cluster on commune. Comparing the second and third
columns, we see that the default standard errors are greatly understated and the
vce(robust) option should be used. The fourth and fifth columns provide
estimates for a mixed negative binomial model, using the menbreg command, and
even though the negative binomial may better control for the overdispersion
typical of count data, the vce(robust) option should be used. The information
criteria strongly favor the mixed negative binomial model.
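As a check on the model-comparison statistics, the short Python sketch below recomputes the AIC as $-2 \ln L + 2q$ and the likelihood-ratio statistic for the Poisson model against the mixed-effects Poisson model from the reported log likelihoods. The parameter counts q (3, 5, and 6) are an assumption read off the rows of the table (three mean parameters, plus two variances, plus lnalpha), and the log likelihoods are the rounded values shown, so results agree with the table only up to rounding.

```python
def aic(loglik, n_params):
    """Akaike information criterion: -2 ln L + 2 q."""
    return -2.0 * loglik + 2.0 * n_params

# (log likelihood, assumed number of parameters) taken from the estimates table
models = {
    "p_clu": (-26217.8, 3),     # Poisson: lnhhexp, illness, _cons
    "mep_def": (-23759.2, 5),   # adds var(illness) and var(_cons)
    "menb_def": (-21116.2, 6),  # adds lnalpha
}

for name, (ll, q) in models.items():
    print(name, round(aic(ll, q), 1))

# LR statistic of mixed-effects Poisson versus Poisson (two variance restrictions)
lr_stat = 2.0 * (models["mep_def"][0] - models["p_clu"][0])
print("LR statistic:", round(lr_stat, 1))
```

The smaller the AIC, the more the criterion favors the model, which is the sense in which the mixed negative binomial model is preferred here.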


23.4.3 Prediction and MEs 


The random effects are often of intrinsic interest, and even if that is not the case, 
computation of conditional mean predictions and MEs requires control for the 
random effects. 


We first illustrate predict. The reffects and ebmeans options obtain
predictions of the two random effects for each of the 194 communes using the
posterior means of the random effects. The mu and conditional() options
obtain the conditional mean $E(y_{ij}|x_{ij}, z_{ij}, u_j)$, where $u_j$ may be set to 0
(conditional(fixedonly)), set to the posterior means
(conditional(ebmeans)), or set to the posterior modes
(conditional(ebmodes)). The mu and marginal options numerically integrate
out the random effects and compute $E(y_{ij}|x_{ij}, z_{ij})$. The formulas for these
methods are detailed in Methods and formulas in [ME] meglm postestimation.
We have


. * Various predictions of random effects and of mean
. estimates restore mep_rob
(results mep_rob are active now)

. predict re_ebmeans*, reffects ebmeans
(calculating posterior means of random effects)
(using 7 quadrature points)

. predict mu_fixed, mu conditional(fixedonly)

. predict mu_ebmeans, mu conditional(ebmeans)
(predictions based on fixed effects and posterior means of random effects)
(using 7 quadrature points)

. predict mu_ebmode, mu conditional(ebmode)
(predictions based on fixed effects and posterior modes of random effects)

. predict mu_marg, mu marginal

. summarize re_ebmeans* pharvis mu_fixed mu_ebmeans mu_ebmode mu_marg

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
 re_ebmeans1 |     27,753   -.0551694    .3669602  -.6321057   1.159421
 re_ebmeans2 |     27,753   -.0723159    .6324277  -1.792576   1.307337
     pharvis |     27,753    .5110439    1.312606          0         30
    mu_fixed |     27,753     .694688    6.587993   .1351856   1023.213
  mu_ebmeans |     27,753     .506191    .8097099   .0330438   32.45189
-------------+---------------------------------------------------------
   mu_ebmode |     27,753    .5121716    .8178656   .0349858   32.58508
     mu_marg |     27,753    69.43721    10901.27   .1654853    1815856


The first two rows list the posterior mean estimates of the random effects for the
intercept and for illness. Prediction of the mean evaluated at either posterior
means or posterior modes of the random effects gives an average prediction
close to the sample average of the dependent variable, while prediction setting
the random effects to 0 in this nonlinear model does not, because
$E(y_{ij}|x_{ij}, z_{ij}, u_j = 0) = \exp(x_{ij}'\beta) \neq E(y_{ij}|x_{ij}, z_{ij})$. The final row shows
that there is clearly a problem in this example with estimates of $E(y_{ij}|x_{ij}, z_{ij})$
that differ greatly from the sample average of the dependent
variable. This is not pursued further for this pedagogical example.
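The inequality between the two conditional means can be made explicit with a short derivation. It is a sketch under the assumptions of the fitted model only: $u_{1j}$ and $u_{2j}$ are taken to be independent zero-mean normal variables with variances $\sigma_1^2$ and $\sigma_2^2$ (symbols introduced here for illustration), and the result uses the mean of a lognormal variable.

```latex
\begin{aligned}
E(y_{ij}|x_{ij}, z_{ij})
  &= \exp(x_{ij}'\beta)\, E\{\exp(u_{1j} + u_{2j} \times \text{illness}_{ij})\} \\
  &= \exp(x_{ij}'\beta)\, \exp\{(\sigma_1^2 + \sigma_2^2 \times \text{illness}_{ij}^2)/2\}
\end{aligned}
```

The second factor exceeds one whenever either variance is positive, so the marginal mean exceeds $\exp(x_{ij}'\beta)$, and the gap grows with the square of illness. This illustrates why marginal predictions can be sensitive to large values of illness; it is not offered as a diagnosis of the problem noted above.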


For margins and MEs, interest lies in $E(y_{ij}|x_{ij}, z_{ij})$, which can be obtained
using the marginal option of the margins postestimation command. The
alternative is the conditional(fixedonly) option that uses
$E(y_{ij}|x_{ij}, z_{ij}, u_j = 0)$. As already noted, prediction using the marginal option
had problems, so instead, for illustrative purposes, we compute AMEs when the
random effects are set to 0. We have


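. margins, dydx(*) predict(mu conditional(fixedonly))

Average marginal effects                                Number of obs = 27,753
Model VCE: Robust

Expression: Predicted mean, fixed portion only, predict(mu conditional(fixedonly))
dy/dx wrt:  lnhhexp illness

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     lnhhexp |  -.0894256   .0328357    -2.72   0.006    -.1537824   -.0250687
     illness |   .6643391   .1717332     3.87   0.000     .3277483     1.00093
------------------------------------------------------------------------------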


23.5 Linear structural equation models 


Structural equation models (SEMs) describe a very wide range of models, 
including standard regression models and standard latent variable models. 
The SEM framework goes by various names in various disciplines, including
covariance structure analysis, causal models, path analysis, and latent 
variable modeling. Unlike FMMs and mixed models, SEMs handle endogenous 
regressors. 


Models that fall into the SEM class and that can be estimated using the 
sem command, with the usual Stata command given in parentheses, include 
linear regression (regress) and seemingly unrelated equations (sureg); 
latent variable models such as linear mixed models (mixed) and factor 
analysis (factor); and linear models with endogenous regressors such as
single-equation instrumental-variables (IV) (ivregress) and multiequation
simultaneous equations models (reg3).


Stata provides two commands. Command sem is for linear models and 
enables estimation of richer variants of the preceding models, with 
endogenous variables and latent variables, and permits estimation of some 
basic models for which there are no specific Stata commands, such as 
measurement-error models. 


Command gsem for generalized structural equation models (GSEMs) 
extends the methods to a wide class of nonlinear models, allows use of 
factor-variable notation, and can be used to estimate structural systems of 
nonlinear equations. For example, it provides a structural way to specify and 
fit a count model with an endogenous regressor. The gsem command is 
presented in section 23.6. 


In both cases, SEMs place restrictions on the mean vector and covariance 
matrix of observable variables. Estimation is then by ML under the 
assumption of joint normality of observables and unobservable latent 
variables. For linear SEMs, an alternative estimator relaxes the normality 
assumption to obtain parameter estimates that set sample first and second 
moments close to the first and second moments implied by the SEM. 


Several caveats are in order. First, it is possible to specify models that are 
not identified, such as in the well-known case of simultaneous equations 
models. 


Second, SEMs involve strong parametric assumptions on unobservables. 
Linear SEMs are robust to nonnormality of observable variables, provided
their mean vector and variance matrix are correctly specified. But SEMs that
include latent variables may be inconsistently estimated if the latent 
variables are not normally distributed. And for GSEMs, nonnormality leads to 
estimator inconsistency. 


Third, estimation of more complex models, particularly GSEMs, can be 
computationally challenging, and some additional effort, such as choosing 
good starting values, may be necessary in order for estimators to converge. 


The SEM framework is used extensively in social sciences other than 
economics. An attraction to many practitioners is that models can be 
represented without the use of algebra. Instead, models are specified using 
path diagrams that outline the links among and between observed variables 
and latent variables. Stata provides an interactive SEM builder that facilitates 
drawing path analysis diagrams. 


In this section, we provide an introduction to SEMs oriented to
econometricians unfamiliar with SEMs, followed by a measurement-error
example that introduces a latent variable, and a brief discussion of models
with endogenous regressors. We do not cover use of SEM for linear mixed
models. We conclude with a general presentation of the estimation methods.


23.5.1 SEMs for linear regression 


We explain the SEM approach using the familiar linear regression model 


$$y_i = \alpha + x_i'\beta + \varepsilon_i$$


where here the intercept is deliberately treated separately from the slope
coefficients, $x_i$ is a $(K - 1) \times 1$ vector, and we assume independent
observations.


The usual approach taken by econometricians is to model $y_i$ conditional
on $x_i$. Estimation is based either on the conditional mean $E(y_i|x_i)$ and
conditional variance $\mathrm{Var}(y_i|x_i)$ or on the conditional density $f(y_i|x_i)$.


The SEM approach instead models $(y_i, x_i)$ jointly. Estimation is based
either on the unconditional means [$E(y_i)$ and $E(x_i)$] and variances and
covariance of $y_i$ and $x_i$ or on the unconditional joint density $f(y_i, x_i)$.


Conditional on regressors approach 


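Taking a conditional mean and variance approach, and assuming
homoskedastic errors, a moment-based approach yields the ordinary
least-squares (OLS) estimator, with


$$\hat{\beta}_{OLS} = \left[ \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})' \right]^{-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$
$$\hat{\alpha}_{OLS} = \bar{y} - \bar{x}'\hat{\beta}_{OLS}$$


A fully parametric conditional approach specifies the conditional
distribution of $y_i$ given $x_i$. If $\varepsilon_i|x_i \sim N(0, \sigma^2)$, then
$y_i|x_i \sim N(\alpha + x_i'\beta, \sigma^2)$. The ML estimates $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\sigma}^2$ maximize the
associated log-likelihood function. The ML estimates of $\alpha$ and $\beta$ turn out to
be identical to the OLS estimates, while $\hat{\sigma}^2 = (1/N) \sum_{i=1}^{N} (y_i - \hat{\alpha} - x_i'\hat{\beta})^2$.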


Moment-based estimation for SEMs 


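The moment-based variant of the SEM approach is based on mean and
variance assumptions. We add the assumption that $x_i$ has mean $\mu_x$ and
variance matrix $\Sigma_{xx}$. As before, we assume that $y_i|x_i$ has conditional mean
$\alpha + x_i'\beta$ and conditional variance $\sigma^2$.


These assumptions imply that unconditionally, $y_i$ has mean
$\mu_y = \alpha + \mu_x'\beta$ and variance $\sigma_{yy} = \beta'\Sigma_{xx}\beta + \sigma^2$ and that
$\mathrm{Cov}(y_i, x_i) = \Sigma_{xx}\beta$. It follows that the covariance matrix of $(y_i, x_i)$ is


$$\Sigma = \begin{bmatrix} \sigma_{yy} & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_{xx} \end{bmatrix} = \begin{bmatrix} \beta'\Sigma_{xx}\beta + \sigma^2 & \beta'\Sigma_{xx} \\ \Sigma_{xx}\beta & \Sigma_{xx} \end{bmatrix} \qquad (23.3)$$


The lower-left entry in (23.3) implies that $\Sigma_{xy} = \Sigma_{xx}\beta$ or that
$\beta = \Sigma_{xx}^{-1}\Sigma_{xy}$. If we replace the covariance matrices $\Sigma_{xx}$ and $\Sigma_{xy}$ by their
usual sample estimates, we obtain


$$\hat{\beta} = \hat{\Sigma}_{xx}^{-1}\hat{\Sigma}_{xy}$$


But this is just the OLS estimator of $\beta$. The upper-left entry in (23.3) implies
that $\sigma^2 = \sigma_{yy} - \beta'\Sigma_{xx}\beta$, which after some algebra can be shown to yield
$\hat{\sigma}^2 = (1/N) \sum_{i=1}^{N} (y_i - \hat{\alpha} - x_i'\hat{\beta})^2$. The mean vector of $(y_i, x_i)$ is


$$\mu = \begin{bmatrix} \mu_y \\ \mu_x \end{bmatrix} = \begin{bmatrix} \alpha + \mu_x'\beta \\ \mu_x \end{bmatrix} \qquad (23.4)$$


The upper entry implies $\alpha = \mu_y - \mu_x'\beta$. If we replace the means $\mu_y$ and $\mu_x$
by their usual sample estimates, we obtain $\hat{\alpha} = \bar{y} - \bar{x}'\hat{\beta}$.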


To summarize, the restrictions on the covariance matrix of $(y_i, x_i)$ led to
estimates of $\beta$ and $\sigma^2$, and restrictions on the mean of $(y_i, x_i)$ led to an
estimate of $\alpha$. The Stata documentation refers to this moment-based SEM
approach as an asymptotic distribution free method because consistency
requires only correct specification of means, variance, and covariances.
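The moment argument can be verified numerically in the scalar-regressor case. The following Python sketch uses simulated data (the seed, sample size, and true parameter values are arbitrary assumptions) and solves the three moment restrictions directly; it then confirms that the implied variance estimate equals the mean squared OLS residual. It illustrates the algebra only and does not reproduce how sem computes its estimates.

```python
import random

random.seed(12345)                      # arbitrary seed for reproducibility
n = 500
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.5) for xi in x]   # alpha=1, beta=2

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance using divisor N (not N - 1)."""
    mu_u, mu_v = mean(u), mean(v)
    return sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / len(u)

# Covariance restrictions: s_xy = s_xx * beta and s_yy = beta^2 * s_xx + sigma^2
beta_hat = cov(x, y) / cov(x, x)
sigma2_hat = cov(y, y) - beta_hat ** 2 * cov(x, x)

# Mean restriction: mu_y = alpha + mu_x * beta
alpha_hat = mean(y) - mean(x) * beta_hat

# Mean squared OLS residual, for comparison with sigma2_hat
resid_ms = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y)) / n

print(alpha_hat, beta_hat, sigma2_hat, resid_ms)
```

The last two printed numbers coincide up to floating-point error, which is the scalar version of the "after some algebra" step in the text.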


Maximum likelihood estimation for SEMs 


A second SEM approach, the sem default, is fully distributional. We suppose
that $(y_i, x_i)$ is joint normally distributed with the variance matrix given in
(23.3).


Define the $K$-dimensional data vector $z_i = (y_i, x_i')'$ and the $(K + 1)$-dimensional
parameter vector $\theta = (\beta', \alpha, \sigma^2)'$, and let $\Sigma(\theta)$ denote the final
matrix in (23.3) and $\mu(\theta)$ denote the final vector in (23.4). Then the
log-likelihood function for $N$ independent observations is


$$\ln L(\theta) = -\frac{1}{2} \left[ NK \ln 2\pi + N \ln|\Sigma(\theta)| + \sum_{i=1}^{N} \{z_i - \mu(\theta)\}' \Sigma(\theta)^{-1} \{z_i - \mu(\theta)\} \right] \qquad (23.5)$$


Maximization with respect to $\theta = (\beta', \alpha, \sigma^2)'$ can be shown to again yield
the OLS estimates $\hat{\beta}_{OLS}$ and $\hat{\alpha}_{OLS}$, and $\hat{\sigma}^2 = (1/N) \sum_{i=1}^{N} (y_i - \hat{\alpha} - x_i'\hat{\beta})^2$.


While joint normality is assumed to obtain $\ln L(\theta)$ in (23.5), the
estimator is a quasi-maximum likelihood estimator (quasi-MLE) that is consistent
provided only that $\mu(\theta)$ and $\Sigma(\theta)$ are correctly specified.
Heteroskedastic-robust and cluster-robust standard errors can be used rather than default
standard errors that assume homoskedasticity.


23.5.2 SEM linear regression example 


Before providing more details on the sem command, we provide a linear 
regression example, using the same doctor-visits example as in section 20.3 
but treating the dependent variable docvis as a continuous variable, rather 
than as a count. In section 23.6, we will use the gsem command and analyze 
doctor visits as a count. 


Using the sem command to fit a linear regression model, we obtain 


. * SEM method ml: Linear regression example
. qui use mus220mepsdocvis, clear

. sem (docvis <- private medicaid age age2 educyr actlim totchr)

Endogenous variables
  Observed: docvis

Exogenous variables
  Observed: private medicaid age age2 educyr actlim totchr

Fitting target model:
Iteration 0:  log likelihood = -64884.907
Iteration 1:  log likelihood = -64884.907

Structural equation model                                Number of obs = 3,677
Estimation method: ml

Log likelihood = -64884.907

-------------------------------------------------------------------------------
              |                 OIM
              | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
Structural    |
  docvis      |
      private |   .9193268   .2487285     3.70   0.000     .4318279    1.406826
     medicaid |   .6333396   .3417109     1.85   0.064    -.0364015    1.303081
          age |   1.963495   .4464282     4.40   0.000     1.088512    2.838478
         age2 |  -.0129913   .0029699    -4.37   0.000    -.0188121   -.0071705
       educyr |   .1915361   .0321189     5.96   0.000     .1285842    .2544879
       actlim |   1.369498   .2675973     5.12   0.000     .8450171    1.893979
       totchr |   1.883071   .0892863    21.09   0.000     1.708073    2.058069
        _cons |  -73.44753   16.68711    -4.40   0.000    -106.1537    -40.7414
--------------+----------------------------------------------------------------
var(e.docvis) |   45.86223   1.069605                      43.81304    48.00727
-------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0) = 0.00               Prob > chi2 = .


The sem command distinguishes between endogenous and exogenous 
variables; here the dependent variable docvis is the endogenous variable. 
The default estimation method for the sem command is MLE under joint 
normality of all observable variables, here docvis and the seven regressors. 
The default standard errors use minus the inverse of the observed Hessian. 
The likelihood-ratio (LR) test given at the end is appropriate if the model is 
overidentified. Here the model is just identified, so the test is not applicable. 


The following code compares 1) sem MLE (default method ml) with
default variance-covariance matrix of the estimator (VCE); 2) sem MLE with
robust VCE; 3) sem estimated using just mean and variance restrictions
(method adf); 4) OLS with default VCE; and 5) OLS with robust VCE. Here by
"robust", we mean heteroskedastic robust. A robust VCE is not available for
sem with method adf.


. * SEM methods compared with OLS: Linear regression example
. qui sem (docvis <- private medicaid age age2 educyr actlim totchr)
. estimates store SEMMLE
. qui sem (docvis <- private medicaid age age2 educyr actlim totchr), vce(robust)
. estimates store SEMMLErob
. qui sem (docvis <- private medicaid age age2 educyr actlim totchr), method(adf)
. estimates store SEMADF
. qui regress docvis private medicaid age age2 educyr actlim totchr
. estimates store OLS
. qui regress docvis private medicaid age age2 educyr actlim totchr, vce(robust)
. estimates store OLSrob


We obtain the following parameter estimates, standard errors, and 
associated diagnostic statistics. 


. * SEM methods compared with OLS: Table of estimates
. estimates table SEMMLE SEMMLErob SEMADF OLS OLSrob,
>     eq(1) b(%10.4f) se stfmt(%10.2f) stats(rmse r2 ll critvalue N)

      Variable |      SEMMLE   SEMMLErob      SEMADF         OLS      OLSrob
---------------+------------------------------------------------------------
#1             |
       private |      0.9193      0.9193      0.9193      0.9193      0.9193
               |      0.2487      0.2386      0.2374      0.2490      0.2388
      medicaid |      0.6333      0.6333      0.6333      0.6333      0.6333
               |      0.3417      0.4090      0.4067      0.3421      0.4094
           age |      1.9635      1.9635      1.9635      1.9635      1.9635
               |      0.4464      0.4049      0.4021      0.4469      0.4052
          age2 |     -0.0130     -0.0130     -0.0130     -0.0130     -0.0130
               |      0.0030      0.0027      0.0027      0.0030      0.0027
        educyr |      0.1915      0.1915      0.1915      0.1915      0.1915
               |      0.0321      0.0318      0.0315      0.0322      0.0318
        actlim |      1.3695      1.3695      1.3695      1.3695      1.3695
               |      0.2676      0.2946      0.2934      0.2679      0.2948
        totchr |      1.8831      1.8831      1.8831      1.8831      1.8831
               |      0.0893      0.1017      0.1011      0.0894      0.1018
         _cons |    -73.4475    -73.4475    -73.4475    -73.4475    -73.4475
               |     16.6871     15.1800     15.0763     16.7053     15.1945
---------------+------------------------------------------------------------
 var(e.docvis) |     45.8622     45.8622     45.8622
               |      1.0696      5.9428      5.8999
---------------+------------------------------------------------------------
Statistics     |
          rmse |                                            6.78        6.78
            r2 |                                            0.16        0.16
            ll |   -64884.91   -64884.91               -12250.88   -12250.88
     critvalue |   -64884.91   -64884.91        0.00
             N |        3677        3677        3677        3677        3677
----------------------------------------------------------------------------
                                                                Legend: b/se


As expected, for linear regression, the two SEM methods of estimation are
equivalent to OLS, so coefficient estimates are the same in all five columns.
The default standard errors for the two SEM methods and OLS differ very
slightly because of different degrees-of-freedom corrections. Similarly, the
two sets of robust standard errors differ very slightly because of different
degrees-of-freedom corrections.


The SEM estimate of the model error variance $\sigma^2$ is 45.8622. For OLS, the
rmse is an estimate of $\sigma$ and equals 6.7795. Then $6.7795^2 = 45.9616$, which
differs slightly from 45.8622 because of different degrees-of-freedom
correction.
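The size of the degrees-of-freedom difference can be checked with simple arithmetic. The Python sketch below assumes that sem divides the residual sum of squares by N while regress divides by N - K, with K = 8 estimated coefficients (seven slopes and the intercept); under that assumption the two variance estimates differ by the factor N/(N - K), and default standard errors by its square root. The inputs are the rounded values reported in the output.

```python
import math

n_obs = 3677          # number of observations
n_coef = 8            # seven slope coefficients plus the intercept
sem_var = 45.8622     # var(e.docvis) from sem
ols_rmse = 6.7795     # root mean squared error from regress

scale = n_obs / (n_obs - n_coef)

# Rescale the sem variance estimate to the regress degrees-of-freedom convention
ols_var_from_sem = sem_var * scale
print(ols_var_from_sem, ols_rmse ** 2)

# The same factor links the default standard errors, for example for private
se_private_from_sem = 0.2487 * math.sqrt(scale)
print(se_private_from_sem)   # compare with the OLS default of 0.2490
```

Any remaining discrepancy in the last digits reflects rounding of the reported inputs.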


The log likelihoods following the sem and regress commands differ
greatly. This is because the regress log likelihood is based on the
conditional distribution of $y_i$ given $x_i$, whereas the ML version of sem log
likelihood is based on the joint distribution of $y_i$ and $x_i$.


The adf method for the sem command minimizes the objective function 
defined in (23.7) below. For a just-identified model, the case here, this 
objective function attains a minimum value of zero. 


23.5.3 SEM model specification 


SEMs have three components: 1) observed variables; 2) latent variables; and 
3) model errors. 


A latent variable is a variable that is not observed. Usually, it is a 
variable that ideally we would observe. For example, in a log-earnings 
regression, the latent variable might be true years of schooling, while 
observed schooling is measured with error. A model error is a special kind of 
latent variable that is purely random. 


A variable is exogenous if it is determined outside the system, so that no 
variable determines it but it may determine other variables. A variable is 
endogenous if it is determined by the system, so that it is determined in part 
by variables in the system. Observed variables and latent variables can be 
exogenous or endogenous, while model errors are always exogenous. Note 
that the SEM use of “endogenous” differs from the econometricians’ use of 
the term. In particular, an SEM endogenous regressor need not necessarily 
lead to endogeneity bias. 


An SEM can be represented mathematically. Then an endogenous variable 
is one that at some point appears on the left-hand side of an equation, while 
exogenous variables always appear on the right-hand side. 


An alternative way to represent an SEM, one very popular in the social
sciences, is using a path diagram. Then arrows are used to represent
relationships between variables. A variable is endogenous if any paths
(arrows) point to it; otherwise, it is exogenous. In a path diagram, observed
variables are represented using boxes, while latent variables are represented
using circles.


23.5.4 The sem command 


The sem command is very complicated. We provide a very brief summary.
The syntax is


sem paths [if] [in] [weight] [, options]


where paths defines the SEM in command-language path notation;
examples are given below.


The default is estimation by ML (method ml), in which case the vce()
options include vce(robust) and vce(cluster clustvar). An alternative is
moment-based estimation (method adf), which does not have the preceding
vce() options and instead computes standard errors based on the observed
Hessian (the default) or the expected Hessian. Options vce(bootstrap) and
vce(jackknife) are also available.


The basic sem command has default treatments of covariances, variances,
and means of latent variables. The covariance(), variance(), and mean()
options allow these to be overridden.


Latent variables are given names that must begin with a capital letter,
while the names of observed variables must begin with a lowercase letter.
For example, if variable x is observed, it is denoted x, while if it is a latent
variable, it is denoted X.


Every endogenous variable, whether latent or observed, has an error
attached to it that has prefix e. For example, the endogenous variable y
has error e.y, and the latent variable L has error e.L.


Latent variables require a normalization because these variables are not 
naturally scaled. These normalizations are presented in section 23.5.10. 


If variable x determines y, we can write either (y <- x) or (x -> y).
Paths can be combined so, for example, (y <- x1) (y <- x2) is equivalent
to (y <- x1 x2).


The default is to assume that 1) all exogenous variables (observed or
latent) are correlated with each other; 2) all error variables are uncorrelated
with each other; and 3) endogenous variables are not directly correlated with
each other, though they may be via their error. These defaults can be
overridden using the covariance() option. For example, cov(name1*name2@0)
is used to specify that name1 and name2 are uncorrelated variables,
while cov(name1*name2) is used if they are correlated.


The default is to assume that 1) all observed exogenous variables have 
nonzero mean; 2) all latent exogenous variables have zero mean; 3) all error 
variables have mean zero; and 4) endogenous variables have no separate 
mean. The default for latent exogenous variables and for observed 
exogenous variables in the special case of option noxconditional can be 
overridden using the mean() option. 


Constraints can be introduced using @ (the "at" symbol). An exact value
can be given, such as L@1, or a symbolic name such as L@c if, for example, an
equality constraint is used.


Equations for observed endogenous variables are assumed to have a 
constant. For latent endogenous variables, there is not a constant unless you 
ask for it. 


A recursive model is one that does not have any feedback loops, such as
(y1 <- y2) and (y2 <- y1), and does not have any error correlation, such
as cov(e.y1*e.y2). Recursive models are simple to fit and do not suffer
from potential simultaneity bias. Otherwise, the model is nonrecursive.


With $K$ observed variables, there are $K \times (K + 1)/2$ unique entries in
the sample covariance. Let $p$ be the number of model parameters, excluding
the intercepts and means. If $p > K \times (K + 1)/2$, the model is not identified,
and parameter estimates are meaningless. If $p = K \times (K + 1)/2$, the model
is just identified. If $p < K \times (K + 1)/2$, the model is overidentified, and
one can compute an overidentifying restrictions test.
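The counting rule is easy to apply mechanically. The Python sketch below (the function name sem_order_condition is hypothetical) applies it to the earlier doctor-visits regression, where K = 8 observed variables give 36 unique covariance entries. The parameter count of 36 is an assumption made for this illustration: 7 slopes, 1 error variance, and the 28 variances and covariances of the seven exogenous regressors. Equality of the two counts is consistent with the chi2(0) reported for the test of the model against the saturated model.

```python
def sem_order_condition(n_observed, n_params):
    """Compare the parameter count with the number of unique covariance entries.

    Returns the classification used in the text and the number of
    overidentifying restrictions (negative if the model is not identified).
    """
    n_moments = n_observed * (n_observed + 1) // 2
    if n_params > n_moments:
        status = "not identified"
    elif n_params == n_moments:
        status = "just identified"
    else:
        status = "overidentified"
    return status, n_moments - n_params

# docvis regressed on seven regressors: K = 8 observed variables
n_params = 7 + 1 + 7 * 8 // 2   # slopes + error variance + exogenous (co)variances
print(sem_order_condition(8, n_params))
```

As in the text, this is only a counting condition; it does not by itself establish that every parameter can be recovered from the moments.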


23.5.5 Some sem command examples 


Table 23.1 presents sem commands that yield the same results as some
leading standard Stata estimation commands. Additional examples include
linear mixed models.


Table 23.1. sem commands and equivalent standard Stata commands

Estimator                   Usual command and sem equivalent
------------------------------------------------------------------------------
Sample mean                 mean y
                            sem y

Correlation                 correlate y x
                            sem y x, standardized

t test                      ttest y, by(d)
                            sem y <- d

OLS regression              regress y x1 x2
                            sem y <- x1 x2

Multivariate regression     mvreg y1 y2 = x1 x2
                            sem y1 y2 <- x1 x2, cov(e.y1*e.y2)

Seemingly unrelated         sureg (y1 x1 x2) (y2 x1 x3), isure
regressions                 sem (y1 <- x1 x2) (y2 <- x1 x3), cov(e.y1*e.y2)

SUR with                    constraint 1 [y1]x1 = [y2]x1
parameter constraints       sureg (y1 x1 x2) (y2 x1 x3), constraint(1)
                            sem (y1 <- x1@a x2) (y2 <- x1@a x3), ///
                                cov(e.y1*e.y2)

IV regression               ivregress 2sls y1 (y2 = x2) x1
                            sem (y1 <- y2 x1) (y2 <- x1 x2), ///
                                cov(e.y1*e.y2)

Three-stage                 reg3 (y1 = y2 x1 x2 x3) (y2 = y1 x1 x4), sure
least squares (3SLS)        sem (y1 <- y2 x1 x2 x3) (y2 <- y1 x1 x4), ///
                                cov(e.y1*e.y2)
------------------------------------------------------------------------------


For SUR and two-stage least squares (2SLS), the default for the sem command
is to assume the two errors are uncorrelated, so the cov(e.y1*e.y2) option
needs to be added. Otherwise, sem would merely provide OLS estimates of
each equation, which would be inconsistent because of simultaneous
equations bias.


23.5.6 sem path builder 


Stata provides a path builder that enables one to draw a path analysis 
diagram. Given this path diagram, it automatically generates the sem 
command to fit the model. 


We provide two examples of path diagrams in the following analysis. 
Many examples are given in the Structural Equation Modeling Reference 
Manual. The interactive SEM Builder can be initiated in Stata from Statistics 
> SEM (structural equation modeling) > Model building and 
estimation. 


23.5.7 Measurement-error example 


For econometrics, the sem and gsem commands are especially useful for 
estimation of models with latent variables. 


To illustrate this, we consider a case where a regressor is measured with 
classical measurement error and we have several independent measures of 
the unobserved latent variable. We show the inconsistency of OLS and how 
the sem command can be used to obtain consistent estimates. We also show 
how the several independent measures can be viewed as being generated by 
a common latent variable. 


The true DGP is 

$$y_i = \beta_1 + \beta_2 z_i + \beta_3 x_i + u_i$$

The problem is that x is not observed. Instead, we have four measures 
of x, denoted x1, x2, x3, and x4, each observed with error. The first measure 
is generated by a classical measurement-error model as 

$$x_{1i} = x_i + \varepsilon_{1i}, \qquad \varepsilon_{1i} \sim (0, \sigma_1^2)$$

OLS regression of y on z and x1 will yield inconsistent estimates of $\beta_2$ and 
$\beta_3$ because of the measurement-error problem. The other three measures are 
similarly generated. Note that the measurement errors $\varepsilon_{1i}$, $\varepsilon_{2i}$, $\varepsilon_{3i}$, and $\varepsilon_{4i}$ 
are independent. We have 


. * SEM measurement error example: Generate the data 
. clear 

. qui set obs 1000 

. set seed 10101 

. gen x = rnormal(0,1) 

. gen z = x + rnormal(0,1) 

. gen y = 1 + 1*x + 1*z + rnormal(0,1) 

. gen x1 = x + rnormal(0,1) 

. gen x2 = x + rnormal(0,1) 

. gen x3 = x + rnormal(0,1) 

. gen x4 = x + rnormal(0,1) 


First, suppose that the true regressor x could be observed, so we can 
regress y on z and x. Using the sem command, we obtain 


. * SEM measurement error: OLS (using SEM) with true regressor x consistent 
. sem y <- z x, nolog vce(robust) 


Endogenous variables 
  Observed: y 

Exogenous variables 
  Observed: z x 

Structural equation model                        Number of obs = 1,000 
Estimation method: ml 

Log pseudolikelihood = -4277.4193 

                               Robust 
              Coefficient   std. err.       z   P>|z|    [95% conf. interval] 
Structural 
  y 
            z    1.057966    .0328901   32.17   0.000     .993503     1.12243 
            x    .9488263     .044741   21.21   0.000    .8611356    1.036517 
        _cons    1.024673    .0317183   32.31   0.000    .9625066     1.08684 

     var(e.y)    1.007609    .0451899                    .9228197    1.100189 


As expected, OLS is consistent, and all estimated coefficients are close to 
their DGP values of 1. 


Now, suppose we use the mismeasured regressor x1 and regress y on z 
and x1. We obtain 


. * SEM measurement error: OLS with mismeasured regressor x1 inconsistent 
. sem y <- z x1, nolog vce(robust) 

Endogenous variables 
  Observed: y 

Exogenous variables 
  Observed: z x1 

Structural equation model                        Number of obs = 1,000 
Estimation method: ml 

Log pseudolikelihood = -4919.2122 

                               Robust 
              Coefficient   std. err.       z   P>|z|    [95% conf. interval] 
Structural 
  y 
            z    1.371866    .0299465   45.81   0.000    1.313172     1.43056 
           x1    .3380016    .0300409   11.25   0.000    .2791224    .3968807 
        _cons    1.019155    .0361965   28.16   0.000    .9482112    1.090099 

     var(e.y)    1.305362    .0569675                    1.198349     1.42193 


Now OLS leads to inconsistent estimation. Note that the coefficient of the 
mismeasured regressor x1 is many standard errors from the DGP values of 1 
and that there is a spillover effect to the coefficient of z. The 
mismeasurement also leads to a poorer model fit because the estimated error 
variance is now substantially more than 1. 
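
The size of these biases can be worked out from the DGP. The following sketch uses only the population moments implied by the data-generating code, in which x and the errors added to form z, x1, and y are independent with unit variance. The symbols $b_z$ and $b_{x_1}$ for the probability limits of the OLS coefficients are introduced here for illustration.

```latex
\begin{aligned}
&\mathrm{Var}(z) = 2,\quad \mathrm{Var}(x_1) = 2,\quad \mathrm{Cov}(z, x_1) = 1,\quad
 \mathrm{Cov}(y, z) = 3,\quad \mathrm{Cov}(y, x_1) = 2,\quad \mathrm{Var}(y) = 6 \\[4pt]
&\begin{bmatrix} b_z \\ b_{x_1} \end{bmatrix}
 = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}^{-1}
   \begin{bmatrix} 3 \\ 2 \end{bmatrix}
 = \frac{1}{3}\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}
   \begin{bmatrix} 3 \\ 2 \end{bmatrix}
 = \begin{bmatrix} 4/3 \\ 1/3 \end{bmatrix} \\[4pt]
&\text{residual variance} = 6 - \left(\tfrac{4}{3}\times 3 + \tfrac{1}{3}\times 2\right) = \tfrac{4}{3}
\end{aligned}
```

These limits of about 1.333 for z, 0.333 for x1, and 1.333 for the error variance are in line with the estimates of 1.372, 0.338, and 1.305 reported in the output.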


The standard solution is IV regression. For example, we could include x1 
as a regressor and use x2, x3, and x4 as instruments. This can be estimated 
using the command ivregress 2sls y z (x1 = x2 x3 x4). 


Using the sem command instead, we have 


. * SEM measurement error: IV with x2, x3, x4 instruments with x1 consistent 
. sem (y <- z x1) (x1 <- z x2 x3 x4), covar(e.y*e.x1) nolog noheader vce(robust) 

Endogenous variables 
  Observed: y x1 

Exogenous variables 
  Observed: z x2 x3 x4 

                               Robust 
              Coefficient   std. err.       z   P>|z|    [95% conf. interval] 
Structural 
  y 
           x1    1.003994    .0881314   11.39   0.000    .8312594    1.176728 
            z    1.048039    .0508024   20.63   0.000    .9484687     1.14761 
        _cons    1.044999    .0442263   23.63   0.000    .9583173    1.131681 

  x1 
            z    .2116979    .0303231    6.98   0.000    .1522657      .27113 
           x2    .2241763      .02473    9.06   0.000    .1757063    .2726463 
           x3    .1982505     .023787    8.33   0.000    .1516289     .244872 
           x4    .1320508    .0239545    5.51   0.000    .0851008    .1790008 
        _cons   -.0277846    .0343133   -0.81   0.418   -.0950374    .0394683 

     var(e.y)    1.937652    .1829044                    1.610374    2.331443 
    var(e.x1)    1.169616    .0509586                    1.073885    1.273882 

cov(e.y,e.x1)   -.9493958    .1126678   -8.43   0.000   -1.170221    -.728571 


The first equation estimates are close to the DGP values. The standard errors 
of z and x1 are 0.0508 and 0.0881 compared with preceding standard errors 
of z and the true regressor x of 0.0329 and 0.0447. The sem command 
obtains ML estimates. Thus, for this overidentified model, the equivalent 
ivregress 2sls command gives somewhat different estimates, but the 
ivregress liml command gives identical estimates. 
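
This comparison can be reproduced with a short do-file. The sketch below regenerates the data with the same DGP and seed as above and then tabulates the three sets of estimates. The stored-estimate names iv2sls, ivliml, and semiv are illustrative choices, and the eq(1) option follows the usage in section 23.5.9 to line up the first equation of each model.

```stata
* Sketch: compare 2SLS, LIML, and sem ML estimates for the overidentified model
clear
quietly set obs 1000
set seed 10101
generate x  = rnormal(0,1)
generate z  = x + rnormal(0,1)
generate y  = 1 + 1*x + 1*z + rnormal(0,1)
generate x1 = x + rnormal(0,1)
generate x2 = x + rnormal(0,1)
generate x3 = x + rnormal(0,1)
generate x4 = x + rnormal(0,1)

* Moment-based IV estimator
quietly ivregress 2sls y z (x1 = x2 x3 x4), vce(robust)
estimates store iv2sls

* Limited-information ML estimator
quietly ivregress liml y z (x1 = x2 x3 x4), vce(robust)
estimates store ivliml

* sem: structural equation for y, reduced form for x1, correlated errors
quietly sem (y <- z x1) (x1 <- z x2 x3 x4), covar(e.y*e.x1) vce(robust)
estimates store semiv

* Tabulate coefficients and standard errors, matching on the first equation
estimates table iv2sls ivliml semiv, eq(1) b(%9.5f) se
```

Only the point estimates are expected to agree between the liml and sem columns; the reported standard errors need not be identical.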


23.5.8 Measurement-error model estimates using a latent variable 


As an alternative method to obtain consistent estimates, we will perform OLS 
regression with the mismeasured regressor x1 replaced by a latent variable X 
that determines the mismeasured observed variables x1, x2, and x3. This 
example illustrates the use of latent variables. 


SEM for the mismeasured regressor 


Before moving to the desired regression, we model the latent variable using 
a one-factor measurement model. Such models are widely used in 
psychology, for example, where we might have three noisy measures of 
ability that are viewed as being generated by a single unobserved latent 
variable for ability. 


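Thus, we suppose there is an unobserved or latent variable X underlying 
each of x1, x2, and x3. That is, $X_i \sim (0, \sigma_X^2)$ and 

$$
\begin{aligned}
x_{1i} &= \alpha_1 + \gamma_1 X_i + \varepsilon_{1i}, & \varepsilon_{1i} &\sim (0, \sigma_1^2) \\
x_{2i} &= \alpha_2 + \gamma_2 X_i + \varepsilon_{2i}, & \varepsilon_{2i} &\sim (0, \sigma_2^2) \\
x_{3i} &= \alpha_3 + \gamma_3 X_i + \varepsilon_{3i}, & \varepsilon_{3i} &\sim (0, \sigma_3^2)
\end{aligned}
$$

The latent variable $X_i$ requires a normalization. The sem command does 
this by setting $\gamma_1 = 1$. With this normalization and given the DGP, we expect 
that, approximately, the intercepts $\alpha_1 = \alpha_2 = \alpha_3 = 0$, the coefficients of the 
latent variable $\gamma_1 = \gamma_2 = \gamma_3 = 1$, while the three error variances should be 
equal to 1. 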


Figure 23.4 presents a path diagram for this model. The observed 
variables x1, x2, and x3 are represented in rectangles, the latent variable X is 
represented in a circle and is capitalized, and the three errors, also not 
observed, are represented in circles. The errors are statistically independent, 
and each determines only one variable. The single latent variable X 
determines all three of the observed variables. 


Figure 23.4. Path diagram for measurement-error model 


For the sem command, the observed variable names must be in 
lowercase, while uppercase is reserved for any latent variables, in this case X. 
We have 


. * SEM measurement error: Measurement model for X using 3 measures x1, x2, x3 
. sem (x1 x2 x3 <- X), nolog 


Endogenous variables 
  Measurement: x1 x2 x3 

Exogenous variables 
  Latent: X 

Structural equation model                        Number of obs = 1,000 
Estimation method: ml 

Log likelihood = -4932.9313 

 ( 1)  [x1]X = 1 
                                 OIM 
              Coefficient   std. err.       z   P>|z|    [95% conf. interval] 
Measurement 
  x1 
            X           1  (constrained) 
        _cons    -.015836    .0436283   -0.36   0.717   -.1013458    .0696739 

  x2 
            X    .9067773    .0597044   15.19   0.000    .7897588    1.023796 
        _cons    .0237131    .0429141    0.55   0.581   -.0603971    .1078232 

  x3 
            X     1.05723    .0694522   15.22   0.000    .9211066    1.193354 
        _cons    .0208704    .0461941    0.45   0.651   -.0696684    .1114093 

    var(e.x1)    .9230685    .0699544                    .7956572    1.070883 
    var(e.x2)    1.035528    .0655821                    .9146467    1.172385 
    var(e.x3)    1.038118     .078359                    .8953573    1.203641 
       var(X)    .9803578    .0934423                    .8133038    1.181725 

LR test of model vs. saturated: chi2(0) = 0.00            Prob > chi2 = . 


The parameter estimates are close to their expected values of 0 for intercepts 
and 1 for the remaining parameters. Note the SEM terminology that the 
measured variables x1, x2, and x3 are viewed as endogenous because they 
are determined by another variable. This other variable, the latent variable X, 
is viewed as exogenous because it is determined by no variable, only by a 
model error. 


What if we had used a different value for the normalization of the latent 
variable? Suppose we constrain the first slope coefficient to be 2 rather than 
1. Then $x_{1i} = \alpha_1 + 1 \times X_i + \varepsilon_{1i}$ becomes $x_{1i} = \alpha_1 + 2 \times (X_i/2) + \varepsilon_{1i}$, 
so setting $\gamma_1 = 2$ is equivalent to halving the latent variable. Otherwise, 
there is no change. So the only coefficients that should change are $\gamma_2$ and $\gamma_3$, 
which double, and the estimated variance of the latent variable, which is one 
quarter as large. From output not given, the command sem (x1 <- X@2) 
(x2 <- X) (x3 <- X) confirms that this is indeed the case. 


An alternative normalization is to normalize the variance of the latent 
variable. For example, giving the command sem (x1 x2 x3 <- X), 
var(X@0.25) leads to the three estimated slope coefficients $\gamma_1$, $\gamma_2$, and $\gamma_3$ 
changing by the same multiple, while the estimated intercepts and the 
estimated error variances are unchanged. 
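
Both normalizations are instances of one rescaling argument. It is sketched here with an arbitrary constant c > 0 and a rescaled latent variable $X_i^{*}$, both introduced purely for illustration:

```latex
\begin{aligned}
x_{ji} &= \alpha_j + \gamma_j X_i + \varepsilon_{ji}
        = \alpha_j + (c\,\gamma_j)\,(X_i / c) + \varepsilon_{ji}
        = \alpha_j + \gamma_j^{*} X_i^{*} + \varepsilon_{ji}, \qquad j = 1, 2, 3 \\
\gamma_j^{*} &= c\,\gamma_j, \qquad X_i^{*} = X_i / c, \qquad
\mathrm{Var}(X_i^{*}) = \sigma_X^{2} / c^{2}
\end{aligned}
```

Constraining the first slope to 2 corresponds to c = 2, so the other slopes double and the latent variance falls to one quarter. Fixing the latent variance at 0.25, when the default normalization gave an estimate of about 0.98, corresponds to c = sqrt(0.98/0.25), or roughly 1.98, so each slope should be multiplied by about 1.98. In both cases the intercepts and error variances are untouched because c cancels from the fitted values.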


From theory given below, SEM coefficients are identified from the 
covariance matrix of observed variables. In this example, there are 3 
observed variables, so the 3 x 3 covariance matrix has 3 x 4/2 = 6 unique 
entries. There are nine parameters. Three are identified by the means of the 
three observed variables x1, x2, and x3. The remaining six are identified by 
the covariance matrix, so the model is exactly identified. Thus, the LR test at 
the end of the output cannot be implemented. 
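
Exact identification means the six covariance-related parameters can be solved for directly. As a sketch, write $\sigma_{jk}$ for the covariance of $x_j$ and $x_k$ (notation introduced here) and impose the default normalization $\gamma_1 = 1$:

```latex
\begin{aligned}
\sigma_{12} &= \gamma_2\,\sigma_X^2, \qquad
\sigma_{13} = \gamma_3\,\sigma_X^2, \qquad
\sigma_{23} = \gamma_2\gamma_3\,\sigma_X^2 \\[4pt]
\Rightarrow\quad
\sigma_X^2 &= \frac{\sigma_{12}\,\sigma_{13}}{\sigma_{23}}, \qquad
\gamma_2 = \frac{\sigma_{23}}{\sigma_{13}}, \qquad
\gamma_3 = \frac{\sigma_{23}}{\sigma_{12}} \\[4pt]
\sigma_1^2 &= \sigma_{11} - \sigma_X^2, \qquad
\sigma_2^2 = \sigma_{22} - \gamma_2^2\,\sigma_X^2, \qquad
\sigma_3^2 = \sigma_{33} - \gamma_3^2\,\sigma_X^2
\end{aligned}
```

Under the DGP, each variance $\sigma_{jj}$ equals 2 and each covariance $\sigma_{jk}$ equals 1, which gives a latent variance of 1, slopes of 1, and error variances of 1, the values the estimates are close to. Because six moments determine six parameters, no restriction is left over to test.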


What if we used only two of the measures? With only two measures, the 
covariance matrix has 2 x 3/2 = 3 entries, but there are four parameters, 
and the model is not identified. In fact, the command sem (x1 x2 <- X) 
leads to a warning message that the model is not of full rank. Consequently, 
the variance of the error for variable x1 was very small (3.12 x 107°), and 
no confidence intervals were given for this variance or for the variance of 
the latent variable. 


With all four measures used, the covariance matrix has 4 × 5/2 = 10 
entries and 8 parameters. The model is overidentified, and the LR test has 2 
degrees of freedom. From output not given, the $\chi^2(2)$ test statistic has value 
6.12 and p-value of 0.047, so at level 0.05, we would just reject the 
constraints implied by the measurement model. Given the DGP, we do not 
expect such a rejection, though a rejection will occur in 5% of simulations. If 
we reset the seed generating the data to 12345, for example, then we find 
that the test statistic equals 0.65 with p = 0.724. 
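
A sketch of the commands behind this four-measure test follows. It regenerates the data in the same order as the DGP code above, so it should correspond to the model whose test is discussed, though we do not reproduce the output here. The estat gof postestimation command is used to redisplay the model-versus-saturated test.

```stata
* Sketch: one-factor measurement model with four measures and its LR test
clear
quietly set obs 1000
set seed 10101            // change to 12345 to see how the test varies
generate x  = rnormal(0,1)
generate z  = x + rnormal(0,1)
generate y  = 1 + 1*x + 1*z + rnormal(0,1)
generate x1 = x + rnormal(0,1)
generate x2 = x + rnormal(0,1)
generate x3 = x + rnormal(0,1)
generate x4 = x + rnormal(0,1)

* 10 covariance entries, 8 parameters: 3 free slopes, 4 error variances, var(X)
sem (x1 x2 x3 x4 <- X), nolog

* LR test of the fitted model against the saturated model, 10 - 8 = 2 df
estat gof, stats(chi2)
```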


SEM regression controlling for mismeasured regressor 


We now move on to estimation of the model for y, with the mismeasured 
variable x1 replaced by the latent variable X obtained from the preceding 
one-factor measurement model. We have 

$$y_i = \beta_1 + \beta_2 z_i + \beta_3 X_i + u_i$$

where additionally we have data on x1, x2, and x3 that are related to X. 


Figure 23.5 presents a path diagram for this model. The observed 
dependent variable y is determined by the observed variable z and the latent 
variable X. The latent variable X determines the three observed measurements 
x1, x2, and x3. The four errors are $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$, and u. 


Figure 23.5. Path diagram for regression with measurement error 


We have 


. * SEM meas error: regress y on z and x controlling for measurement error in x 
. sem (x1 x2 x3 <- X) (y <- z X), nolog vce(robust) 

Endogenous variables 
  Observed: y 
  Measurement: x1 x2 x3 

Exogenous variables 
  Observed: z 
  Latent: X 

Structural equation model                        Number of obs = 1,000 
Estimation method: ml 

Log pseudolikelihood = -7959.6236 

 ( 1)  [x1]X = 1 
                               Robust 
              Coefficient   std. err.       z   P>|z|    [95% conf. interval] 
Structural 
  y 
            z    1.010318    .0457044   22.11   0.000    .9207387    1.099897 
            X    1.066612     .087781   12.15   0.000    .8945646     1.23866 
        _cons    1.030882    .0384079   26.84   0.000    .9556038     1.10616 

Measurement 
  x1 
            X           1  (constrained) 
        _cons    -.015836    .0377764   -0.42   0.675   -.0898764    .0582045 

  x2 
            X    .9791609    .0562687   17.40   0.000    .8688763    1.089446 
        _cons    .0237131    .0369525    0.64   0.521   -.0487124    .0961386 

  x3 
            X    1.043577    .0540005   19.33   0.000    .9377379    1.149416 
        _cons    .0208704    .0399025    0.52   0.601    -.057337    .0990779 

    var(e.x1)    .9634506    .0520942                    .8665719     1.07116 
    var(e.x2)    .9404119    .0527446                    .8425138    1.049686 
    var(e.x3)    1.110213    .0647267                    .9903305    1.244607 
     var(e.y)    .9578786    .0570865                    .8522786    1.076563 
       var(X)    .9399771    .0794081                    .7965423     1.10924 

     cov(z,X)    .9966285    .0489429   20.36   0.000    .9007021    1.092555 


All coefficients are close to their DGP values. As expected, there is an 
efficiency loss compared with the case where the true regressor x is instead 
included in the regression; the standard errors of z and x are 0.0457 and 
0.0878 compared with standard errors of z and x of 0.0329 and 0.0447. 


There are 5 observed variables, so there are 5 × 6/2 = 15 covariances. 
These covariances are used to identify 11 parameters (the remaining 
parameters are identified by the means of the 5 observed variables). The 
overidentifying restrictions test, available if the vce(robust) option is 
dropped, has 15 − 11 = 4 degrees of freedom and does not reject the 
restrictions of the model at level 0.05 because p = 0.096 > 0.05. Note that 
we have assumed that the three measurement errors are independent, an 
important restriction on error covariances that ensures identification of the 
model. 
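
The count of 11 can be itemized as follows. The grouping is our own bookkeeping; it includes the variance of the observed exogenous variable z, which is a model parameter even though it does not appear in the output above.

```latex
\begin{aligned}
p &= \underbrace{2}_{\text{slopes of } z \text{ and } X \text{ in the } y \text{ equation}}
   + \underbrace{2}_{\text{free slopes } \gamma_2,\ \gamma_3}
   + \underbrace{3}_{\mathrm{Var}(e.x_1),\ \mathrm{Var}(e.x_2),\ \mathrm{Var}(e.x_3)} \\
  &\quad + \underbrace{1}_{\mathrm{Var}(e.y)}
   + \underbrace{1}_{\mathrm{Var}(X)}
   + \underbrace{1}_{\mathrm{Var}(z)}
   + \underbrace{1}_{\mathrm{Cov}(z, X)} = 11 \\[4pt]
\text{degrees of freedom} &= 5 \times 6/2 - 11 = 15 - 11 = 4
\end{aligned}
```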


How do we interpret the model coefficients? Clearly, a one-unit change 
in the latent variable X is associated with a 1.067-unit change in y. But what 
does a one-unit change in X mean? One approach is to note that a one-unit 
change in X is associated with an approximate one-unit change in each of 
x1, x2, and x3, because the estimated coefficients of X in the measurement 
equations are close to 1. So we might view a one-unit change in the latent 
variable X as the same as a one-unit change in the underlying variable x that 
is observed with error. Then a one-unit change in x is 
viewed as associated with a 1.067-unit change in y, after controlling for 
measurement error. 


Finally, we note in passing that the extent of measurement error is 
measured by the reliability ratio—the variance of the true variable divided 
by the variance of the mismeasured variable. If this is known, as it is in this 
example with known DGP, then one can use the reliability() option of the 
sem command. 
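
A minimal sketch of this option follows, using only the single mismeasured regressor x1. Under the DGP the reliability ratio of x1 is Var(x)/Var(x1) = 1/(1 + 1) = 0.5; in applied work this value would have to come from outside information, so treating it as known is an assumption of the example.

```stata
* Sketch: correct for measurement error in x1 using a known reliability ratio
clear
quietly set obs 1000
set seed 10101
generate x  = rnormal(0,1)
generate z  = x + rnormal(0,1)
generate y  = 1 + 1*x + 1*z + rnormal(0,1)
generate x1 = x + rnormal(0,1)

* Reliability of x1 = Var(x)/Var(x1) = 1/(1 + 1) = 0.5 under this DGP
sem (y <- z x1), reliability(x1 0.5) nolog
```

With the reliability fixed at its true value, the coefficients of z and x1 should move back toward their DGP values of 1, in contrast to the uncorrected regression of y on z and x1.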


23.5.9 SEM endogenous regressors example 


As an example, we consider a just-identified two-equation model, using the 
same DGP as in section 7.2. Then $y_1 = \beta_{12} y_2 + \gamma_{12} x_1 + u_1$, 
$y_2 = \beta_{21} y_1 + \gamma_{21} x_2 + u_2$, and the errors $u_1$ and $u_2$ are correlated. From 
section 7.2, $\beta_{12} = \gamma_{12} = \gamma_{21} = 1$ and $\beta_{21} = 0.25$. 


. * DGP for just-identified two-equation simultaneous equations example 
. clear all 

. set seed 10101 

. drawnorm u1 u2, n(1000) corr(1, .7, 1) cstorage(lower) 
(obs 1,000) 

. drawnorm x1 x2, n(1000) corr(1, .3, 1) cstorage(lower) 

. generate y1 =(1/(1-.25))*(x1 + 1*(x2+u2) + u1)   // Reduced form for y1 

. generate y2 = 0.25*y1 + 1*x2 + u2                 // Generating y2 given y1 


We initially focus on estimation of just the y1 equation. The sem 
command equivalent to ivregress uses the structural form for y1 and the 
reduced form for y2. For joint estimation of the y1 and y2 equations, the sem 
command equivalent to reg3 uses both structural form equations. 


. * SEM commands for 2SLS and 3SLS in just-identified model 
. qui ivregress 2sls y1 (y2=x2) x1, noheader 

. estimates store iv_2sls 

. qui sem (y1 <- y2 x1) (y2 <- x1 x2), covar(e.y1*e.y2) 

. estimates store sem_rf 

. qui reg3 (y1 = y2 x1) (y2 = y1 x2) 

. estimates store iv_3sls 

. qui sem (y1 <- y2 x1) (y2 <- y1 x2), covar(e.y1*e.y2) 

. estimates store sem_sf 


We next compare the various estimates of the parameters of the y1 
equation. 


. * SEM estimators compared with 2SLS and 3SLS in just-identified model 
. estimates table iv_2sls sem_rf iv_3sls sem_sf, eq(1) b(%9.5f) se 

    Variable    iv_2sls     sem_rf    iv_3sls     sem_sf 
#1 
          y2    0.95510    0.95510    0.95510    0.95510 
                0.02797    0.02797    0.02797    0.02797 
          x1    1.04403    1.04403    1.04403    1.04403 
                0.04095    0.04095    0.04095    0.04095 
       _cons    0.00481    0.00481    0.00481    0.00481 
                0.03383    0.03383    0.03383    0.03383 
y2 
          x1               0.41523 
                           0.05307 
          x2               1.27681    0.92532    0.92532 
                           0.05322    0.04617    0.04617 
          y1                          0.28823    0.28823 
                                      0.02121    0.02121 
       _cons               0.04345    0.03010    0.03010 
                           0.05040    0.02907    0.02907 
   var(e.y1)               1.14308               1.14308 
                           0.09368               0.09368 
   var(e.y2)               2.53733               0.84140 
                           0.11347               0.09505 
cov(e.y1, 
       e.y2)               1.40313               0.68740 
                           0.09954               0.06958 


Legend: b/se 


The estimates for the y1 equation are equivalent because in a just-identified 
system 2SLS, limited information ML, 3SLS, and full information ML are 
equivalent. In the overidentified case, there will be finite sample differences 
because SEM estimators are ML estimators, while 2SLS and 3SLS estimators are 
moment-based estimators. 


23.5.10 SEM estimation in general 


The linear SEM approach in general specifies a linear model involving 
measured variables, such as $y_i$ and $x_i$, and latent or unobserved variables, 
such as model errors. As detailed below, this accommodates a very wide 
range of models that are routinely used in empirical social science research. 


Let $z_i$ denote the measured variables, and let $\theta$ denote all the parameters 
of the model. Without any structure, the sample mean vector and variance 
matrix are 

$$\bar{z} = \frac{1}{N}\sum_{i=1}^{N} z_i, \qquad S_{zz} = \frac{1}{N-1}\sum_{i=1}^{N} (z_i - \bar{z})(z_i - \bar{z})'$$

A SEM, like the preceding linear regression example, implies that 

$$\begin{bmatrix} E(z_i) \\ \mathrm{Var}(z_i) \end{bmatrix} = \begin{bmatrix} \mu(\theta) \\ \Sigma(\theta) \end{bmatrix}$$

where different SEMs lead to different functional forms for $\mu(\theta)$ and $\Sigma(\theta)$. 
This model can be fit in two ways. 


ML estimation 


Under the assumption of normality of $z_i$ and independence over i, the 
log-likelihood function has the same form as that in (23.5). This can be shown to 
simplify to 

$$\ln L(\theta) = -\frac{N}{2}\left[ K \ln 2\pi + \ln\{\det \Sigma(\theta)\} + \mathrm{trace}\{D\,\Sigma(\theta)^{-1}\} \right] \tag{23.6}$$

$$D = S_{zz} + \{\bar{z} - \mu(\theta)\}\{\bar{z} - \mu(\theta)\}'$$

where K is the number of measured variables. Strictly speaking, 
$D = S^{*}_{zz} + \{\bar{z} - \mu(\theta)\}\{\bar{z} - \mu(\theta)\}'$, where $S^{*}_{zz}$ 
divides the sum by N rather than N − 1. Some other SEM packages 
normalize by N − 1, leading to slightly different results. 


This estimator is obtained using sem command option method(ml), the 
default. The estimates are robust to nonnormality in the model errors, though 
not to nonnormality in the latent variables because this leads to 
misspecification of $\Sigma(\theta)$. Robust standard errors can be obtained using the 
vce(robust) option. A wide range of vce() options are available, including 
cluster and sbentler. 


The method(mlmv) option uses imputation methods under the 
assumptions of missingness at random and joint normality of all variables 
(both observed and latent) when data include missing values. This leads to 
less loss of precision than if casewise deletion is used, at the expense of 
stronger distributional assumptions. 


Moment-based estimation 


An alternative approach matches sample moments and model moments, so 
that $\bar{z} = \mu(\theta)$ and $S_{zz} = \Sigma(\theta)$. The preceding linear regression model 
example was just identified, and there was a unique solution for $\theta$. More 
generally, SEMs may be overidentified, in which case there is no unique 
solution. Then estimation is by generalized method of moments (GMM). We 
stack the sample mean vector and sample variance matrix, similarly stack the 
corresponding parameter restrictions, and define 

$$w = \begin{bmatrix} \bar{z} \\ \mathrm{vech}(S_{zz}) \end{bmatrix}, \qquad \tau(\theta) = \begin{bmatrix} \mu(\theta) \\ \mathrm{vech}\{\Sigma(\theta)\} \end{bmatrix}$$

where vech(.) is the vector-half operator that vectorizes the lower triangular 
half of a symmetric matrix. In the just-identified case, we could solve 
$w = \tau(\theta)$ for $\theta$. 


In the overidentified case, we minimize a quadratic form in the discrepancy 
$\{w - \tau(\theta)\}$. Then $\hat{\theta}$ minimizes 

$$Q(\theta) = \{w - \tau(\theta)\}'\, W^{-1} \{w - \tau(\theta)\} \tag{23.7}$$

where $W^{-1}$ is a weighting matrix. This GMM estimator is often called 
weighted least squares in the SEM literature and is also called a minimum 
chi-squared estimator. If the model is just identified, all choices of W lead to the 
same estimate $\hat{\theta}$ and $Q(\hat{\theta}) = 0$. If the model is overidentified and $W^{-1}$ is 
the optimal weighting matrix, then the departure of $Q(\hat{\theta})$ from 0 provides a 
test of overidentifying restrictions. 


This estimator is also called the asymptotic distribution-free estimator 
and is implemented using the method(adf) option of the sem command, 
which specifies W to be the estimated covariance matrix of w. This 
weighting matrix requires estimating the variances and covariances of the 
components of $\bar{z}$ and $S_{zz}$. 


Consistency of this estimator requires that $E\{w - \tau(\theta)\} = 0$, so the 
variance matrix $\Sigma(\theta)$ needs to be correctly specified. The weighting matrix 
used is the optimal weighting matrix for GMM based on the moment 
condition $E\{w - \tau(\theta)\} = 0$ and assuming independence over i. So the 
default standard errors are automatically robust, provided the observations 
are independent. If observations are instead clustered, but we still assume 
that $\Sigma(\theta)$ is correctly specified, then cluster–robust standard errors can be 
obtained by using a cluster bootstrap, provided the number of clusters is 
sufficiently large. 


Standard errors and overidentifying restrictions test 


A range of standard errors can be obtained after method(ml), including 
heteroskedastic–robust and cluster–robust standard errors. However, the 
overidentifying restrictions test is possible only when default standard errors 
are used (option vce(oim)) or when heteroskedastic–robust standard errors 
are obtained using option vce(sbentler). After method(adf), the default 
standard errors (option vce(oim)) are essentially heteroskedastic robust. 
Cluster–robust standard errors are then not available, though bootstrap and 
jackknife standard errors are. 
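
The estimation methods and variance options discussed in this section can be tried side by side. The sketch below uses a simplified DGP of our own (four unit-variance measures of one latent variable) and a hypothetical 20% missingness rate for x4; neither is taken from the examples above.

```stata
* Sketch: one-factor model under different estimation methods and VCEs
clear
quietly set obs 1000
set seed 10101
generate x = rnormal(0,1)
forvalues j = 1/4 {
    generate x`j' = x + rnormal(0,1)
}

* ML with default (OIM) standard errors; LR overidentification test reported
sem (x1 x2 x3 x4 <- X), nolog

* ML with Satorra-Bentler standard errors and scaled test
sem (x1 x2 x3 x4 <- X), nolog vce(sbentler)

* Moment-based asymptotic distribution-free estimator
sem (x1 x2 x3 x4 <- X), nolog method(adf)

* Set about 20% of x4 to missing (hypothetical), then use all available data
replace x4 = . if runiform() < 0.2
sem (x1 x2 x3 x4 <- X), nolog method(mlmv)
```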


Computational considerations 


With K observed variables, there are K(K + 1)/2 variance components, so 
at most there can be K(K + 1)/2 parameters, excluding the intercepts and 
means. One sign of nonidentification is that the log likelihood is not 
changing over iterations. In that case, one can set the number of iterations to 
the number where this problem arises, using the iterate(#) option. 
Parameter estimates in the resulting output that have missing standard errors 
are signals that the parameter is unidentified. 


Numerical problems may arise. Even though the SEM is linear in 
parameters, the implied covariance matrix $\Sigma(\theta)$ can be nonlinear in 
parameters, leading to numerical problems in maximizing $\ln L(\theta)$ or 
minimizing $Q(\theta)$. 


From (23.6), the ML methods just require data on $\bar{z}$ and $S_{zz}$ rather than 
on individual observations. One can just provide these rather than all the 
data. The ssd command creates a special dataset containing the values of $\bar{z}$ 
and $S_{zz}$ that can be used to fit models with sem. 
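
As a sketch of this workflow, the following enters summary statistics by hand and fits a one-factor model to them. The numbers are hypothetical: they are the population means, variances, and covariances of three unit-variance measures of a standard normal latent variable, not statistics computed from any dataset in this chapter.

```stata
* Sketch: fit a one-factor model from summary statistics only
clear all
ssd init x1 x2 x3                        // declare the variables
ssd set observations 1000                // sample size
ssd set means 0 0 0                      // mean vector
ssd set covariances 2 \ 1 2 \ 1 1 2      // lower triangle of covariance matrix
sem (x1 x2 x3 <- X), nolog
```

Because these moments satisfy the one-factor structure exactly, the fitted slopes, error variances, and latent variance should all be close to 1.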


Normalization 


Normalization constraints on the latent variables are necessary because 
these, unlike observed variables, are not naturally scaled. The model 
$y = 100 + \eta$ with $\eta \sim (100, 25)$ is equivalent to the model $y = 200 + 5\eta$ 
with $\eta \sim (0, 1)$. 
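
The equivalence can be verified from the first two moments of y under each parameterization:

```latex
\begin{aligned}
y = 100 + \eta,\ \eta \sim (100, 25): &\quad E(y) = 100 + 100 = 200, \quad \mathrm{Var}(y) = 25 \\
y = 200 + 5\eta,\ \eta \sim (0, 1): &\quad E(y) = 200 + 5 \times 0 = 200, \quad \mathrm{Var}(y) = 5^2 \times 1 = 25
\end{aligned}
```

Data on y alone cannot distinguish the two, so the location and scale of the latent variable must be fixed by convention.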


To normalize the centering, Stata sets latent exogenous variables to have 
mean 0 and latent endogenous variables to have intercept 0. To set the scale, 
Stata sets the coefficients on paths from latent variables to the first observed 
endogenous variable to 1 and, if there is no endogenous variable, sets the 
coefficients on paths from latent variables to the first endogenous latent 
variable to 1 if the latent variable is measured by other latent variables only. 


23.6 Generalized structural equation models 


The mixed command is relevant for multilevel linear models with latent normal 
variables for a single dependent variable. The me commands extend the mixed 
command to multilevel GLMs with latent normal variables for a single 
dependent variable. The sem command extends the mixed command to permit 
more than one dependent variable but is restricted to a single level. 


The gsem command extends the sem command to GLMs, presented in 
section 13.3.7. Furthermore, it allows for multiple levels. Simultaneity is 
permitted, but, aside from linear models, the models must be recursive. 


The gsem command provides several features not provided by the sem 
command. It enables one to 


1. fit SEMs containing generalized linear response variables; 

2. fit recursive simultaneous equation nonlinear SEMs; 

3. fit SEMs with categorical latent variables (FMMs); 

4. use factor-variable notation (not possible with the sem command); and 
5. fit multilevel mixed SEMs. 


We focus on the use of gsem commands for single-level models and their 
use in fitting more flexible parametric models with unobserved heterogeneity, 
such as for models with selection on unobservables. An example recursive 
two-equation model with one level specifies 

$$
\begin{aligned}
E(y_{1i} \mid x_{1i}, y_{2i}, u_i) &= g_1^{-1}(\alpha y_{2i} + x_{1i}'\beta_1 + \delta u_i), & y_{1i} &\sim F_1 \\
E(y_{2i} \mid x_{2i}, u_i) &= g_2^{-1}(x_{2i}'\beta_2 + u_i), & y_{2i} &\sim F_2 \\
u_i &\sim N(0, \sigma_u^2)
\end{aligned}
$$

where $g_1(\cdot)$ and $g_2(\cdot)$ are link functions and $F_1$ and $F_2$ are densities for 
standard GLMs. The model is recursive because a model for $y_1 | y_2$ is specified, 
but the model specified for $y_2$ does not depend on $y_1$. 


The gsem command applies to many models and data types: the normal for 
unbounded data; logit, probit, and complementary log–log for binary data; 
multinomial, ordered logit, ordered probit, and ordered complementary log–log 
for multinomial data; beta for continuous data on (0, 1); lognormal, 
exponential, gamma, loglogistic, and Weibull for continuous positive data; and 
Poisson and negative binomial for count data. 


23.6.1 GSEM estimation 


Models with continuous latent variables are fit by ML. Let y denote the vector 
of dependent variables, X the regressor matrix, and u the vector of latent 
variables. If u were observed, the likelihood would be the joint density of the 
data $f(y|X, u, \theta)$. Instead, u is an unobserved normally distributed variable 
that needs to be integrated out. So the likelihood function is 

$$L(\theta) = \int f(y \mid X, u, \theta)\, \phi(u \mid \mu_u, \Sigma_u)\, du$$

where $\phi(\cdot \mid \mu, \Sigma)$ denotes the density of the multivariate normal with mean $\mu$ 
and variance $\Sigma$. 


This integral has no closed-form solution, aside from exceptions such as 
linear SEMs, and is instead computed numerically using quadrature methods. 
And, as for the me commands, any model misspecification will lead to 
inconsistent parameter estimates. 


23.6.2 The gsem command 


The gsem command, like the sem command, is very complicated. We provide a 
very brief summary. The syntax is 


gsem paths [if] [in] [weight] [, options] 


where paths defines the GSEM in command-language path notation; examples 
are given below. 


The family() and link() options define the particular GLM used. The 
options covariance(), variance(), and mean() override the defaults for the 
latent variables. The lclass() option defines latent classes. 


The gsem command provides four numerical integration methods via the 
option intmethod(): mvaghermite (the default), mcaghermite, ghermite, and 
laplace. The methods were described in more detail in section 23.4.1. 
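
One practical check on the numerical integration is to refit a model under a different method or number of integration points and confirm that the estimates are stable. The sketch below does this for a Poisson model with a normal latent variable. The simulated DGP (intercept 0.5, slope 0.5, heterogeneity standard deviation 0.5) and the stored-estimate names are hypothetical choices for illustration.

```stata
* Sketch: sensitivity of gsem estimates to the integration method
clear
quietly set obs 2000
set seed 10101
generate x = rnormal(0,1)
generate u = rnormal(0, 0.5)                  // hypothetical heterogeneity
generate y = rpoisson(exp(0.5 + 0.5*x + u))   // hypothetical DGP

* Default: mean-variance adaptive Gauss-Hermite quadrature
quietly gsem (y <- x L, poisson)
estimates store aghermite

* Nonadaptive Gauss-Hermite quadrature with more integration points
quietly gsem (y <- x L, poisson), intmethod(ghermite) intpoints(15)
estimates store ghermite15

estimates table aghermite ghermite15, b(%9.5f) se
```

Materially different estimates across the two columns would suggest increasing intpoints() before relying on the results.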


23.6.3 Some GSEM examples 


Table 23.2 presents some example gsem commands. The first five examples are 
identical to some leading standard Stata estimation commands and yield 
exactly the same results. The last two entries present two recursive structural 
model examples that cannot be fit using other commands. The first of these 
examples has independent errors, while the second introduces error correlation 
by the latent variable L. Uppercase is reserved for latent variable names, while 
observed variables must be in lowercase. 


Table 23.2. gsem commands and equivalent standard Stata commands 


Estimator                    Usual command and SEM equivalent 

Logit                        logit y x1 x2 
                             gsem y <- x1 x2, logit 
                             gsem y <- x1 x2, family(binomial) link(logit) 

Multinomial logit            mlogit y x1 x2, baseoutcome(1) 
                             gsem y <- x1 x2, mlogit 

Ordered logit                ologit y x1 x2 
                             gsem y <- x1 x2, ologit 

Poisson regression           poisson y x1 x2 
                             gsem y <- x1 x2, poisson 

Linear mixed model           mixed y x || id: 
                             gsem (y <- x M1[id]) 

Finite mixture:              gsem (y1 y2 <- x1 x2), poisson lclass(C 2) 
two count outcomes 

Recursive exogenous:         gsem (y1 <- x1 x2, logit) /// 
binary and ordinal               (y2 <- z1 z2, ologit) (y2 <- y1) 

Recursive endogenous:        gsem (y1 <- y2 x1 L, poisson) /// 
count and continuous             (y2 <- x2 L) 


Note that the gsem command permits factor-variable notation, such as 
c.x1##c.x2. While recursive models, such as the last two models in the table, 
can be fit, gsem does not cover more general simultaneous equations with y1 
depending on y2 and y2 depending on y1 unless the model is a linear model. 
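
The equivalences in the first rows of the table are easy to confirm on simulated data. The sketch below does so for the logit case; the DGP and the stored-estimate names are hypothetical, and the gsem paths are written in parentheses with the family and link given as path options.

```stata
* Sketch: logit and its gsem equivalents on simulated data
clear
quietly set obs 1000
set seed 10101
generate x1 = rnormal(0,1)
generate x2 = rnormal(0,1)
generate y  = runiform() < invlogit(0.5 + 1*x1 - 1*x2)   // hypothetical DGP

quietly logit y x1 x2
estimates store m_logit

quietly gsem (y <- x1 x2, logit)
estimates store m_gsem1

quietly gsem (y <- x1 x2, family(binomial) link(logit))
estimates store m_gsem2

* The three columns should show the same coefficients and standard errors
estimates table m_logit m_gsem1 m_gsem2, b(%9.5f) se
```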


23.6.4 Poisson with latent normal variable for overdispersion 


For count-data regression, the Poisson distribution is often too restrictive. The 
negative binomial can be obtained as a Poisson-gamma mixture. An alternative 
model that accommodates overdispersion is a Poisson-normal mixture that, 
unlike the negative binomial, has no closed-form density. 


The Poisson-normal mixture model specifies that $y_i$ conditional on 
regressors $x_i$ and an unobserved heterogeneity term $u_i$ is Poisson distributed 
with mean $\exp(x_i'\beta + u_i)$, where $u_i$ is $N(0, \sigma_u^2)$. 


This model can be fit by the gsem command with model option poisson 
and introducing a latent variable, denoted here by L, that is normally distributed 
with mean 0, coefficient normalized to equal 1, and variance to be estimated. 
For the doctor-visits example given in section 20.5.4, we obtain 


. * GSEM: Poisson normal mixture model 
. qui use mus220mepsdocvis, clear 


. global xlist2 medicaid age age2 educyr actlim totchr 
. gsem (docvis <- private $xlist2 L, poisson), nolog vce(robust) 


Generalized structural equation model            Number of obs = 3,677 
Response: docvis 
Family:   Poisson 
Link:     Log 
Log pseudolikelihood = -10564.921 

 ( 1)  [docvis]L = 1 
                               Robust 
              Coefficient   std. err.       z   P>|z|    [95% conf. interval] 
docvis 
      private    .1835492     .035085    5.23   0.000    .1147838    .2523146 
     medicaid     .085144    .0479742    1.77   0.076   -.0088837    .1791716 
          age     .322203     .061233    5.26   0.000    .2021886    .4422174 
         age2   -.0021039    .0004067   -5.17   0.000    -.002901   -.0013068 
       educyr    .0296122    .0045635    6.49   0.000    .0206679    .0385565 
       actlim    .1647751    .0362693    4.54   0.000    .0936886    .2358615 
       totchr     .311443    .0123445   25.23   0.000    .2872483    .3356377 
            L           1  (constrained) 
        _cons   -11.78405     2.29393   -5.14   0.000   -16.28007   -7.288027 

       var(L)    .6272045    .0225134                    .5845954    .6729193 


The variance is clearly different from zero. 


This example is unusual because, apart from the intercept, the conditional 
mean is unchanged by adding a mixture. We have 
$E(y_i|x_i, L_i) = \exp(x_i'\beta + L_i) = \exp(L_i) \times \exp(x_i'\beta)$, so 
$E(y_i|x_i) = a \times \exp(x_i'\beta) = \exp(\ln a + x_i'\beta)$, where $a = E\{\exp(L_i)\}$ simply 
changes the intercept of the model. This special result holds for a conditional 
mean that is exponential or linear. 
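
For the normal mixture, the intercept shift and the implied overdispersion have closed forms, even though the mixture density itself does not. The following uses standard lognormal moments; the symbols $\sigma_L^2$ for the variance of $L_i$ and $\mu_i$ for the conditional mean are notation introduced here.

```latex
\begin{aligned}
a = E\{\exp(L_i)\} &= \exp(\sigma_L^2/2), \qquad L_i \sim N(0, \sigma_L^2) \\
E(y_i \mid x_i) &= \exp(\sigma_L^2/2 + x_i'\beta) \equiv \mu_i \\
\mathrm{Var}(y_i \mid x_i) &= \mu_i + \{\exp(\sigma_L^2) - 1\}\,\mu_i^2
\end{aligned}
```

With the estimated variance of L of 0.627 reported above, the implied intercept shift is about 0.627/2 = 0.314, and the coefficient on $\mu_i^2$ in the variance is about exp(0.627) − 1 = 0.87. The variance is therefore quadratic in the mean, the same form as in the negative binomial model.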


So, provided models are correctly specified, we can compare the slope 
coefficients across models. We do so for Poisson, Poisson-normal, and negative 
binomial models. 


. * GSEM: Compute Poisson, Poisson-normal, and negative binomial 
. qui poisson docvis private $xlist2 

. estimates store pois_def 

. qui poisson docvis private $xlist2, vce(robust) 

. estimates store pois_rob 

. qui gsem (docvis <- private $xlist2 L, poisson) 

. estimates store normal 

. qui gsem (docvis <- private $xlist2 L, poisson), vce(robust) 

. estimates store norm_rob 

. qui nbreg docvis private $xlist2 

. estimates store nb2_def 

. qui nbreg docvis private $xlist2, vce(robust) 

. estimates store nb2_rob 


We obtain the following estimates, along with default and
heteroskedastic-robust standard errors.


. * GSEM: Compare Poisson-normal with Poisson and negative binomial
. estimates table pois_def pois_rob normal norm_rob nb2_def nb2_rob,
>     eq(1) b(%8.4f) se stfmt(%8.1f) stats(ll N)

    Variable   pois_def   pois_rob     normal   norm_rob    nb2_def    nb2_rob
#1
     private     0.1422     0.1422     0.1835     0.1835     0.1641     0.1641
                 0.0143     0.0364     0.0344     0.0351     0.0332     0.0369
    medicaid     0.0970     0.0970     0.0851     0.0851     0.1003     0.1003
                 0.0189     0.0568     0.0469     0.0480     0.0454     0.0567
         age     0.2937     0.2937     0.3222     0.3222     0.2941     0.2941
                 0.0260     0.0630     0.0618     0.0612     0.0602     0.0646
        age2    -0.0019    -0.0019    -0.0021    -0.0021    -0.0019    -0.0019
                 0.0002     0.0004     0.0004     0.0004     0.0004     0.0004
      educyr     0.0296     0.0296     0.0296     0.0296     0.0287     0.0287
                 0.0019     0.0048     0.0044     0.0046     0.0042     0.0049
      actlim     0.1864     0.1864     0.1648     0.1648     0.1895     0.1895
                 0.0146     0.0397     0.0362     0.0363     0.0348     0.0394
      totchr     0.2484     0.2484     0.3114     0.3114     0.2776     0.2776
                 0.0046     0.0126     0.0122     0.0123     0.0121     0.0132
           L                           1.0000     1.0000
                                       0.0000     0.0000
       _cons   -10.1822   -10.1822   -11.7840   -11.7840   -10.2975   -10.2975
                 0.9720     2.3692     2.3128     2.2939     2.2474     2.4241
      var(L)                           0.6272     0.6272
                                       0.0221     0.0225
    /lnalpha                                                -0.4453    -0.4453
                                                             0.0307     0.0378
Statistics
          ll   -15019.6   -15019.6   -10564.9   -10564.9   -10589.3   -10589.3
           N       3677       3677       3677       3677       3677       3677

Legend: b/se


The slope coefficients vary by 10% to 30% across the models. The
Poisson-normal model fits better than the negative binomial with a log likelihood that is
24.4 higher (-10564.9 versus -10589.3), and with robust standard errors that are quite close to default
standard errors because the normally distributed unobserved heterogeneity term 
controls for overdispersion. 


23.6.5 Poisson model with endogeneity 


We consider the structural-model approach of section 20.7.2, a model for
docvis with endogenous regressor private. There a two-step estimator was
used. Here we consider a more structural model that introduces an error term u
that is common to both the Poisson equation of interest and the first-stage
equation for the endogenous regressor.
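
The two-equation structure just described can be sketched as follows. The notation is our own assumption rather than the book's: $y_{1i}$ is docvis, $y_{2i}$ is private, $\mathbf{x}_{1i}$ holds the exogenous regressors in both equations, $\mathbf{z}_i$ holds the excluded instruments (income and ssiratio), and $\lambda$ is the loading of the common error in the first-stage equation.

$$
\begin{aligned}
y_{1i} \mid y_{2i}, \mathbf{x}_{1i}, u_i &\sim \text{Poisson}(\mu_i), \qquad
\mu_i = \exp(\beta_1 y_{2i} + \mathbf{x}_{1i}'\boldsymbol\beta_2 + u_i), \\
y_{2i} &= \mathbf{x}_{1i}'\boldsymbol\pi_1 + \mathbf{z}_i'\boldsymbol\pi_2 + \lambda u_i + e_i, \\
u_i &\sim N(0, \sigma_u^2), \qquad e_i \sim N(0, \sigma_e^2), \qquad u_i \perp e_i .
\end{aligned}
$$

Under this sketch, $y_{2i}$ is endogenous in the Poisson equation whenever $\lambda \neq 0$, because $u_i$ then enters both equations. The coefficient of $u_i$ in the Poisson equation is set to one for identification.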


To do so, we obtain the MLE using the gsem command. In both equations,
we include the latent variable L, which plays the role of the common error u.
We obtain


. * GSEM: Endogenous regressor in a Poisson model
. gsem (docvis <- private $xlist2 L, poisson)
>     (private <- $xlist2 income ssiratio L), nolog vce(robust)

Generalized structural equation model              Number of obs = 3,677

Response: docvis
Family:   Poisson
Link:     Log

Response: private
Family:   Gaussian
Link:     Identity

Log pseudolikelihood = -12796.769

 ( 1)  [docvis]L = 1

                                Robust
                  Coefficient  std. err.       z    P>|z|    [95% conf. interval]
docvis
         private     .6331611   .2194449     2.89   0.004     .2030571    1.063265
        medicaid     .2675884   .1005366     2.66   0.008     .0705402    .4646366
             age     .3687153   .0660348     5.58   0.000     .2392895    .4981411
            age2    -.0023977   .0004366    -5.49   0.000    -.0032533    -.001542
          educyr     .0176344   .0073831     2.39   0.017     .0031637    .0321051
          actlim     .1852541   .0385927     4.80   0.000     .1096138    .2608945
          totchr     .3038513   .0130789    23.23   0.000     .2782171    .3294855
               L            1  (constrained)
           _cons    -13.71873   2.504824    -5.48   0.000     -18.6281   -8.809367
private
        medicaid    -.3945375   .0173646   -22.72   0.000    -.4285714   -.3605036
             age    -.0829381   .0293402    -2.83   0.005    -.1404439   -.0254323
            age2     .0005247   .0001956     2.68   0.007     .0001412    .0009081
          educyr     .0213581   .0020524    10.41   0.000     .0173355    .0253806
          actlim    -.0297923   .0176555    -1.69   0.092    -.0643966    .0048119
          totchr     .0185451   .0057356     3.23   0.001     .0073036    .0297865
          income     .0026301   .0004877     5.39   0.000     .0016742    .0035859
        ssiratio    -.0724733   .0211669    -3.42   0.001    -.1139597    -.030987
               L    -.1355801   .0575232    -2.36   0.018    -.2483236   -.0228367
           _cons     3.528705   1.094523     3.22   0.001      1.38348    5.673931

          var(L)     .6679536    .047179                      .5815997    .7671291
  var(e.private)     .1850348   .0110503                       .164596    .2080115


The structural equation coefficient estimates are generally within 20% of the 
two-step estimates given in section 20.7.2. 


The coefficient of the latent variable L in the private equation is
statistically significant at 5%. In this example, the gsem command normalizes
the coefficient of L to equal one in the Poisson equation for docvis; see
constraint (1) in the output.


23.6.6 Prediction and MEs 


For GSEMs that include latent variables, predictions and MEs vary with the 
method used to control for the latent variables. The methods used are similar to 
those for nonlinear mixed models detailed in section 23.4.3. 


For predict, there are four options: mu conditional(ebmeans), the default;
mu conditional(ebmodes); mu conditional(fixedonly); and mu marginal,
which numerically integrates out the latent variables.


We apply several of these prediction options to the current example. 


. * GSEM: Predict y controlling for latent variable L
. predict mu_ebmeans, mu conditional(ebmeans)
(using 7 quadrature points)

. predict mu_fixed, mu conditional(fixedonly)
note: The prediction for "docvis" has an observed endogenous regressor.
note: Observed endogenous regressors will be treated as fixed at their current
      values.

. predict mu_marg, mu marginal
note: The prediction for "docvis" has an observed endogenous regressor.
note: Observed endogenous regressors will be treated as fixed at their current
      values.


The notes in the output indicate that the predictions are based on the observed 
values of the endogenous regressor private, which is fine because the 
predictions use parameter estimates that control for this endogeneity. 


The predictions and their correlations are as follows. 


. * GSEM: Compare predictions of y
. summarize docvis mu_ebmeans mu_fixed mu_marg

    Variable        Obs        Mean    Std. dev.        Min        Max
      docvis      3,677    6.822682    7.394937           0        144
  mu_ebmeans      3,677    6.469881    6.545503    .7709767   138.1847
    mu_fixed      3,677    5.361157    3.357089    1.165272   32.09691
     mu_marg      3,677    7.486914    4.688211    1.627315   44.82368

. correlate docvis mu_ebmeans mu_fixed mu_marg
(obs=3,677)

                 docvis mu_ebm~s mu_fixed  mu_marg
      docvis     1.0000
  mu_ebmeans     0.9942   1.0000
    mu_fixed     0.3567   0.4036   1.0000
     mu_marg     0.3567   0.4036   1.0000   1.0000


The conditional(ebmeans) option estimates a mean of the latent variable for
each grouping of observations, but here there is no grouping, so it is close to
including a separate fixed effect for each individual. Thus, the prediction
mu_ebmeans essentially equals the sample value of docvis. The fixedonly
prediction sets $L_i = 0$ for each observation, so the fitted value for $y_1$
(docvis) is simply $\exp(\hat\beta_1 y_2 + \mathbf{x}_1'\hat{\boldsymbol\beta}_2)$, where $y_2$ is private.
Because the conditional mean here is of exponential form, the marginal
prediction is a simple multiple of this; see section 23.6.4 for an explanation. So
mu_fixed and mu_marg are perfectly correlated.
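
The proportionality of the two predictions can be checked directly. The following is a minimal sketch, assuming that mu_fixed and mu_marg from the predict commands above are still in memory and that the latent variable is normal with mean zero, so that the multiple should equal exp{var(L)/2}. The variable name ratio is hypothetical.

```stata
* Sketch: is the marginal prediction a constant multiple of the fixed-only one?
* Assumes mu_fixed and mu_marg were created by the predict commands above
generate double ratio = mu_marg/mu_fixed   // hypothetical variable name
summarize ratio                            // little or no variation expected

* Implied multiple under normality, using the reported estimate var(L) = .6679536
display exp(0.6679536/2)
```

The last line evaluates to about 1.3965, which agrees with the ratio of the sample means reported above, 7.486914/5.361157.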


The margins postestimation command is not applicable to the current 
example, because the command works only for statistics whose computation 
does not involve latent variables. Instead, MEs can be obtained using the 
manual method of section 13.7.10. First, predict using the mu marginal option. 
Second, compute a numerical derivative by perturbing the regressor by a small 
amount, repredicting, and comparing average predictions. Third, embed this in 
a bootstrap loop to get a standard error for the AME. 
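
The first two steps can be sketched as follows for the regressor totchr. This is an illustration under stated assumptions, not the book's own code: the step size 0.01 is an arbitrary choice, the variable names mu0, mu1, totchr_orig, and me_totchr are hypothetical, and private is held at its observed values, as in the predictions above.

```stata
* Sketch: manual AME of totchr on the marginal mean of docvis,
* to be run after the two-equation gsem command of section 23.6.5
predict double mu0, mu marginal outcome(docvis)

* Perturb the regressor by a small amount (0.01 is an assumed step size)
generate double totchr_orig = totchr
quietly replace totchr = totchr + 0.01
predict double mu1, mu marginal outcome(docvis)

* Restore the data and average the finite-difference derivative
quietly replace totchr = totchr_orig
generate double me_totchr = (mu1 - mu0)/0.01
summarize me_totchr
```

The sample mean of me_totchr is the AME estimate. For the third step, these commands would need to be wrapped, together with the gsem command, in a program that is passed to the bootstrap prefix.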


23.6.7 GSEM for FMM with multiple outcomes 


The fmm prefix, presented in section 23.2, estimates single-equation FMMs.
These models can also be fit using the gsem command. For example, exactly 
the same estimates are obtained by the commands fmm 2: poisson y x1 x2 and 
gsem (y <- x1 x2), poisson lclass(C 2). 


The gsem command can be extended to allow estimation of an FMM with 
more than one outcome variable. Examples include pairs of binary, 
multinomial, count, or continuous outcomes. The simultaneous estimation of 
multiple FMMs provides combined output that facilitates comparison across 
responses. Additionally, it is easy to impose and test parametric restrictions 
across equations. These advantages are appealing especially when the multiple 
outcomes are related or similar in some way. 


We use data on the number of annual emergency room visits (emr) and 
hospitalizations (hosp) from the Medicare population. The regressors in the 
model are privins (private insurance), medicaid (public insurance), numchron 
(number of chronic conditions), age (age/10), and male. The outcome variables 
are positively correlated. Because these outcomes are relatively rare even in the 
elderly population, the outcomes have a high proportion of zero events. 


. * GSEM: Read in, and describe data for bivariate dependent variables example
. qui use mus223gsemfmmexample, clear

. global xlist privins medicaid numchron age male

. summarize emr hosp $xlist

    Variable        Obs        Mean    Std. dev.        Min        Max
         emr      4,355    .2597015    .6986564           0         12
        hosp      4,355     .289093    .7342015           0          8
     privins      4,355    .7752009    .4174979           0          1
    medicaid      4,355    .0911596     .287869           0          1
    numchron      4,355    1.534099    1.346467           0          8
         age      4,355    7.405052    .6345418         6.6       10.9
        male      4,355    .4043628    .4908247           0          1


From output not listed, the correlation of emr and hosp is 0.477.


We fit a bivariate two-component FMM negative binomial model for the two 
count outcomes. 


. * GSEM: Fit two-component FMM bivariate negative binomial
. qui gsem (emr hosp <- $xlist), nbreg lclass(C 2)
>     startvalues(randomid, draws(5) seed(15)) vce(robust)

. estat lcprob

Latent class marginal probabilities             Number of obs = 4,355

                          Delta-method
                  Margin   std. err.     [95% conf. interval]
           C
           1    .7469986    .026645      .6913295    .7955961
           2    .2530014    .026645      .2044039    .3086705


For brevity, the estimation results from the gsem command are suppressed. The 
complete output yields in order a lengthy log file of over 100 iterations; the log 
likelihood; parameter estimates for the latent class probabilities; parameter 
estimates for the first latent class (in order for emr and then hosp); and 
parameter estimates for the second latent class. The estimated probabilities
of latent classes 1 and 2 are 0.75 and 0.25, respectively.


We then obtain the marginal means in the two classes. 


. * GSEM: Latent class means for two-component FMM bivariate negative binomial
. estat lcmean

Latent class marginal means                     Number of obs = 4,355

                          Delta-method
                  Margin   std. err.      z    P>|z|     [95% conf. interval]
1
         emr    .0550896   .0117289     4.70   0.000     .0321014    .0780778
        hosp    .0741005   .0128806     5.75   0.000     .0488551     .099346
2
         emr    .8513358   .0637465    13.36   0.000     .7263949    .9762767
        hosp    .9161608   .0683449    13.40   0.000     .7822072    1.050114


The second latent class (with probability 0.25) has mean usage rates of 0.851 
and 0.916 for the two outcomes, which are more than 10 times the rates of 
0.055 and 0.074 for the first latent class. 


As an example of an ME, we compute the AME of the number of chronic 
conditions. 


. * GSEM: MEs of numchron
. margins, dydx(numchron)

Average marginal effects                        Number of obs = 4,355
Model VCE: Robust

dy/dx wrt: numchron

1._predict: Predicted mean (Number of annual emergency room visits), using class
            probabiliti, predict(mu outcome(emr))
2._predict: Predicted mean (Number of hospital admissions), using class
            probabilities, predict(mu outcome(hosp))

                          Delta-method
                   dy/dx   std. err.      z    P>|z|     [95% conf. interval]
numchron
    _predict
          1     .0760105    .008604     8.83   0.000      .059147     .092874
          2     .0953072   .0087897    10.84   0.000     .0780797    .1125347


Both outcomes increase considerably with an additional chronic condition, and 
the effect is highly statistically significant. Emergency room visits increase by 
0.076 compared with a sample average of 0.260, and hospital admissions 
increase by 0.095 compared with a sample average of 0.289. 


23.7 ERM commands for endogeneity and selection 


The ERM commands are designed to deal with linear and some nonlinear
regression models that include one or more endogenous regressors and that
may, moreover, be subject to selectivity bias because of sample selection.


The commands provide ML estimates under the assumption of jointly
normally distributed errors, and with few exceptions, consistent parameter
estimation requires completely correct model specification. The ERM
commands use numerical integration with the method intmethod(mvaghermite),
the default, or intmethod(ghermite); see section 23.4.1.


23.7.1 Overview of ERM commands 


Table 23.3 provides a summary of the ERM commands for cross-sectional data 
and the complications that they deal with. The table also includes the 
corresponding xt commands, which provide extensions to panel data and 
clustered data by introducing a random effect to allow for intracluster 
correlation. 


Table 23.3. Summary of ERM commands and command options

Stata ERM command           Outcome model and type of outcome
eregress, xteregress        linear for continuous outcome
eprobit, xteprobit          probit for binary outcome
eoprobit, xteoprobit        ordered probit for ordered discrete outcome
eintreg, xteintreg          interval regression for interval data

Stata ERM command option    Complication considered
endogenous()                endogenous continuous regressors
select()                    sample selection
tobitselect()               tobit sample selection
extreat()                   exogenous discrete treatment
entreat()                   endogenous discrete treatment


The main features of the ERM commands are as follows: 


1. An ERM command can be used when the regression of interest is 
embedded in a triangular or recursive model and has right-hand-side 
endogenous variables for which a reduced-form-type model is specified. 
Thus, no structural feedback from the endogenous regressors to the
left-hand-side variable of interest is specified.

2. The full model is parametric, the joint distribution of the errors in the 
model is multivariate (bivariate) normal, and estimation is by ML methods 
that use quadrature integration methods. 

3. An ERM command is flexible because it allows one to simultaneously 
address multiple complications, such as both endogeneity and selection. 

4. When the model is linear but not recursive, it must be triangularized 
before one applies an ERM command. This happens when the specified 
model involves simultaneous dependence. The remedy is to eliminate 
some endogenous variables by substituting reduced forms. Triangularization
is not feasible for a nonlinear model, such as one involving, say, the probit
equation, because in this case there is no closed-form reduced-form model 
to substitute. For details, see [ERM] eregress. 


The flexibility of the ERM commands derives from the user’s ability to 
specify and fit a wider range of models by adding or modifying subcommands. 
For example, the command 


eregress y1 x, endogenous(y2 = z1 x) select(s1 = z2 x, probit)


fits a linear regression model with endogenous regressor y2 and an endogenous 
binary selection variable s1, where the instruments z1 for y2 and z2 for s1 are 
excluded from the outcome equation for y1. 
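
The model that this command describes can be sketched as a three-equation recursive system. The coefficient symbols and error names below are our own notation, introduced only for illustration:

$$
\begin{aligned}
y_{1i} &= \mathbf{x}_i'\boldsymbol\beta + \gamma y_{2i} + \varepsilon_{1i}, \qquad \text{observed only if } s_{1i} = 1, \\
y_{2i} &= \mathbf{x}_i'\boldsymbol\pi_1 + \mathbf{z}_{1i}'\boldsymbol\pi_2 + \varepsilon_{2i}, \\
s_{1i} &= \mathbf{1}(\mathbf{x}_i'\boldsymbol\delta_1 + \mathbf{z}_{2i}'\boldsymbol\delta_2 + \varepsilon_{3i} > 0), \\
(\varepsilon_{1i}, \varepsilon_{2i}, \varepsilon_{3i}) &\sim N(\mathbf{0}, \boldsymbol\Sigma).
\end{aligned}
$$

In this sketch, endogeneity of $y_2$ corresponds to a nonzero correlation between $\varepsilon_1$ and $\varepsilon_2$, and selection bias corresponds to a nonzero correlation between $\varepsilon_1$ and $\varepsilon_3$. The system is triangular because $y_1$ appears in neither of the other two equations.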


Simpler applications of the ERM commands are equivalent to other
model-specific commands. Thus, elsewhere in the book, ERM commands have been
used selectively as an ML-based alternative to several other commands, 
including ivprobit, ivtobit, and ivregress, liml. 


Table 23.4 presents some commands presented in preceding chapters and 
the corresponding ERM command. The prefix e tells the user that the command 
is an ERM command. The syntax has the equation specification followed by the 
subcommand that specifies the reduced form with the instrumental variables z 
and x. For command syntax, see the relevant help files. 


Table 23.4. Selected ERM commands for handling endogeneity and selection

Stata model command                   Stata ERM command equivalent

Linear regression with endogenous regressor
ivregress liml y1 x (y2 = z)          eregress y1 x, endogenous(y2 = z x)

Binary probit with endogenous regressor
ivprobit y1 x (y2 = z)                eprobit y1 x, endogenous(y2 = z x)

Tobit model with endogenous regressors
ivtobit y1 x (y2 = z),                eintreg y1_ll y1_ul x,
  ll(0) ul(#)                           endogenous(y2 = z x)

Linear regression with sample selection
heckman y x, select(s1 = x z)         eregress y x, select(s1 = x z)

Probit model with sample selection
heckprobit y x, select(s1 = x z)      eprobit y x, select(s1 = x z)

Ordered probit with sample selection
heckoprobit y x,                      eoprobit y x, select(s1 = x z)
  select(s1 = x z)


Postestimation commands for ERM commands include margins. If the
model is fit with the option vce(robust), then the vce(unconditional) option
of margins gives unconditional standard errors that additionally allow for
variation in the regressors; see section 13.7.9.
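
A minimal sketch of this combination follows. The variable names y1, y2, x, and z are placeholders in the style of table 23.4, not variables from any dataset used in this chapter.

```stata
* Sketch: AMEs with unconditional standard errors after an ERM command
* y1 (binary outcome), y2 (endogenous regressor), x, and z are placeholder names
eprobit y1 x, endogenous(y2 = z x) vce(robust)

* vce(unconditional) requires that the model was fit with vce(robust) or vce(cluster)
margins, dydx(x) vce(unconditional)
```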


A detailed application of eregress, entreat() is given in section 25.3.3,
where it is used to compute treatment effects for a continuous outcome when 
there are four ordered levels of treatment and these treatment levels are 
endogenous. 


23.7.2 The eprobit command 


As an example of the complications handled by ERM commands, we consider 
the simplest forms of the eprobit command and, where appropriate, compare 
the command with related Stata commands. 


The eprobit command without an option yields the same results as the 
probit command, so eprobit y x is equivalent to probit y x.


The eprobit, endogenous() command estimates by ML a probit model
with regressors that include one or more continuous endogenous regressors. 
The model is that given in section 17.9.2. The ivprobit command computes 
the same ML estimator and additionally provides the option of a two-step 
estimator but is restricted to only one endogenous regressor. 


The eprobit, select() command estimates by ML a probit outcome model
with the complication of binary selection. The probit outcome is observed only 
for a subset of the population determined by a second probit model with latent 
normal errors correlated across the two equations. The heckprobit command 
computes the same ML estimator. 


The eprobit, tobitselect() command is similar to the eprobit,
select() command, except that selection is determined by a continuous
variable restricted to the range $(l_i, u_i)$, rather than a binary indicator variable.


The eprobit, extreat() command applies the treatment-effects methods
of chapter 24 to a binary outcome. In this framework, there are one or more
treatment variables, and we fit a separate probit model for each distinct value
of the treatment variables. This is essentially the regression-adjustment model
of section 24.6.1 adapted to probit regression. The estat teffects
postestimation command can be used to calculate the average
potential-outcome probabilities at each treatment level and to compute average
treatment effects.


The eprobit, entreat() command extends the eprobit, extreat()
command by allowing binary treatment and ordered discrete treatments to be 
endogenous, in which case, respectively, binary probit or ordered probit models 
are used with multivariate normal errors correlated across the various 
associated latent variable models. 


The eprobit command options can be combined. For example, the
command eprobit y x1 x2, entreat(d = x3 x2) endogenous(y2 = x2 x4)
fits a treatment-effects probit model for the binary outcome y that is explained
by the endogenous discrete treatment d, the endogenous continuous regressor
y2, and the exogenous regressors x1 and x2. Instruments are x3 for d and x4 for
y2.


23.7.3 The eprobit command application


We consider a probit equation with a single endogenous continuous regressor.
We model whether someone has supplementary health insurance (ins) using
the dataset from section 17.9. The natural logarithm of income (linc) is treated
as being endogenous, with instruments the subject’s retirement status (retire)
and the retirement status of the spouse (sretire).


We first read in the data and define global macros for regressors in the
probit equation and additional regressors used in the reduced-form equation for
linc.


. * Read in data, define globals, and summarize key variables
. qui use mus217hrs, clear

. generate linc = log(hhincome)
(9 missing values generated)

. global xlist female age age2 educyear married hisp white chronic adl hstatusg

. global ivlist retire sretire


The MLE assuming joint normal errors is obtained using the eprobit,
endogenous() command. We have


. * Endogenous probit using eprobit ML estimator (an ERM command)
. eprobit ins $xlist, endogenous(linc = $xlist $ivlist) vce(robust) nolog

Extended probit regression                      Number of obs =  3,197
                                                Wald chi2(11) = 382.35
Log pseudolikelihood = -5407.7151               Prob > chi2   = 0.0000

                                Robust
                  Coefficient  std. err.       z    P>|z|    [95% conf. interval]
ins
          female    -.1394073    .049447    -2.82   0.005    -.2363216   -.0424929
             age     .2862296   .1280815     2.23   0.025     .0351945    .5372647
            age2    -.0021472   .0009318    -2.30   0.021    -.0039735   -.0003209
        educyear     .1136882   .0237908     4.78   0.000      .067059    .1603174
         married     .7058322   .2377541     2.97   0.003     .2398428    1.171822
            hisp    -.5094516   .1049486    -4.85   0.000    -.7151471   -.3037561
           white     .1563459   .1035658     1.51   0.131    -.0466394    .3593312
         chronic     .0061938   .0275247     0.23   0.822    -.0477536    .0601412
             adl    -.1347665   .0349799    -3.85   0.000    -.2033258   -.0662072
        hstatusg     .2341791    .070975     3.30   0.001     .0950707    .3732875
            linc    -.5338273   .3852044    -1.39   0.166    -1.288814    .2211594
           _cons    -10.00788   4.065762    -2.46   0.014    -17.97662   -2.039132
linc
          female    -.0976056   .0305475    -3.20   0.001    -.1574776   -.0377336
             age     .2664422    .073175     3.64   0.000     .1230217    .4098626
            age2    -.0019031   .0005309    -3.58   0.000    -.0029437   -.0008625
        educyear     .0947747   .0044497    21.30   0.000     .0860534     .103496
         married     .7847725    .040539    19.36   0.000     .7053176    .8642274
            hisp    -.2364977   .0506833    -4.67   0.000    -.3358351   -.1371603
           white      .232593   .0360103     6.46   0.000     .1620141     .303172
         chronic    -.0387668   .0103384    -3.75   0.000    -.0590297   -.0185039
             adl    -.0741455   .0180171    -4.12   0.000    -.1094584   -.0388326
        hstatusg      .174596   .0339594     5.14   0.000     .1080368    .2411552
          retire    -.0953103   .0283004    -3.37   0.001     -.150778   -.0398426
         sretire    -.0326619   .0308983    -1.06   0.290    -.0932216    .0278977
           _cons    -7.679616   2.517254    -3.05   0.002    -12.61334   -2.745889

     var(e.linc)     .5152062   .0240909                      .4700879    .5646548

    corr(e.linc,
          e.ins)     .5879572   .2355274     2.50   0.013     .0309703     .880964

The last line shows positive and statistically significant (at 5%) correlation 
between the structural and reduced-form residuals, so there is an endogeneity 
problem in this specification. 


The estimates show a perhaps surprising negative effect of linc on
insurance choice after controlling for other variables and the endogeneity of 
linc, though the coefficient is statistically insignificant at level 0.05. From 
results not listed here, regular probit estimation of the same model leads to 
linc having a positive coefficient of 0.347 and being much more precisely 
estimated with a standard error of 0.040. 
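
The comparison fits mentioned here could be obtained along the following lines. This is a sketch only: it assumes the data and the global macros xlist and ivlist defined above are in memory, and the choice of vce(robust) for the comparison models is our assumption rather than something stated in the text.

```stata
* Sketch: comparison fits for the insurance example (output not shown in the text)

* Regular probit that treats linc as exogenous
probit ins $xlist linc, vce(robust) nolog

* ML probit with linc endogenous, using ivprobit rather than eprobit
ivprobit ins $xlist (linc = $ivlist), vce(robust) nolog
```

Because ivprobit and eprobit, endogenous() are described in section 23.7.2 as computing the same ML estimator when there is a single endogenous regressor, the second command would be expected to reproduce the eprobit coefficient estimates up to numerical precision.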


Additional examples of the ERM commands appear in section 25.3 on 
treatment effects, where several flexible versions of the command are used to 
simultaneously handle endogenous regressors and endogenous treatment 
choice. 


23.8 Additional resources 


The various commands presented in this chapter have potentially many 
applications. For details and many examples, see the Stata reference 
manuals [FMM] Finite Mixture Models, [ME] Mixed Models, [SEM] Structural 
Equation Models, and [ERM] Extended Regression Models. For detailed 
coverage of QR, see chapter 15. 


For comprehensive coverage of mixed models, see Rabe-Hesketh and 
Skrondal (2022). For SEMs, a standard reference is Bollen (1989). For 
GSEMs, see Rabe-Hesketh, Skrondal, and Pickles (2004). The methodology 
of ERMs is presented in Roodman (2011). The reader is referred to the
earlier chapters dealing with ivregress and ivtobit models for empirical 
illustrations and encouraged to replicate those results using the ERM 
commands. 


23.9 Exercises 


1. Consider the gamma regression data used in example 1 of section 23.3. 
Estimate the gamma regression given the one-component conditional 
mean specification used in that example. Generate predicted values of 
expenditure, and using a two-way scatter diagram, compare them with 
observed values as a rough goodness-of-fit test. Are there any 
indications of lack of fit? Next, graph the kernel density of the 
residuals from this regression. Check for visual evidence of 
multimodality of the graph. 

2. Continuing with the expenditure data of the preceding exercise, fit a 
varying probability FMM2 model in which the latent class probability 
depends upon totchr (rather than female and age as in the example). 
Compare the fit of this model with that of the constant class 
probability FMM2 specification. 

3. Example 6 of section 23.3 concerns a point-mass (zero-inflated 
Poisson) model for count data. This model has two parts, the second 
one being the truncated Poisson model. Stata’s fmm prefix supports 
estimation of mixtures of truncated Poisson regression. Using the 
truncated-at-zero version of the same dataset and the same model 
specification, estimate the truncated Poisson regression, the FMM2, and 
FMM3 versions of the same. Select the best-fitting model according to 
AIC and BIC criteria, and compare the estimates of the AME of totchr on 
er visits in the three models. 

4. Consider the just-identified IV application of section 7.4.4 that uses
data from mus207mepspresdrugs.dta for older people in
U.S. Medicare. We wish to regress ldrugexp on an indicator variable
for access to additional employer or union-sponsored health insurance
(hi_empunion), number of total chronic conditions (totchr), and
socioeconomic variables age, female, blhisp, and linc. Variable
hi_empunion is endogenous, and we have single instrument ssiratio,
which is the ratio of an individual’s social security income to the
individual’s income from all sources. Obtain IV estimates using the
ivregress command. Obtain the equivalent estimates using the gsem
command. In both cases, obtain heteroskedastic-robust standard
errors. Do the two commands give identical coefficient estimates? Do
they give identical standard errors?

5. Reconsider the GSEM empirical application of section 23.6.7. Carry out
estimation and postestimation operations after changing the nbreg
option in the gsem command to poisson and applying the parametric
constraint to the numchron variable. Also, reestimate the specification
without constraints, and then apply the LR test of the constraints.


Chapter 24 
Randomized control trials and exogenous 
treatment effects 


24.1 Introduction 


This chapter and the next are concerned with identification and consistent 
estimation of treatment-effect (TE) parameters that measure the causal 
impact of a change in some controllable economic variable, henceforth 
referred to as the treatment variable, on some policy or program target 
variable, henceforth referred to as the outcome. We concentrate on 
empirical aspects of treatment evaluation, along with explanation of these 
methods. For more details on treatment evaluation, see Angrist and 


Cameron and Trivedi (2005, chap. 25). 


The topic of TEs is not new in this book. As a parameter that measures 
the impact of some manipulable variable, a TE is closely related to the 
marginal effect of a change in a variable. Marginal effects have been 
discussed at many places in the book and especially in section 13.7. 
However, much of that discussion was in the context of observational data 
and estimation based on either a fully parametric or a semiparametric 
regression model. 


The treatment evaluation literature focuses on the case where the 
treatment level is discrete valued and, furthermore, is most often binary, 
leading to comparison of treatment with control. An experimental 
viewpoint is taken as the starting point, rather than the usual econometrics 
starting point of regression. The potential-outcomes framework, under 
which an individual has several potential outcomes that vary with the 
treatment level, is used. The complication is that only one of these potential 
outcomes, corresponding to the treatment received, can be observed; the 
remaining potential outcomes are unobserved counterfactuals. 


A treatment is usually an intervention. Such an intervention may arise 
from entirely exogenous and unforeseen changes in variables, often referred 
to as “natural experiments”, that may be interpreted as “treatments”. 
Alternatively, an intervention may be designed and implemented in a 
specific setting with the objective of estimating a “pure” or uncontaminated
TE, a case referred to as a social experiment or a randomized control trial
(RCT). Finally, an intervention may be initiated by an agent whose response
we wish to study, as is usually the case for observational data.


RCTs follow the format of drug trials in which inference about the effect 
of an intervention is estimated by comparing the response of randomly 
selected and treated subjects with that of the randomly selected control 
individuals who are drawn from the same population as the treated subjects 
but who were not assigned to the treated group. In a well-designed RCT, the 
average difference in the outcome of the treated and control groups is 
attributed to the causal impact of the intervention. 


Randomization is a well-established and widely applied methodology in 
experimental settings that arise naturally in agriculture, biomedical 
sciences, psychology, and so forth. When suitable observational data are not 
available, an RCT is motivated partly to generate data and partly to overcome 
potential problems of observational data. More recently, such methods have 
been applied also in economics, especially in development economics; see 
Glennerster and Takavarasha (2013). 


Experimental data can be used to test for the impact of treatment using 
relatively simple statistical methods such as difference-in-means tests. This 
chapter first describes experimental design and assumptions and issues that 
affect the power of tests applied to experimental data; for the latter, see also 
section 11.7. Well-designed experiments attain high power, so we consider 
features of experiments that affect power of tests based on data from RCTs. 
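
As a minimal sketch of such a test, suppose the outcome is stored in y and the randomly assigned binary treatment indicator in D. Both names are placeholders and do not refer to variables in the dataset used later in this chapter.

```stata
* Sketch: difference-in-means test for a randomly assigned binary treatment
* y (outcome) and D (0/1 treatment indicator) are placeholder names

* Two-sample t test that allows unequal variances in the two groups
ttest y, by(D) unequal

* The same difference in means as the slope in a regression on D,
* with heteroskedasticity-robust standard errors
regress y D, vce(robust)
```

The regression form is convenient because additional regressors can be appended to it, which is the starting point for the regression-based methods discussed in this chapter.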


Even when experimental data are available, regression-based methods 
can have a useful role in estimating TEs. First, the inclusion of regressors 
may lead to more precise estimation of the TE. Second, actual 
implementation of an RCT can be complex and result in systematic 
differences in the individuals in treatment and control samples, in which 
case regression-adjustment (RA) methods such as inverse-probability 
weighting may be appropriate. 


We initially focus on an RCT in this chapter before studying TEs using 
observational data. With observational data, it is necessary to introduce 
control variables in addition to the treatment variable. Furthermore, to give 
a causal interpretation to estimates requires the very strong assumption that
sufficient controls are included in models so that conditional on these
controls, any selection into treatment is uncorrelated with the potential 
outcomes. This nontestable assumption of exogenous treatment goes by 
several names, including unconfoundedness, ignorability, and selection on 
observables. This chapter covers methods for both RCTs and for
observational data in the special case that selection into treatment is 
assumed to be exogenous. In chapter 25, we consider additional methods of 
treatment evaluation, including methods for when it is unreasonable to 
assume that selection is only on observables, so that the treatment is 
endogenous. 


The methods presented in this chapter are illustrated using a sample of 
data from a recent social experiment—the Oregon Health Insurance 
Experiment (OHIE). This RCT was a lottery whose winners were granted the 
option of applying for enrollment in Medicaid, the state health insurance 
program for low-income individuals. Interest lies in the impact on various 
health outcomes. 


24.2 Potential outcomes 


Treated and untreated outcomes cannot be simultaneously observed for a 
given individual; this fact is a major obstacle for identification and 
estimation of a causal parameter. It has been labeled the fundamental 
problem of causal inference; see Holland (1986). The concept of potential 
outcomes plays a key role in resolving the difficulty. 


Let D be the hypothesized cause and y the outcome. Let D change to
$D_1$ from $D_0$ and y change to $y_1$ from $y_0$. Then given observation of $y_1$, no
conclusion can be drawn about the causal impact without a hypothesis 
about what value y would have assumed in the absence of the change in D. 
This is referred to as the counterfactual, that is, the hypothetical unobserved 
value that forms the basis of comparison. This can also be viewed as a case 
of missing data. To resolve the difficulty, the investigator essentially 
generates the missing observations using a model. For the treated group, the 
missing data are realizations that would have been observed in the absence 
of treatment. For the untreated group, they are the realizations that would 
have been observed had a treatment been assigned. Given a dataset thus 
expanded, a comparison of such generated observations with observed data 
then forms the basis of counterfactual causality inference. 


Causal inference involves comparison of a factual with a counterfactual 
outcome. The so-called Rubin causal model (also called the potential-outcome
model) deals with causal parameters based on counterfactuals. In
that framework, the term “treatment” is used interchangeably with cause, 
and it is assumed that all members of the target population are potentially 
exposed to the treatment. 


The triplet $(y_{1i}, y_{0i}, D_i)$, $i = 1,\ldots,N$, forms the basis of treatment
evaluation. Initially, we consider the treatment D to be homogeneous and
binary valued, so D takes the values 1 and 0, respectively, when treatment
is or is not received; $y_{1i}$ measures the response for individual $i$ when
receiving treatment; and $y_{0i}$, when not receiving treatment. That is, the
observed outcome $y_i$ is


$$y_i = \begin{cases} y_{1i} & \text{if } D_i = 1 \\ y_{0i} & \text{if } D_i = 0 \end{cases}$$


Receiving and not receiving treatment are mutually exclusive states, so 
$(y_{1i} - y_{0i})$ is not observable for any given individual $i$ because only one of
the two measures is available, the unavailable hypothetical measure being 
the counterfactual. 


Hence, inference is instead made about the average effect of the 
treatment and not the effect on any single individual. The average causal 
effect of $D_i = 1$, relative to $D_i = 0$, is measured by the average treatment
effect (ATE), defined as 


$$\text{ATE} = E(y|D = 1) - E(y|D = 0)$$


where expectations are with respect to the probability distribution of the 
random component over the target population. The ATE can be consistently 
estimated by the sample counterpart $(\bar{y}_1 - \bar{y}_0)$, under the assumption that
each observed outcome includes an additive zero-mean random component 
that is uncorrelated with the treatment assignment. 
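
The missing-data nature of the problem can be made concrete with a small simulation. The following is a minimal sketch, not part of the original examples: the variable names, seed, sample size, and data-generating values are all hypothetical choices. Both potential outcomes are generated, only one is kept as the observed outcome, and the difference in means of the observed outcome is compared with the average of the individual TEs, which would be unobservable in real data.

```stata
* Hypothetical illustration: potential outcomes under random assignment
clear
set obs 10000
set seed 12345
generate y0 = rnormal(20, 5)           // potential outcome without treatment
generate y1 = y0 + 2 + rnormal(0, 1)   // potential outcome with treatment; true ATE = 2
generate D  = rbinomial(1, 0.5)        // randomly assigned treatment indicator
generate y  = D*y1 + (1 - D)*y0        // observed outcome: only one potential outcome is seen
generate te = y1 - y0                  // individual TE, unobservable in real data
summarize te                           // sample average of the individual TEs
regress y D, vce(robust)               // coefficient of D is the difference in means
```

Because D is assigned independently of the potential outcomes in this sketch, the coefficient of D should be close to the mean of te, up to sampling error.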


The ATE parameter can be estimated from models based on experimental 
data or observational data. The experimental approach involves a random 
assignment of treatment followed by a comparison of mean outcomes of the 
treated and control cases, using $(\bar{y}_1 - \bar{y}_0)$. With observational data, as
detailed throughout this chapter, additional assumptions are needed, and 
other individual-specific variables are added to serve as controls. In the 
potential-outcomes framework, only some variables are considered causal, 
while most are regarded as controls for extraneous variation in the outcome. 
An advantage of this framework is that the counterfactual can clearly 
indicate what should be compared. 


The ATE need not be the only parameter of interest. The ATE on the 
treated (ATET) is 


$$\text{ATET} = E(y_1|D = 1) - E(y_0|D = 1)$$


Other quantities of interest include quantile TEs, introduced in section 15.5 
and presented in section 24.11. 


24.3 Randomized control trials 


RCTs are often considered the gold standard against which one compares 
other econometric/statistical approaches to causal inference. The core appeal 
of the RCT methodology is based on two considerations—identifiability of 
the TE and computational simplicity. 


The strength of the RCT as a tool for causal inference is based on three strong
assumptions about treatment assignment; see Imbens and Rubin (2015, 
chap. 3). First, treatment assignment is individualistic, meaning a given 
subject’s probability of receiving treatment does not depend upon the 
characteristics, outcomes, or treatment assignments of the other subjects. By 
assumption, there are no spillover effects that could potentially contaminate 
the comparison of the treated and control groups. Second, all members of the 
target population can potentially receive the treatment. Hence, there is 
positive probability that a randomly selected member of the population can 
receive treatment. Third, the treatment assignment is independent of the 
potential outcome. This rules out self-selection under which individuals who 
stand to gain more by receiving treatment are more likely to self-select into 
it. 


It follows that if the RCT is properly planned and correctly implemented, 
the problem of selection bias is eliminated. In contrast, the problem of self- 
selection into treatment plagues analyses based on observational data. Given 
self-selection bias, the estimate of the TE is confounded with (that is, 
contaminated by) the self-selection effect, and the TE is then not identified. 
Under an RCT, the treatment is exogenously and randomly applied to a subset 
of participants whose responses can be compared with those of other 
individuals who could potentially have received the treatment but did not. 
Random assignment implies that treatment assignment does not depend upon 
the observable characteristics of an individual, and hence ignoring them will 
not generate biases. (If a blocked or stratified experimental design is used, 
then assignment within each block is random.) An estimate of the TE is based 
on such a comparison; the details follow in a later section. The RCT 
methodology ensures that the TE is not confounded with changes that 
emanate from sources unrelated to the treatment. 
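
The role of selection bias can be made explicit with a standard decomposition, given here as a sketch in the notation of section 24.2. The comparison of observed group means equals the ATET plus a term that captures differences in untreated outcomes between those who do and do not receive treatment:

```latex
\begin{aligned}
E(y|D=1) - E(y|D=0) &= E(y_1|D=1) - E(y_0|D=0) \\
&= \{E(y_1|D=1) - E(y_0|D=1)\} + \{E(y_0|D=1) - E(y_0|D=0)\} \\
&= \text{ATET} + \text{selection bias}
\end{aligned}
```

Under random assignment, D is independent of $(y_0, y_1)$, so the second term is zero; in addition, ATET then equals ATE.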


The second appealing feature of an RCT is that the estimation of the TE is 
computationally simpler than is typically the case with observational data. In 
the simplest of cases, estimation of the ATE parameter reduces to a 
comparison of the mean response of the treated and untreated groups. In 
contrast, the corresponding estimates based on observational data are usually 
based on (often nonlinear) regression that may invoke several auxiliary 
assumptions. 


This apparent simplicity of estimation in an RCT setting, however, 
becomes more complicated if the underlying assumptions and requirements 
of an ideal RCT are not achieved. Then, even in an RCT, alternative estimators 
may need to be used. 


24.3.1 Simple RCT setting of difference in means 


Consider a simplified setting for evaluating a treatment applied to a group of 
N subjects who voluntarily consent to participate in the trial. Of these, a 
randomly selected subset of size $N_1$ is assigned to the treatment, while $N_0$
subjects are assigned to the control group that receives no treatment. At the
end of the trial, all $N_1 + N_0$ subjects are tested, and the average score of the
treated group, denoted $\bar{y}_1$, is compared with the average of the untreated
control group, denoted $\bar{y}_0$, to see whether the TE is statistically significant.
The comparison is implemented using a two-sample t test.


The test of the null hypothesis of zero effect is a test of differences in 
means. Given independent and identically distributed (i.i.d.) data in the
treatment group and i.i.d. data in the control group, the test is based on the
statistic


$$t = \frac{(\bar{y}_1 - \bar{y}_0) - (\mu_1 - \mu_0)}{s_D} \qquad (24.1)$$


where $s_D$, the standard error of $(\bar{y}_1 - \bar{y}_0)$, is given by


$$s_D = \sqrt{s_1^2/N_1 + s_0^2/N_0} \qquad (24.2)$$


and $s_1^2 = \sum_i (y_{1i} - \bar{y}_1)^2/(N_1 - 1)$ and $s_0^2 = \sum_i (y_{0i} - \bar{y}_0)^2/(N_0 - 1)$ are,
respectively, the usual estimated variances of $y_1$ in the treated
group and $y_0$ in the control group. The null hypothesis sets $(\mu_1 - \mu_0) = 0$.
Large absolute values of $t$ lead to rejection of the null hypothesis.


The test statistic has an approximate Student's t distribution with $\nu$
degrees of freedom. The difference-in-means test is detailed in
section 3.5.12. The test can be implemented using the command ttest y,
by(D) unequal, which uses Satterthwaite's approximation that sets
$\nu = (s_1^2/N_1 + s_0^2/N_0)^2 / \{(s_1^2/N_1)^2/(N_1 - 1) + (s_0^2/N_0)^2/(N_0 - 1)\}$.
Alternatively, the test can be implemented as a t test of the coefficient of D
following the command regress y D, vce(robust), with $\nu = N_1 + N_0 - 2$.
In addition to this different value of $\nu$, the regress command leads to a
slightly different value of t because it uses a slightly different
degrees-of-freedom correction in obtaining $s_D$, the standard error of
$(\bar{y}_1 - \bar{y}_0)$; see section 3.5.12.
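
As a check on (24.1), (24.2), and Satterthwaite's formula, the quantities can be computed by hand and compared with the ttest output. The sketch below is illustrative only and uses the auto dataset shipped with Stata rather than data from this chapter; mpg plays the role of y and foreign the role of D, and all scalar names are hypothetical.

```stata
* Illustration with Stata's shipped auto data: mpg plays the role of y, foreign of D
sysuse auto, clear
quietly summarize mpg if foreign == 1
scalar N1 = r(N)
scalar ybar1 = r(mean)
scalar v1 = r(Var)
quietly summarize mpg if foreign == 0
scalar N0 = r(N)
scalar ybar0 = r(mean)
scalar v0 = r(Var)
scalar sD = sqrt(v1/N1 + v0/N0)            // standard error in (24.2)
scalar tstat = (ybar1 - ybar0)/sD          // statistic (24.1) under the null
scalar nu = (v1/N1 + v0/N0)^2 / ((v1/N1)^2/(N1 - 1) + (v0/N0)^2/(N0 - 1))
display "by hand: t = " tstat "   Satterthwaite df = " nu
ttest mpg, by(foreign) unequal             // reports mean(0) - mean(1), so t has the opposite sign
scalar t_ttest = r(t)
scalar df_ttest = r(df_t)
display "ttest:   t = " t_ttest "   df = " df_ttest
regress mpg foreign, vce(robust)           // same point estimate, slightly different standard error
```

The hand-computed t statistic should match the ttest value up to sign, and the degrees of freedom should agree; the regress standard error differs slightly for the reason given above.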


Use of the $t(\nu)$ distribution for inference is only an approximation. An
exact test of $H_0: \mu_1 = \mu_0$ based on the t statistic can be performed using a
permutation test; see section 11.10. To date, this has rarely been done in
econometrics studies. A notable exception is Young (2019), who applies
randomization inference to many published RCTs and, where relevant,
corrects for clustering.
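
One way to carry out such a permutation test in Stata is with the permute prefix. The following is a minimal sketch on hypothetical simulated data, not the procedure of section 11.10 or of Young (2019): the data-generating values, seed, and number of replications are arbitrary. The treatment indicator is reshuffled across observations, and the robust t statistic is recomputed in each replication. The `///` line continuation assumes the code is run from a do-file.

```stata
* Hypothetical data: randomization inference for the difference in means
clear
set obs 100
set seed 10101
generate D = rbinomial(1, 0.5)          // randomly assigned treatment
generate y = 20 + 2*D + rnormal(0, 5)   // outcome with true TE of 2
* Reshuffle D across observations and recompute the t statistic each time
permute D tstat = (_b[D]/_se[D]), reps(999) seed(10101) nodots: ///
    regress y D, vce(robust)
```

The reported p-value is the fraction of reshuffled samples whose t statistic is at least as extreme as the one observed. With clustered assignment, the reshuffling would need to respect the clusters, which this sketch does not do.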


24.3.2 Optimal sample size and power analysis 


Running an RCT can be expensive. The goal of experimental design is to 
choose treated and control groups large enough to detect whether treatment 
has an effect, allowing for inherent randomness in any experiment, but not 
unnecessarily large. The design depends in part on the variance of the 
outcome variable, a parameter about which accurate information may not be 
available prior to running the trial. Often, a pilot experiment is run to obtain 
a preliminary estimate of variance. 


An RCT tests the null hypothesis of zero impact of intervention against 
the alternative of a nonzero impact for a two-tailed test or a positive (or 
negative) impact if a one-tailed test is used. Such tests of hypotheses in the 
Neyman-Pearson framework are subject to both a type I error of rejecting $H_0$
when it is true and a type II error of failing to reject $H_0$ when it is false.


Table 24.1 summarizes the notation used for power analysis. Standard 
practice is to fix the probability of a type I error by using a conventional
significance level $\alpha$, often 0.05. The probability of a type II error,
denoted $\beta$, depends on the mean TE size,


$$\delta = \mu_1 - \mu_0 \qquad (24.3)$$


where $\mu_1 = E(y_1)$ is the mean in the treatment group and $\mu_0 = E(y_0)$ is the
mean in the control group. To reach useful conclusions about interventions,
one designs the RCT to achieve high power against a meaningful alternative,
where power $\pi = 1 - \beta$ equals one minus the probability of a type II error.
A conventionally preferred level of power is 80%, so $\beta = 0.20$ and $\pi = 0.80$.


Table 24.1. Size and power analysis terminology and notation 


Description                         Symbol
Total sample size                   $N$
Treated sample size                 $N_1$
Control sample size                 $N_0$
Treatment group (mean, variance)    $(\mu_1, \sigma_1^2)$
Control group (mean, variance)      $(\mu_0, \sigma_0^2)$
Treatment-effect size               $\delta = \mu_1 - \mu_0$
Significance level                  $\alpha$
Type II error probability           $\beta$
Power                               $\pi = 1 - \beta$


The RCT implementation requires a decision on the minimum detectable 
mean difference between the treatment and the control group (List, Sadoff, 
and Wagner 2011). This is essentially the minimum value of $\delta$ that the RCT
will be able to detect, given the desired significance level $\alpha$ and power
$(1 - \beta)$.


Just like the underlying t test, the power function varies according to
whether the variances $\sigma_1^2$ and $\sigma_0^2$ are known. The alternative hypothesis leads
to a translation of the mean. In the case of known variances, inference is
based on the normal distribution because a translated mean of the normal
leads again to a normal distribution and the power function involves the
normal distribution. In the case of unknown variances, inference is based on
the t distribution, and a translated mean of the t leads to a noncentral t
distribution. For economics applications, the variances are unknown and
need to be estimated, so we use the t statistic given in (24.1)-(24.2).


Let $T_{\nu,\lambda}(\cdot)$ denote the cumulative distribution function of a noncentral t
distribution with $\nu$ degrees of freedom and noncentrality parameter $\lambda$, and
let $t_{\nu,\alpha}$ denote the value that leaves area $\alpha$ in the right tail of the usual
central $t(\nu)$ distribution [that is, the $(1 - \alpha)$th quantile]. Then, for given
effect size $\delta$ and standard error $s_D$ of the mean difference $(\bar{y}_1 - \bar{y}_0)$
defined in (24.2), the power of a two-sided test of $H_0: \mu_1 = \mu_0$ of size $\alpha$
is defined by


$$\pi(\delta, s_D, \alpha) = 1 - T_{\nu,\lambda}(t_{\nu,\alpha/2}) + T_{\nu,\lambda}(-t_{\nu,\alpha/2})$$


where $\nu$ is Satterthwaite's degrees of freedom and
$\lambda = \delta/s_D = (\mu_1 - \mu_0)/s_D$ is the noncentrality parameter. For an upper
one-sided test, the power at size $\alpha$ is $1 - T_{\nu,\lambda}(t_{\nu,\alpha})$. And for a lower
one-sided test, the power at size $\alpha$ is $T_{\nu,\lambda}(-t_{\nu,\alpha})$.
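
The two-sided power formula can be evaluated directly with Stata's noncentral t function nt() and the critical-value function invttail(). The sketch below is illustrative: it uses the values of example 5 in section 24.3.4 (means 21 and 25, common standard deviation 6, 50 observations per group), and it sets $\nu = N_1 + N_0 - 2$ rather than Satterthwaite's value because power twomeans with a single sd() assumes equal variances. Scalar names are hypothetical.

```stata
* Power of a two-sided test from the noncentral t distribution
* Values follow example 5 of section 24.3.4: means 21 and 25, sd 6, 50 per group
scalar N1 = 50
scalar N0 = 50
scalar sD = sqrt(6^2/N1 + 6^2/N0)          // s_D with a common standard deviation
scalar lambda = (25 - 21)/sD               // noncentrality parameter
scalar nu = N1 + N0 - 2                    // equal-variance df, as power twomeans assumes
scalar tcrit = invttail(nu, 0.05/2)        // critical value t_{nu,alpha/2}
scalar pw = 1 - nt(nu, lambda, tcrit) + nt(nu, lambda, -tcrit)
display "power by hand  = " %6.4f pw
power twomeans 21 25, sd(6) n(100)
scalar pw_stata = r(power)
display "power twomeans = " %6.4f pw_stata
```

The second term of the formula, the probability of rejecting in the wrong tail, is negligible here, so the power is essentially the upper-tail probability.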


Power analysis in experimental design sets size $\alpha$ at a particular value.
There are then three types of power analysis. First, compute the power $\pi$
given specified $\delta$ and $N_1$, $N_0$, $s_1$, and $s_0$ (and hence $s_D$). Second, compute
the minimum effect size $\delta$ given specified power $\pi$ and $N_1$, $N_0$, $s_1$, and $s_0$
(and hence $s_D$). Third, compute the sample sizes $N_1$ and $N_0$ given specified
power $\pi$, effect size $\delta$, and the sample standard deviations $s_1$ and $s_0$.
Standard practice sets the significance level to 0.05 and desired power to 0.8 or
0.9.


24.3.3 Sample size and power calculations in Stata 


In Stata, power analysis for a two-means test is implemented using the power
twomeans command. The command has the following syntaxes, which are
similar to those for the power onemean command, presented in
section 11.8.1. To compute power, use


power twomeans m1 m2, n(numlist) [options]


To compute minimum effect size, use


power twomeans m1, n(numlist) power(numlist) [options]


And to compute sample sizes, use


power twomeans m1 m2 [, power(numlist) options]


where m1 and m2 are the means of control and treatment groups
corresponding to, respectively, $\mu_0$ and $\mu_1$ in the notation of this chapter.


The defaults set $\alpha = 0.05$ and $\pi = 0.80$ and use the $t(\nu)$ distribution.
The option knownsds instead uses the standard normal distribution.
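
With known variances, the required sample size has a simple closed-form approximation, $n \approx 2\sigma^2 (z_{\alpha/2} + z_{\beta})^2/\delta^2$ per group for equal group sizes and a common standard deviation $\sigma$. The sketch below evaluates this textbook approximation for the illustrative values $\delta = 3$ and $\sigma = 6$ and then runs the corresponding power twomeans commands; the scalar names are hypothetical, and the approximation ignores the negligible probability of rejecting in the wrong tail.

```stata
* Normal-based sample size per group for delta = 3, sd = 6, alpha = 0.05, power = 0.80
scalar za = invnormal(1 - 0.05/2)          // two-sided critical value
scalar zb = invnormal(0.80)                // quantile matching the desired power
scalar n_group = 2*6^2*(za + zb)^2/3^2     // textbook approximation, ignores the far tail
display "approximate n per group = " n_group
power twomeans 21 24, sd(6) knownsds       // normal-based calculation in Stata
power twomeans 21 24, sd(6)                // default t-based calculation for comparison
```

The t-based calculation requires a slightly larger sample because it allows for estimation of the variance.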


The preceding command is for continuous outcomes. For binary 
outcomes, one can instead use the power twoproportions command. Other 
commands for binary treatments are power pairedmeans, power
pairedproportions, power twovariances, power twocorrelations, and
power exponential.


24.3.4 Some examples


We provide a series of examples that are based on hypothetical scenarios in
which the treatment and control groups are equal in size, have the same
estimated variance, but have different means. Throughout, the size is set at
the default of 0.05. In most examples, we desire power of 0.80; that is,
the probability of a type II error is 0.20. This means that we want to reject
the null hypothesis of 0 TE in 80% of the cases when it is false. Under this
specification, the larger the difference in the means, the smaller the required
sample to achieve our power objective; that is, for a given sample size, a false
null is more easily rejected when the sample means are far apart.


Example 1 (sample size): We fix the control group mean at 21 but vary 
the treatment mean from 23 through 26; that is, the TE $\delta$ increases from 2
(less than 10%) to 5 (slightly less than 25%). 


. * (1) Required sample size when m1=21; m2=23,24,25,26; sd=6 
. power twomeans 21 (23(1)26), sd(6) 

Performing iteration ... 

Estimated sample sizes for a two-sample means test
t test assuming sd1 = sd2 = sd
H0: m2 = m1 versus Ha: m2 != m1


The larger the TE, other parameters equal, the smaller the required sample 
size to meet the power objective. The estimated sample size drops from 286
when $\delta = 2$ to 48 when $\delta = 5$. This implies that when the TE is small, a
relatively larger sample is required to detect it with precision. 


Example 2 (sample size): This example considers a TE of 3 units but 
increases the required power from 0.80 to 0.90. 


. * (2) Required sample size when m1=21; m2=24; sd=6; power=0.9
. power twomeans 21 24, sd(6) power(.9)

Performing iteration ...

Estimated sample sizes for a two-sample means test
t test assuming sd1 = sd2 = sd
H0: m2 = m1 versus Ha: m2 != m1

Study parameters:

        alpha =    0.0500
        power =    0.9000
        delta =    3.0000
           m1 =   21.0000
           m2 =   24.0000
           sd =    6.0000

Estimated sample sizes:

            N =       172
  N per group =        86


Relative to the example 1 calculation with power set at 0.80 when $\delta = 3$, the
estimated sample size is 172 instead of 128. Larger samples yield higher 
power, with other parameters unchanged. 


Example 3 (sample size): In this example, we consider the sample size
implications of a one-sided test. We repeat the calculation for a TE of 3 units
but use a one-sided alternative hypothesis. Because the power() option is not
specified, the command uses the default power of 0.80, so the relevant
comparison is with example 1 at $\delta = 3$ rather than with example 2. We expect
that a smaller sample will be required to achieve the same power.


. * (3) Required sample size when m1=21; m2=24; sd=6; power=0.8 (default); one-sided
. power twomeans 21 24, sd(6) onesided

Performing iteration ...

Estimated sample sizes for a two-sample means test
t test assuming sd1 = sd2 = sd
H0: m2 = m1 versus Ha: m2 > m1

Study parameters:

        alpha =    0.0500
        power =    0.8000
        delta =    3.0000
           m1 =   21.0000
           m2 =   24.0000
           sd =    6.0000

Estimated sample sizes:

            N =       102
  N per group =        51


The estimated total sample size falls from 128 to 102; a one-sided test is
more powerful.


Example 4 (minimum TE): The power equation can be used to calculate 
the minimum TE required to achieve $\pi = 0.80$ when N = 200 and the
control group mean is set at 21. 


. * (4) Required minimum TE size when m1=21; sd=6; N=200 
. power twomeans 21, sd(6) power(0.8) n(200) 


Performing iteration ... 


Estimated experimental-group mean for a two-sample means test
t test assuming sd1 = sd2 = sd
H0: m2 = m1 versus Ha: m2 != m1; m2 > m1

Study parameters:

        alpha =    0.0500
        power =    0.8000
            N =       200
  N per group =       100
           m1 =   21.0000
           sd =    6.0000

Estimated effect size and experimental-group mean:

        delta =    2.3888
           m2 =   23.3888


The estimated TE is 2.39, or around 11% over the base. 


Example 5 (power): In this example, we set the TE at 4 units (about 19% 
above the base level) and estimate the effect on power when the sample size 
is 100. 


. * (5) Power when m1=21; m2=25; sd=6; N=100 
. power twomeans 21 25, sd(6) n(100) 

Estimated power for a two-sample means test
t test assuming sd1 = sd2 = sd
H0: m2 = m1 versus Ha: m2 != m1

Study parameters:

        alpha =    0.0500
            N =       100
  N per group =        50
        delta =    4.0000
           m1 =   21.0000
           m2 =   25.0000
           sd =    6.0000

Estimated power:

        power =    0.9100


The estimated power is now 0.91. 


Example 6 (power): All else equal, a large variance in the outcomes 
reduces the power of the test. In this example, we raise the standard 
deviation by 50% to 9, from 6 in example 5. 


. * (6) Power when m1=21; m2=25; sd=9; N=100
. power twomeans 21 25, sd(9) n(100)

Estimated power for a two-sample means test
t test assuming sd1 = sd2 = sd
H0: m2 = m1 versus Ha: m2 != m1

Study parameters:

        alpha =    0.0500
            N =       100
  N per group =        50
        delta =    4.0000
           m1 =   21.0000
           m2 =   25.0000
           sd =    9.0000

Estimated power:

        power =    0.5950


The estimated power drops sharply from 0.91 to 0.60. 


Most of the results given above are intuitive. Large sample sizes,
one-sided tests, large ATEs, and low variability are all conducive to high power at
a given desired test size. Conversely, when the ATE is small, the probability of
a type II error can be high if the samples are noisy and participation low.
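
These tradeoffs can be tabulated in a single call because the n(), sd(), and power() options accept number lists. The following sketch reuses the hypothetical values of the preceding examples; the grids of sample sizes and standard deviations are arbitrary illustrative choices.

```stata
* Power across a grid of total sample sizes and two standard deviations
power twomeans 21 25, sd(6 9) n(40(20)200)
* Sample sizes for several effect sizes and two power levels
power twomeans 21 (23(1)26), sd(6) power(0.8 0.9)
```

Adding the graph option to the first command should plot the corresponding power curves instead of listing them in a table.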


24.3.5 Stratified randomization and clustering 


The simple randomization design considered above randomizes the 
treatment across subjects. In practice, several alternative designs are widely 
used. For example, rather than randomly selecting classes in a region to be 
treatment or control, we may restrict analysis to specific schools and, within 
each school, randomly select some classes to be treated and some to be 
controls. Usually, the researcher can choose whether to randomize at the 
individual level or the group level; see Banerjee and Duflo (2011). 


Randomization at the group level may be easier and less costly to 
implement. At the same time, however, this will generally lead to correlated 
or clustered observations, reducing precision, and variance calculations of 
the TE should adjust for clustering. Another drawback of randomization at 
the group level is that spillovers from treatment to comparison groups are 
more likely, biasing the estimation of TEs. For example, a teacher in a treated 
class may share resultant knowledge with a teacher in a control class. 


With cluster correlation, the size and power calculations need to be adjusted. We
assume nonzero intracluster correlation and zero intercluster correlation.
Such a dependence pattern could arise from cluster-specific shocks and
cluster-specific unobserved heterogeneity. The power twomeans options k1() and
k2() denote the number of clusters, and m1() and m2() denote the cluster sizes for,
respectively, control and experimental groups; kratio() equals k2/k1 and
mratio() equals m2/m1. The option rho() specifies the intracluster
correlation.


Example 7 (clustered sample): We suppose that observations are no 
longer independent. Instead, they are in clusters of 5 observations with 
intracluster correlation of 0.2 and independence across clusters. We expect to 
require larger sample sizes given the reduced information when observations 
are no longer independent. 


. * (7) Required sample size when m1=21; m2=24; sd=6; cluster size 5; rho=0.2 
. power twomeans 21 24, sd(6) m1(5) m2(5) rho(0.2)

Performing iteration ...

Estimated numbers of clusters for a two-sample means test
Cluster randomized design, z test assuming sd1 = sd2 = sd
H0: m2 = m1 versus Ha: m2 != m1

Study parameters:

        alpha =    0.0500
        power =    0.8000
        delta =    3.0000
           m1 =   21.0000
           m2 =   24.0000
           sd =    6.0000

Cluster design:

           M1 =         5
           M2 =         5
          rho =    0.2000

Estimated numbers of clusters and sample sizes:

           K1 =        23
           K2 =        23
           N1 =       115
           N2 =       115


The sample size has increased from 128 in example 1 with independent data 
to 115 + 115 = 230. 
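
The increase can be understood through the design effect $1 + (M - 1)\rho$, which inflates the sample size needed under independence when clusters have common size $M$. The sketch below is a back-of-the-envelope check under that standard approximation, using the normal-based sample size formula because the clustered calculation is a z test; it is not necessarily the exact algorithm used by power twomeans, and the scalar names are hypothetical.

```stata
* Back-of-the-envelope check of example 7 using the design effect 1 + (M - 1)*rho
scalar M = 5                               // cluster size
scalar rho = 0.2                           // intracluster correlation
scalar deff = 1 + (M - 1)*rho              // design effect
scalar n_indep = 2*6^2*(invnormal(0.975) + invnormal(0.80))^2/3^2   // per group, independent data
scalar K = ceil(n_indep*deff/M)            // clusters per group, rounded up
display "design effect = " deff
display "clusters per group = " K "   observations per group = " K*M
```

The design effect here is 1.8, and the calculation gives 23 clusters and 115 observations per group, in line with the output above.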


The community-contributed clsampsi command (Batistatou, Roberts,
and Roberts 2014) considers treatment at the cluster level in a random-effects
model $y_{ij} = \alpha D_j + x_{ij}'\beta + u_j + \varepsilon_{ij}$ that includes regressors. It
performs power calculations using the F distribution rather than the
chi-squared distribution. The user needs to provide the intracluster
correlation coefficient $\sigma_u^2/(\sigma_u^2 + \sigma_\varepsilon^2)$; an estimate can be obtained using
existing data and applying the loneway command to ordinary least-squares
(OLS) residuals.
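
A minimal sketch of this residual-based estimate follows. The data are hypothetical and simulated so that the cluster effect has variance 1 and the idiosyncratic error has variance 4, implying a true intracluster correlation of 0.2; the variable names, seed, and cluster counts are illustrative assumptions.

```stata
* Hypothetical clustered data: 50 clusters of 10 observations
clear
set seed 10101
set obs 50
generate clid = _n                       // cluster identifier
generate uj = rnormal(0, 1)              // cluster-specific effect, variance 1
expand 10                                // 10 observations per cluster
generate x = rnormal(0, 1)               // individual-level regressor
generate y = 1 + x + uj + rnormal(0, 2)  // idiosyncratic error, variance 4
regress y x
predict double ehat, residuals           // OLS residuals
loneway ehat clid                        // one-way ANOVA estimate of the intraclass correlation
scalar icc = r(rho)
display "estimated intracluster correlation = " icc
```

With only 50 clusters the estimate is noisy, which is one reason why inputs to power calculations are themselves uncertain.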


24.3.6 Limitations of size and power calculations 


While size and power calculations are regarded as essential aspects of RCT
design, practical application must overcome the difficulty that they require
typically unknown parameter values as inputs. Specifically, the variances of
the outcome for the treated and untreated groups are unknown inputs. So,
too, is the intracluster correlation if clustering is an issue. A partial solution to
the problem is to use estimates from previous studies as inputs into size and 
power calculations or to run a pilot study in which such information is 
collected; see Glennerster and Takavarasha (2013) for details. 


24.3.7 Covariate balance 


A randomized control experiment should ensure that covariates are 
reasonably balanced, meaning that individuals in the treatment and control 
groups are similar; in the case of a stratified design, this balance is within 
strata. Controlling for other covariates will not affect the consistency of the 
TE estimate, but it can reduce its variance. Hence, including valid 
pretreatment regressors (variables that potentially may impact outcome) in 
the regression will increase power. 


It is important that the covariates be balanced because, otherwise, 
differences between treatment and controls may simply be due to differences 
in covariates. In fact, the goal of regression decomposition methods given in 
section 4.6 is to quantify the relative contributions of observable 
characteristics and unobservable factors, where here the unobservable 
factors would be attributed to the TE. 


If covariates are unbalanced in an RCT, then stratification, blocking, 
weighting, trimming, and matching methods can be used to improve balance. 
However, simultaneous stratification in several dimensions can be difficult 
and may reduce sample size. Stratifying (or blocking) ex ante is more 
efficient than controlling ex post because it ensures an equal proportion of 
treated and untreated units within each block and therefore minimizes 
variance. 


In later sections of this chapter, we extend the analysis beyond 
randomized experiments. Then the treatment and control groups can be quite 
unbalanced, and balancing becomes essential. We present a range of 
methods that have been developed. 


24.4 Regression in an RCT 


The preceding section considered the case in which the RCT yields data on
$(y_{1i}, y_{0i}, D_i)$. Estimation of the ATE was based on comparison of the average
outcomes of treated and nontreated groups. The basic tool of estimation and 
inference was the two-sample test of difference in means; see section 24.3.1. 


Now we consider adding regressors as controls to obtain a more efficient 
estimate of the ATE. With perfect random assignment, these covariates are 
independent of the treatment, provided they are measured before the 
treatment, so there is no change in the estimate of the ATE. But if the 
covariates partially explain the outcome variable, then the variance of the 
residual is decreased, leading to more efficient estimation. 


24.4.1 Adding covariates


Recall from section 24.3.1 that the difference-in-means test can be implemented
by OLS regression of the outcome y on an intercept and the treatment
indicator D.


The simplest way to add covariates $x_i$ is to add them as regressors in the
OLS regression and fit the model


$$y_i = \alpha + \gamma D_i + x_i'\beta + u_i, \quad i = 1,\ldots,N \qquad (24.4)$$


Then a test of no TE is a test of the null hypothesis $H_0: \gamma = 0$.


An important unstated assumption used above is that x enters the 
regression linearly. In practice, one may know little about the correct 
functional form. Hence, the quest for greater power has to be balanced 
against the risk of misspecified functional form. 


It is important to ensure that the exogeneity assumption $E(u|x, D) = 0$
is satisfied. One way to ensure this is to restrict the set x to contain only
pretreatment variables. If posttreatment variables are included, there will be
a strong suspicion that they may be correlated with u.


In an RCT, x and D should be uncorrelated. However, some chance 
correlation in a finite sample cannot be ruled out. Because of sampling 
variation, the correlation between x and D in a given sample may not be 
exactly zero, but it will be approximately so in large samples. Therefore, the 
regression that includes x will yield an estimate of $\gamma$ very similar to that
from the regression without; both estimates are consistent.


If x has independent predictive value for y, then inclusion of these
regressors will usually lead to a more efficient estimate of $\gamma$. To see this
improvement in efficiency, consider the simplest case of default OLS standard
errors with variance matrix $s^2(Z'Z)^{-1}$, where Z includes an intercept, D, and
(possibly) x, and $s^2 = \sum_i \hat{u}_i^2/(N - K_z)$, with $K_z$ the number of columns
of Z. Adding covariates leads to little
change in the diagonal entry in $(Z'Z)^{-1}$ for D because by randomization, D
is essentially uncorrelated with x. If the regressors x make a significant
contribution to explaining the variation in y, then $\sum_i \hat{u}_i^2$ will be significantly
smaller, more than offsetting the decrease in the degrees of freedom
$(N - K_z)$. If the additional regressors x are irrelevant, however, then
efficiency can be lower.
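
A rough calculation indicates the size of the gain. It is offered as a sketch under the simplifying assumptions of homoskedastic errors and x independent of D, with $p$ denoting the fraction of the sample that is treated. Without covariates the regression error is $x'\beta + u$, whereas with covariates it is $u$, so that

```latex
\operatorname{Var}(\hat{\gamma}_{\text{no } x}) \approx \frac{\operatorname{Var}(x'\beta) + \sigma_u^2}{N p (1-p)},
\qquad
\operatorname{Var}(\hat{\gamma}_{\text{with } x}) \approx \frac{\sigma_u^2}{N p (1-p)},
\qquad
\frac{\operatorname{Var}(\hat{\gamma}_{\text{with } x})}{\operatorname{Var}(\hat{\gamma}_{\text{no } x})}
\approx \frac{\sigma_u^2}{\operatorname{Var}(x'\beta) + \sigma_u^2}
```

For instance, if $\operatorname{Var}(x'\beta) = 25$ and $\sigma_u^2 = 1$, as in the simulation design of section 24.4.4, the variance ratio is 1/26, so the standard error falls by a factor of about $\sqrt{26} \approx 5.1$.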


Worse still, if the additional regressors are invalid and correlated with D, 
then their inclusion will bias the estimate of $\gamma$; this may seem unlikely in a
carefully designed and implemented trial. This consideration might explain
why one might prefer the simple two-sample t test.


24.4.2 Covariates interacted with treatment status 


A richer model than (24.4) is one that fully interacts the regressors with 
treatment status:


$$y_i = \alpha + \gamma D_i + x_i'\beta + (D_i x_i)'\delta + u_i, \quad i = 1,\ldots,N \qquad (24.5)$$


Interpretation of the effect of treatment status is made more difficult
because of the interactions that introduce nonlinearity, discussed in
section 24.4.5. The model can be fit with the command regress y
i.D##c.x, where factor variables are used. The margins D command yields
the sample average predicted outcome when D = 1 and when D = 0, and
the ATE is the difference in these means. The ATE and its standard error can be
directly obtained using the margins, dydx(D) command.


The interaction model (24.5) can be shown to yield the same results as 
the RA method (see section 24.6.1) that runs two separate OLS regressions of 
y on x, one for the treated sample and one for the control sample; see 
Imbens and Rubin (2015, 127). 
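
The equivalence of the point estimates can be checked numerically. The following is a minimal sketch on hypothetical simulated data, in which the TE varies with x so that the interaction term matters; the data-generating values, seed, and scalar names are illustrative assumptions. The ATE from margins after the interacted regression is compared with the ATE from teffects ra with a linear outcome model.

```stata
* Hypothetical data: interaction model (24.5) and the equivalent RA estimate
clear
set obs 1000
set seed 10101
generate x = rnormal(20, 5)                  // pretreatment covariate
generate D = rbinomial(1, 0.5)               // randomly assigned treatment
generate y = 1 + x + 2*D + 0.1*D*(x - 20) + rnormal(0, 1)
regress y i.D##c.x, vce(robust)
margins D                                    // average predicted outcome for D = 0 and D = 1
margins, dydx(D)                             // ATE as the difference of the two predictions
matrix b = r(b)
scalar ate_margins = b[1, colsof(b)]         // last column holds the 1.D effect
teffects ra (y x) (D)                        // RA: separate linear fits for each group
scalar ate_ra = _b[ATE:r1vs0.D]
display "ATE from margins = " ate_margins "   ATE from teffects ra = " ate_ra
```

The two point estimates should coincide; the reported standard errors can differ because the two commands treat the sampling variation in x differently.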


24.4.3 Covariate balance 


When TEs are estimated under an RCT, or even more importantly, using 
observational data, attention is given to covariate balance, a principle that 
ensures that the treated and control groups are comparable in terms of 
covariates. Absent such balance, a potential exists for different 
interpretations and paradoxes, such as Simpson’s paradox. 


Under the regression approach, there is no explicit attempt to obtain 
covariate balance. Conditional upon including an appropriate set of 
covariates in the conditional mean function, it is assumed that we have 
covariate balance. This expectation may not be valid; some categories of 
individuals may not be equally present in treated and control groups. 
Inverse-probability weighting and matching methods, presented in 
section 24.6, seek to adjust for this imbalance. 


24.4.4 Simulation-based example of regression for an RCT 


In this section, we illustrate the usefulness of RA in the context of a 
simulation-based RCT example with randomly assigned treatment and a 
single covariate that is uncorrelated with treatment but is correlated with the 
outcome. We initially fit the simpler model (24.4) before moving to the 
interaction model (24.5). 


The variables (x, D) are generated independently. D is a Bernoulli draw
with probability 0.5, so assignment to treatment is random. The covariate x
is a draw from the $N(20, 5^2)$ distribution. The dependent variable is
generated as $y = 1 + x + 2D + u$, where u is a standard normal draw. The
sample size is set at 100.


. * Simulated RCT: Exogenous covariate x; similar treat & control sample size
. qui set obs 100
. set seed 10101
. generate x = rnormal(20,5)            // Exogenous regressor
. set seed 10102
. generate D = rbinomial(1,0.5)         // Treatment assignment
. set seed 10103
. generate u = rnormal(0,1)             // Model error
. generate y = 1 + x + 2*D + u          // Outcome varies with treatment
. summarize x D y u

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           x |        100    20.27752    5.234358   4.146819   34.05265
           D |        100         .48    .5021167          0          1
           y |        100    22.33587    5.227955   7.290004   35.98082
           u |        100    .0983549    .9576001   -2.26003   2.745963

. pwcorr x D y u, star(0.05)

             |        x        D        y        u
-------------+------------------------------------
           x |   1.0000
           D |  -0.1034   1.0000
           y |   0.9656*  0.0845   1.0000
           u |  -0.0860  -0.0224   0.0927   1.0000


The data summary shows that in spite of the design, there is slight
correlation (-0.103) between x and D and slight correlation (-0.086)
between x and u, though these correlations are not statistically different
from 0 at level 0.05. By construction, there is high correlation between y and 
x, so adding x as a control will greatly improve precision. 
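
The chance imbalance in x can be examined directly. The sketch below regenerates x and D with the same commands and seeds as above and then runs two common balance diagnostics: a t test of equal covariate means and the standardized difference in means. The choice of diagnostics and the scalar names are illustrative assumptions rather than part of the original example.

```stata
* Balance check on the simulated data (same commands and seeds as above)
clear
quietly set obs 100
set seed 10101
generate x = rnormal(20,5)
set seed 10102
generate D = rbinomial(1,0.5)
ttest x, by(D) unequal                       // test of equal covariate means
* Standardized difference in means, a scale-free balance measure
quietly summarize x if D == 1
scalar xbar1 = r(mean)
scalar var1 = r(Var)
quietly summarize x if D == 0
scalar xbar0 = r(mean)
scalar var0 = r(Var)
scalar stddiff = (xbar1 - xbar0)/sqrt((var1 + var0)/2)
display "standardized difference = " stddiff
```

A standardized difference that is small in absolute value indicates that the treated and control groups are similar in x, whatever the units of x.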


Difference in means 


The analysis begins with a difference-in-means test. 


. * ATE: Use ttest command (no controls) 
. ttest y, by(D) unequal 

Two-sample t test with unequal variances 

   Group |  Obs       Mean   Std. err.   Std. dev.   [95% conf. interval] 
       0 |   52   21.91365   .7119351    5.133837    20.48438    23.34291 
       1 |   48   22.79328   .7713672    5.344189    21.24149    24.34507 
Combined |  100   22.33587   .5227955    5.227955    21.29853    23.37321 
    diff |       -.8796343   1.049695                 -2.9631    1.203831 

    diff = mean(0) - mean(1)                                 t = -0.8380 
H0: diff = 0                  Satterthwaite's degrees of freedom = 96.5876 

    Ha: diff < 0            Ha: diff != 0            Ha: diff > 0 
 Pr(T < t) = 0.2021    Pr(|T| > |t|) = 0.4041    Pr(T > t) = 0.7979 


In the generated sample, 48 of the 100 individuals were randomly assigned to 
treatment. The estimated ATE is 22.79 - 21.91 = 0.88, compared with the true 
value of 2. The ttest command uses the second group, here the treated group 
with D = 1, as the reference group. This leads to the sign reversal and 
estimated difference of -0.88. 


Virtually identical results are obtained by OLS regression of y on D and 
an intercept. 


. * ATE: OLS regress y on D (no controls) 
. regress y D, vce(robust) 


Linear regression Number of obs = 100 
F(1, 98) = 0.70 
Prob > F = 0.4041 
R-squared = 0.0071 
Root MSE = 5.2358 
Robust 
y | Coefficient std. err. t P>|t| [95% conf. interval] 
D .8796343 1.049643 0.84 0.404 -1.203348 2.962617 
_cons 21.91365 .7122145 30.77 0.000 20.50028 23.32701 


As expected, the estimated coefficient of D is again 0.8796343, with 
standard error 1.049643, which differs from the 1.049695 reported by the 
ttest command only in the sixth significant digit because of a slightly 
different degrees-of-freedom correction. The 95% confidence interval 
includes 0. The low R² of 0.007 confirms that the test will have low power. 


Adding covariates 


The two-sample t test given above shows that there is very weak evidence of 
a statistically significant TE, with p = 0.40. The main problem is that the test 
has low power because the underlying relationship between y and D is noisy 
with correlation 0.0845. 


Adding x, which is highly correlated with y though uncorrelated with D, 
would greatly improve the fit of the model and the power of the test given 
the data-generating process (DGP) of this example. We begin with the simpler 
model (24.4). 


. * ATE: OLS regress y on D and x added as control 
. regress y x D, vce(robust) 


Linear regression Number of obs = 100 
F(2, 97) = 1752.27 
Prob > F = 0.0000 
R-squared = 0.9667 
Root MSE = .96335 
Robust 
y | Coefficient std. err. t P>|t| [95% conf. interval] 
x .9836596 .0167009 58.90 0.000 .9505129 1.016806 
D 1.939621 .1918818 10.11 0.000 1.558788 2.320453 
_cons 1.45868 .3822349 3.82 0.000 .7000495 2.217311 


This boosts the R² to 0.97 and produces a TE estimate of 1.9396 that is 
statistically significant at level 0.05. It is also not significantly different from 
the DGP value of 2 at level 0.05 because the 95% confidence interval includes 
2. 


Adding interaction of treatment status with covariates 


We now fit the regression model (24.5) with regressors interacted with the 
treatment indicator using factor-variable operators. 


. * OLS regress y on D and x and interaction of D and x 
. regress y i.D##c.x, vce(robust) noheader 


Robust 
y | Coefficient std. err. t P>|t| [95% conf. interval] 
1.D 1.643602 .7424357 2.21 0.029 .1698785 3.117325 
x .9764098 .0278316 35.08 0.000 .9211645 1.031655 
D#c.x 
1 .014617 .0343537 0.43 0.671 -.0535746 .0828087 
_cons 1.609437 .6047906 2.66 0.009 .4089372 2.809937 


The margins command using the predictive margin method of 
section 13.6 yields the sample average predictions when D = 1 and D = 0. 


. * Average predicted outcomes for D=0 and D=1 from interactive regression 
. margins D // POMs 


Predictive margins Number of obs = 100 
Model VCE: Robust 


Expression: Linear prediction, predict() 


Delta-method 
Margin std. err. t P>|t| [95% conf. interval] 
D 
0 21.4086 .1467133 145.92 0.000 21.11738 21.69983 
1 23.3486 .1252627 186.40 0.000 23.09996 23.59725 


The estimated ATE is the difference 23.3486 - 21.4086 = 1.9400. 


The margins, dydx(D) command directly yields the ATE estimate. 


. * ATE: Using margins, dydx command after interactive regression 
. margins, dydx(D) // ATE 


Average marginal effects Number of obs = 100 
Model VCE: Robust 


Expression: Linear prediction, predict() 
dy/dx wrt: 1.D 


Delta-method 
dy/dx std. err. t P>|t| [95% conf. interval] 
1.D 1.939999 .1929133 10.06 0.000 1.557069 2.322929 


Note: dy/dx for factor levels is the discrete change from the base level. 


The estimate of 1.9399 is very close to the 1.9396 obtained using the simpler 
model (24.4), and the precision is similar, with standard error 0.193 
compared with 0.192. This is expected because the interaction term did not 
appear in the DGP of this example. The estimate is statistically significant 
at level 0.05 and is not significantly different from the DGP value of 2. 


The same results are obtained using the RA method of section 24.6.1 and 
are more directly obtained using the teffects ra command, detailed in 
section 24.7 and illustrated in section 24.9. We have 


. * ATE: Equivalent results using teffects ra command 
. teffects ra (y x) (D), ate nolog 


Treatment-effects estimation Number of obs = 100 
Estimator : regression adjustment 
Outcome model : linear 
Treatment model: none 

Robust 
y | Coefficient std. err. z P>|z| [95% conf. interval] 
ATE 
D 
(1 vs 0) 1.939999 .1890464 10.26 0.000 1.569475 2.310523 
POmean 
D 
0 21.4086 .5289239 40.48 0.000 20.37193 22.44527 


The same estimate of 1.93999 is obtained, while the standard error differs 
slightly because of estimation of two separate equations for treated and 
untreated rather than a single equation for both treated and untreated. The 
option atet would yield the estimated ATET. 


24.4.5 Marginal effects and nonlinearity 


In section 13.7, we explored the connection between TEs and marginal 
effects. This is a fruitful connection to revisit because Stata’s powerful 
margins command provides a convenient way to estimate TEs in nonlinear 
models. 


The regression framework is very convenient for estimating TEs even if 
the relevant data are derived from an RCT. However, in most cases, this step 
should accommodate complications that result from relaxing distributional 
and functional form assumptions. 


In the simpler linear model (24.4) with 
$y_i = \alpha + \gamma D_i + \mathbf{x}_i'\boldsymbol{\beta} + u_i$, the 
ATET, ATE, and marginal effect are constant and equivalent, equaling $\gamma$. 
In nonlinear models, irrespective of the assignment mechanism, this 
equivalence does not hold. Given nonlinearities in variables or parameters, 
coefficient estimates are harder to interpret. The marginal TE provides the 
building block for TEs. 


A marginal TE measures the effect on the conditional mean of y of a 
change in treatment variable D, which can be calculated by calculus or finite 
differences depending on whether D is continuous or discrete. When the 
marginal TE is a function, as in nonlinear models, the function has to be 
evaluated using one of three common points of evaluation: 1) at sample 
values of regressors and then averaged; 2) at the sample mean of the 
regressors; and 3) at representative values of the regressors. These tasks can 
be accomplished using Stata’s powerful margins postestimation command, 
as shown in the example in the preceding subsection. 
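The three points of evaluation can be sketched for a nonlinear model as follows. This is a minimal illustration on simulated data, not part of the original example: the logit DGP, the coefficient values, and the representative values of x are all assumptions chosen for illustration.

```stata
* Hypothetical DGP: binary outcome y, randomized binary treatment D, covariate x
clear
set seed 10101
set obs 2000
generate x = rnormal(0, 1)
generate D = rbinomial(1, 0.5)
generate y = runiform() < invlogit(-0.5 + 0.8*D + x)

* Logit model with the treatment entered as a factor variable
logit y i.D c.x, vce(robust)

* 1) TE evaluated at sample values of x and then averaged
margins, dydx(D)
* 2) TE evaluated at the sample mean of x
margins, dydx(D) atmeans
* 3) TE evaluated at representative values of x (values chosen arbitrarily)
margins, dydx(D) at(x = (-1 0 1))
```

Because D is specified as i.D, margins reports the finite difference in predicted probabilities between D = 1 and D = 0 rather than a derivative. In a nonlinear model, the three evaluation points generally give different answers.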


Another useful tool for computing TEs in nonlinear models is to use the 
predict postestimation command to generate individual-level predictive 
means (PMs) and predictive margins. Given estimated regression 
$\widehat{y} = \mathbf{x}'\widehat{\boldsymbol{\beta}}$, the estimated 
conditional mean 
$\widehat{E}(y|\mathbf{x} = \mathbf{x}^*) = \mathbf{x}^{*\prime}\widehat{\boldsymbol{\beta}}$ 
is called the PM. It is convenient to create profiles of the PM by evaluating 
it at specified values of x, say, x*, and then contrasting these across 
treated and control groups. Such PM calculations are easy and flexible to 
produce. More details are provided in section 13.6. 


24.4.6 Cluster correction 


If the RCT design involves randomization across clusters, appropriately 
defined, then (24.4) becomes 


$$ y_{ig} = \alpha + \gamma D_{ig} + \mathbf{x}_{ig}'\boldsymbol{\beta} + \nu_g + u_{ig}, \quad i = 1, \ldots, N_g, \quad g = 1, \ldots, G $$


where each observation is a member of one of G clusters and the gth cluster 
contains $N_g$ observations, so $N = \sum_g N_g$. 


Here the presence of a cluster-specific random effect $\nu_g$ implies that the 
outcomes are conditionally correlated within each cluster but not between 
clusters. As noted in section 24.3.6, the cluster-induced correlation will 
reduce the efficiency of estimation of $\gamma$. The clustered power 
calculations assumed equicorrelation within cluster, consistent with a 
random-effects model with $\nu_g$ and $u_{ig}$ i.i.d. 


For estimation, however, we can use the usual cluster-robust correction. 
The teffects commands include a vce(cluster clustvar) option, where 
the unique cluster identifier is defined by the stratum level at which the 
randomization was done. This approach was adopted when estimating 
cluster-robust standard errors in section 6.7. Any other form of stratified 
randomized treatment can be handled in the same way. 
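A minimal sketch of the cluster correction on simulated data follows. The DGP is an assumption made for illustration: 50 clusters of 20 observations, treatment randomized at the cluster level, and a cluster-specific random effect (named nu here, a hypothetical variable name).

```stata
* Hypothetical clustered RCT: treatment assigned at the cluster level
clear
set seed 10101
set obs 50
generate g = _n                       // cluster identifier
generate nu = rnormal(0, 1)           // cluster-specific random effect
generate D = rbinomial(1, 0.5)        // cluster-level treatment assignment
expand 20                             // 20 observations per cluster
generate x = rnormal(20, 5)
generate y = 1 + x + 2*D + nu + rnormal(0, 1)

* Standard errors ignoring and then allowing for within-cluster correlation
regress y x D, vce(robust)
regress y x D, vce(cluster g)

* Regression adjustment with cluster-robust standard errors
teffects ra (y x) (D), ate vce(cluster g)
```

With treatment varying only across clusters, the cluster-robust standard error of the coefficient of D would be expected to exceed the heteroskedastic-robust one, reflecting the efficiency loss described above.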


24.4.7 Design-based inference 


Throughout this book, we have used sampling-based inference based on 
random sampling of data such as $(y_i, D_i, \mathbf{x}_i)$ or 
$(\mathbf{y}_g, \mathbf{D}_g, \mathbf{X}_g)$. 


Design-based inference additionally controls for randomness due to 
treatment assignment. Abadie et al. (2020) consider the independence case 
and find that the usual sampling-based heteroskedastic-robust standard 
errors can be somewhat conservative. For clustered experiments, Abadie 
et al. (2022) find that in some settings the usual sampling-based 
cluster-robust standard errors can be exceptionally conservative and propose 
two alternative standard error estimators in the case of a binary treatment. 
The difference in standard errors arises when treatment assignment varies 
within each cluster and average treatment effects differ across clusters. In 
that case, the difference additionally increases with the number of 
observations per cluster and the fraction of the population clusters that are 
sampled. 


When randomized treatment is assigned solely at the cluster level, Su 
and Ding (2021) propose design-based inference for a variety of estimators, 
such as weighted regression estimators based on cluster averages. 


In this chapter, we consider only sampling-based inference. For RCTs, it 
provides inference that is valid, albeit potentially quite conservative in some 
clustered settings. The design-based approach is not presented in detail here 
because of its newness at the time of writing this text. 


24.4.8 RCT as a “gold standard” of treatment evaluation 


The RCT methodology has been considered a benchmark for judging methods 
based on observational data. However, it has also drawn serious criticism; 
see, for example, Deaton (2010). 


In addition to the limitations mentioned in earlier sections, critics of RCTs 
emphasize that, like several other approaches such as the potential-outcomes 
model used for experimental and observational data, the approach has a 
black-box character. The analysis focuses on estimating the size of the 
impact but not on the mechanism or the process through which the impact 
comes about. Whereas an RCT may confirm that the intervention “worked” 
(or not), it may be less informative about whether it might also work in 
different or larger environments, which is an issue of external validity. 
Econometric models that incorporate additional information about the 
mechanism behind the impact can, if reasonably correctly specified, provide a 
better basis for prediction and extrapolation. 


24.5 Treatment evaluation with exogenous treatment 


The presentation so far has focused on treatment evaluation for RCTs. Now 
we consider extension to a wider range of settings, where treatment 
assignment is not necessarily purely random or purely random within strata. 


The idea is to control for individual characteristics in such a way that it 
is possible to interpret the difference in average outcomes across treatment 
and control groups or subgroups as being the causal effect of the treatment, 
even if data are observational rather than generated by an RCT. This greatly 
widens the possible application of treatment evaluation methods. It rests on 
assumptions, some nontestable, that we now detail. 


Using the potential-outcomes framework, for each individual, we 
observe treatment D, covariates x, and one of the two potential outcomes 
$y_0$ (if not treated) and $y_1$ (if treated). 


To identify TEs, we assume the following: 


1. Conditional independence assumption: Conditional on x, the 
treatment assignment and potential outcomes are independent, written 
as 


$$ y_0, y_1 \perp D \mid \mathbf{x} $$


2. Overlap or matching assumption: For each value of x, there are both 
treated and nontreated cases, written as 


$$ 0 < \Pr(D = 1 \mid \mathbf{x}) < 1 $$


Assumption 1 generalizes purely random assignment for which 
$y_0, y_1 \perp D$ and is also referred to as the unconfoundedness assumption, 
the ignorability assumption, or the selection-on-observables assumption. It is 
equivalent to the assumption that after control for covariates x, there are no 
omitted variables that may lead to correlation between treatment and 
potential outcomes. For example, if people self-select into a training 
program, believing it will increase their earnings, then it is assumed that 
this self-selection is fully captured by the covariates (observable 
characteristics) and not by unobservables also correlated with the potential 
outcomes. Consequently, there is assumed to be no selection bias or 
endogeneity bias. The contrary case of selection on unobservables is 
considered in the next chapter. 


The unconfoundedness assumption is nontestable, analogous to 
nontestability of the validity of the instrument in a just-identified model. In 
any particular application, especially one using observational data, one 
needs to explain why this is a reasonable assumption. 


In some cases, assumption 1 can be relaxed to $y_0 \perp D \mid \mathbf{x}$, 
which implies conditional independence of participation and $y_0$, or relaxed 
even further to the conditional mean independence assumption 


$$ E(y_1 \mid D = 1, \mathbf{x}) = E(y_1 \mid D = 0, \mathbf{x}) = E(y_1 \mid \mathbf{x}) $$

$$ E(y_0 \mid D = 1, \mathbf{x}) = E(y_0 \mid D = 0, \mathbf{x}) = E(y_0 \mid \mathbf{x}) \qquad (24.6) $$


which implies that the mean potential outcomes are unrelated to the 
treatment assignment. 


Assumption 2 can be interpreted to mean that every unit in the sample 
has a positive probability of receiving treatment and there are no units that 
are certain to be treated or to be not treated. Furthermore, this probability is 
bounded away from 0 and 1, so that the regressors x do not completely 
determine assignment to treatment or control. The probability of treatment, 
Pr(D = 1|x), is called the propensity score. 


To these assumptions, we also add an assumption that there are no 
general equilibrium effects due to the policy intervention. The stable 
unit-treatment value assumption is that an individual’s potential outcome 
does not vary with treatments assigned to other individuals. 


Define $\delta$ as the difference between the outcome in the treated and 
untreated states for an individual, a quantity that is not observable. Then 
the ATE is the expected value of $\delta$, and the ATET is the expected value 
of $\delta$ given treatment. We have 


$$ \delta = y_1 - y_0 $$
$$ ATE = E(\delta) $$
$$ ATET = E(\delta \mid D = 1) $$


The ATE parameter is identified under assumptions 1 and 2. The ATET 
parameter is identified under conditional mean independence (a weaker 
version of assumption 1) and assumption 2. 


Assumptions 1 and 2 hold in a well-designed and executed RCT. They 
might additionally hold in nonexperimental settings that rely on 
observational data. Hence, the methods considered in the following sections 
will also apply to observational data when a convincing case can be made 
that selection on unobservables is not a major hurdle. 


For a continuous outcome, the OLS estimator is consistent under the 
assumptions listed here, provided the covariates x enter in a sufficiently 
flexible way. The regression approach is not particularly robust, however, 
especially if there is substantial difference in average characteristics of 
treated and control individuals, as is often the case for studies using 
observational data, because it requires correct specification of E(y|D,x). 
We now present several methods that promise more robust estimates of TEs. 


24.6 Treatment evaluation methods and estimators 


We present the leading methods for estimating TEs, where TEs are 
heterogeneous, meaning that different individuals may have a different 
response to treatment, so the goal is to estimate ATEs. 


These methods fall into the broad classes of regression, inverse- 
probability weighting, doubly robust, and matching. We also consider the 
complications of regressor balance, propensity-score overlap, and blocking. 
The Stata teffects commands that implement these methods are presented 
in section 24.7 and are illustrated in section 24.9. 


We present formulas for binary treatment; see [TE] Stata Treatment- 
Effects Reference Manual for more general formulas that cover multiple 
treatments. For models of potential outcomes, we focus on predictions using 
linear regression. More generally, other models might be used. Then 
$\mathbf{x}_i'\widehat{\boldsymbol{\beta}}_1$ and 
$\mathbf{x}_i'\widehat{\boldsymbol{\beta}}_0$ are replaced by predictions 
$\widehat{y}_{1i}$ and $\widehat{y}_{0i}$ for continuous outcomes, predicted 
conditional means $\widehat{E}(y_{1i}|\mathbf{x}_i)$ and 
$\widehat{E}(y_{0i}|\mathbf{x}_i)$ for counts, and predicted probabilities 
for binary and ordered discrete outcomes. 


24.6.1 Regression adjustment 


A richer model than (24.3), called RA, runs two separate OLS regressions of y 
on x, one for the treated sample and one for the control sample. 


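$$ E(y_{1i}) = \mathbf{x}_i'\boldsymbol{\beta}_1 \quad \text{if } D_i = 1 $$
$$ E(y_{0i}) = \mathbf{x}_i'\boldsymbol{\beta}_0 \quad \text{if } D_i = 0 $$


These models are then used to predict outcomes for all observations, not just 
for the subsample used in estimation. For example, the first regression is 
used not only to predict $y_{1i}$ for treated observations, in which case 
$y_{1i}$ is observed, but also to predict $y_{1i}$ for control observations, 
in which case $y_{1i}$ is not observed. So we are imputing the unobserved 
counterfactual. 


Denote these predictions as, respectively, $\widehat{y}_{1i}$ and 
$\widehat{y}_{0i}$. The averages of these estimates are estimates of the 
potential-outcome means (POMs). The estimated ATE is the average of the 
difference between the two estimated POMs. And the estimated ATET is the 
average of the difference between the two estimated POMs when we average 
only over the $N_1$ observations that were treated. 


$$ \widehat{POM}_1 = \frac{1}{N}\sum_{i=1}^{N}\widehat{y}_{1i} $$

$$ \widehat{POM}_0 = \frac{1}{N}\sum_{i=1}^{N}\widehat{y}_{0i} $$

$$ \widehat{ATE} = \frac{1}{N}\sum_{i=1}^{N}\widehat{y}_{1i} - \frac{1}{N}\sum_{i=1}^{N}\widehat{y}_{0i} $$

$$ \widehat{ATET} = \frac{1}{N_1}\sum_{i:D_i=1}\widehat{y}_{1i} - \frac{1}{N_1}\sum_{i:D_i=1}\widehat{y}_{0i} $$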


The teffects ra command provides these regression-adjusted estimates. 
The default reports just the ATE estimates; option pomeans reports the 
estimated POMs, and the atet option reports the estimated ATET. The same 
estimates can be obtained using the eregress command and the estat 
teffects postestimation command; see section 24.7. 
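The RA calculation can be reproduced by hand, which makes the imputation of counterfactuals explicit. The following is a minimal sketch on simulated data. The DGP mirrors the example of section 24.4.4 but with a larger sample, and the variable and scalar names (y1hat, y0hat, te, ate_manual, atet_manual) are hypothetical.

```stata
* Hypothetical DGP in the spirit of section 24.4.4
clear
set seed 10101
set obs 1000
generate x = rnormal(20, 5)
generate D = rbinomial(1, 0.5)
generate y = 1 + x + 2*D + rnormal(0, 1)

* Separate OLS fits by treatment status; predict for ALL observations
regress y x if D == 1
predict double y1hat, xb
regress y x if D == 0
predict double y0hat, xb

* ATE averages the predicted difference over the full sample
generate double te = y1hat - y0hat
summarize te, meanonly
scalar ate_manual = r(mean)
* ATET averages the predicted difference over treated observations only
summarize te if D == 1, meanonly
scalar atet_manual = r(mean)
display "Manual RA: ATE = " ate_manual "  ATET = " atet_manual

* Built-in estimator for comparison
teffects ra (y x) (D), ate
```

The manual point estimate should agree with that from teffects ra up to numerical precision. The manual calculation does not deliver standard errors, which is one reason to prefer the built-in command.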


For simplicity, we consider only OLS regression. More generally, the RA 
method used may vary with the outcome; logit regression may be used for a 
binary outcome, Poisson regression for a count outcome, and so on. 
Semiparametric and nonparametric methods could also be used to estimate 
$E(y_{1i}|\mathbf{x}_i)$ and $E(y_{0i}|\mathbf{x}_i)$. 


Regression methods were presented at length in section 24.4, in the 
context of an RCT with random assignment of treatment. Then the addition of 
regressors is relatively innocuous and is actually unnecessary; the motivation 
is to obtain a more precise estimate of the TE. When RA is used with 
observational data, however, the choice of covariates and functional form 
becomes much more important because covariates are being used to ensure 
that treatment assignment is random after including the covariates as 
controls. Thus, the regression estimator is not a robust estimator in most 
observational data settings. The following estimators may be better. 


24.6.2 Inverse-probability weighting 


The probability of receiving treatment may vary across individuals. This is 
very likely when observational data are used and can even happen in an RCT 
such as that analyzed in sections 24.8 and 24.9. 


Inverse-probability weighted (IPW) regression handles this complication 
by reweighting the data prior to regression, so that weighted observations 
more nearly satisfy the assumption of equal probability of receiving 
treatment. The weight assigned to a subject is the reciprocal of the 
probability of receiving the treatment level that the subject actually 
received. The method was used to handle missing data due to panel attrition 
in section 19.10.3. 


Denote for individual i the probability of treatment, the propensity score, 
by 


$$ p(\mathbf{z}_i) = \Pr(D_i = 1 \mid \mathbf{z}_i) $$


and let $1 - p(\mathbf{z}_i)$ denote the probability of no treatment. Note 
that we use z for the propensity-score variables to distinguish them from the 
controls x used in RA methods. Some or all variables may be in both x and z. 


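The IPW estimate of the ATE is given by 


$$ \widehat{ATE} = \frac{1}{N}\sum_{i=1}^{N}\frac{D_i y_i}{\widehat{p}(\mathbf{z}_i)} - \frac{1}{N}\sum_{i=1}^{N}\frac{(1 - D_i)y_i}{1 - \widehat{p}(\mathbf{z}_i)} $$


For an RCT with constant treatment assignment probabilities 
$p(\mathbf{z}_i) = p$ and with treated and control samples of sizes 
$N_1 = pN$ and $N_0 = (1 - p)N$, the formula simplifies to the difference in 
means $\bar{y}_1 - \bar{y}_0$. The teffects ipw command implements this 
estimator. 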


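A brief justification for this method is the following. Consider the first 
term, and note that $D_i y_i = D_i y_{1i}$ because 
$y_i = D_i y_{1i} + (1 - D_i) y_{0i}$ and $D_i(1 - D_i) = 0$. Conditioning on 
the controls z, we have 


$$ E\left\{\frac{Dy}{p(\mathbf{z})} \Big| \mathbf{z}\right\} = E\left\{\frac{Dy_1}{p(\mathbf{z})} \Big| \mathbf{z}\right\} = E\left\{\frac{D}{p(\mathbf{z})} \Big| \mathbf{z}\right\} \times E(y_1 \mid \mathbf{z}) = E(y_1 \mid \mathbf{z}) $$


where the second-to-last equality requires the conditional mean assumption 
(24.6) and the last equality uses $E(D|\mathbf{z}) = p(\mathbf{z})$ for the 
binary variable D. By similar algebra, 
$E[(1 - D)y/\{1 - p(\mathbf{z})\} \mid \mathbf{z}] = E(y_0 \mid \mathbf{z})$. 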
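The IPW formula can be computed directly from a fitted propensity score. The sketch below uses simulated observational-style data. The DGP, the logit propensity model, and the names phat, term1, term0, and ate_ipw are assumptions for illustration. The hand calculation uses unnormalized weights, so it need not match the teffects ipw output exactly if that command rescales the weights within each treatment group.

```stata
* Hypothetical DGP: treatment probability depends on z; true ATE is 2
clear
set seed 10101
set obs 2000
generate z = rnormal(0, 1)
generate D = runiform() < invlogit(0.5*z)
generate y = 1 + z + 2*D + rnormal(0, 1)

* First step: estimate the propensity score by logit
logit D z
predict double phat, pr

* Second step: difference of inverse-probability weighted averages
generate double term1 = D*y/phat
generate double term0 = (1 - D)*y/(1 - phat)
summarize term1, meanonly
scalar m1 = r(mean)
summarize term0, meanonly
scalar m0 = r(mean)
scalar ate_ipw = m1 - m0
display "Hand-computed IPW estimate of ATE = " ate_ipw

* Built-in estimator for comparison
teffects ipw (y) (D z, logit), ate
```

The naive difference in means would be biased upward in this DGP because treated individuals have higher z on average. The weighting is intended to remove that imbalance.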


Implementation uses a flexible model for the propensity score p(z). 
Logit or probit models are often used, though semiparametric and 
nonparametric models can also be used. 


The model that generates p(z) needs to provide a good fit to the data; 
interpretation and testing of coefficients is not the objective. To realize a 
good fit, one commonly uses a flexible specification that includes powers of 
z and interactions. This implies that the conditional probability model is 
typically not parsimonious and there is a risk of overfitting. For example, in 
the extreme case of more regressors than observations, we will necessarily 
perfectly fit the data. Methods for choosing a subset regression from the full 
set of potential regressors include stepwise subset selection and the lasso; 
see section 28.3. The key objective is to drop variables that contribute little 
to the fit of the model and hence can be dropped at small risk of 
misspecification. 


24.6.3 Propensity-score overlap 


Limitations of IPW derive from a potentially poorly specified conditional 
probability model and from numerical instability that may arise from having 
many fitted values in a small neighborhood of 0 or 1 because these values 
are given larger weight. 


Instability may arise also from the failure of the overlap assumption; that 
is, $0 < p(\mathbf{z}_i) < 1$. Such failure can happen even in a correctly 
specified conditional probability model. The teoverlap command provides 
checks and a test of the overlap assumption after weighting. 


Similarly, if propensity-score matching is used (see section 24.6.7), we 
need the distributions of the propensity scores by treatment status to be 
similar. 
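A simple overlap check might proceed as follows. This is a minimal sketch on simulated data. The DGP and the name phat are assumptions, and the summary statistics shown are only one of many possible diagnostics.

```stata
* Hypothetical DGP with treatment probability depending on z
clear
set seed 10101
set obs 2000
generate z = rnormal(0, 1)
generate D = runiform() < invlogit(0.5*z)
generate y = 1 + z + 2*D + rnormal(0, 1)

* Fitted propensity scores and their range by treatment status
logit D z
predict double phat, pr
tabstat phat, by(D) statistics(min p5 p50 p95 max)

* Overlap plot of estimated propensity-score densities after teffects
teffects ipw (y) (D z, logit)
teoverlap
```

Fitted values piling up near 0 or 1, or ranges that differ sharply between the treated and control groups, would signal a potential overlap problem.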


24.6.4 Regressor balance 


In the simplest RCT, treatment assignment is independent of any covariates, 
so covariates are expected to be balanced across the treated and control 
groups. In observational data, by contrast, treatment assignment will be 
related to covariates that also affect the outcome of interest. A 
well-specified model for IPW or matching should balance the covariates. 


For a single regressor z, let $\bar{z}_1$ and $\bar{z}_0$ denote the treated 
and control sample means, and let $s_1^2$ and $s_0^2$ denote the 
corresponding sample variances. Two measures of regressor difference across 
treatment groups are the standardized difference, 
$(\bar{z}_1 - \bar{z}_0)/\sqrt{(s_1^2 + s_0^2)/2}$, and the variance ratio 
$s_1^2/s_0^2$. 


For IPW estimators, we want to find that, after inverse-probability 
weighting, the standardized difference of each regressor is close to 0 and that 
the variance ratio is close to 1. Let $p_i$ denote the propensity score 
$p(\mathbf{z}_i)$ and $z_i$ denote a single regressor that is an element of 
$\mathbf{z}_i$. Then the tebalance command gives the weighted mean and 
variance for the treated and controls defined as 


$$ \bar{z}_{1,w} = \frac{1}{N_1}\sum_{i=1}^{N} D_i w_i z_i \qquad \bar{z}_{0,w} = \frac{1}{N_0}\sum_{i=1}^{N} (1 - D_i) w_i z_i $$

$$ s_{1,w}^2 = \frac{1}{N_1 - 1}\sum_{i=1}^{N} D_i w_i (z_i - \bar{z}_{1,w})^2 \qquad s_{0,w}^2 = \frac{1}{N_0 - 1}\sum_{i=1}^{N} (1 - D_i) w_i (z_i - \bar{z}_{0,w})^2 $$

$$ w_i = 1/p_i \text{ (treated)} \qquad w_i = 1/(1 - p_i) \text{ (controls)} $$


For nearest-neighbor matching estimators, presented later, the 
corresponding formulas for $\bar{z}_{1,w}$, $\bar{z}_{0,w}$, $s_{1,w}^2$, and 
$s_{0,w}^2$ are given by [TE] tebalance. 
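The raw balance measures can be computed by hand and then compared with the weighted measures reported after IPW estimation. This is a minimal sketch on simulated data. The DGP and the scalar names (m1, m0, v1, v0, stdiff_raw, vratio_raw) are hypothetical.

```stata
* Hypothetical DGP in which treated units have higher z on average
clear
set seed 10101
set obs 2000
generate z = rnormal(0, 1)
generate D = runiform() < invlogit(0.5*z)
generate y = 1 + z + 2*D + rnormal(0, 1)

* Raw (unweighted) standardized difference and variance ratio for z
summarize z if D == 1
scalar m1 = r(mean)
scalar v1 = r(Var)
summarize z if D == 0
scalar m0 = r(mean)
scalar v0 = r(Var)
scalar stdiff_raw = (m1 - m0)/sqrt((v1 + v0)/2)
scalar vratio_raw = v1/v0
display "Raw standardized difference = " stdiff_raw
display "Raw variance ratio          = " vratio_raw

* Raw and weighted measures reported after IPW estimation
teffects ipw (y) (D z, logit)
tebalance summarize
```

If the propensity-score model is adequate, the weighted standardized difference reported by tebalance summarize should be much closer to 0 than the raw value computed above.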


24.6.5 Doubly robust augmented IPW and IPW regression adjustment 


Two different hybrid methods combine regression models for the potential 
outcomes with inverse-probability weighting. These two methods have the 
advantage of being doubly robust because consistency requires either that 
the models for the potential outcomes be correctly specified or that the 
propensity-score model be correctly specified. So one of these can be 
misspecified. By contrast, the RA estimator uses only potential-outcome 
models, and these must be correctly specified, and the IPW estimator uses 
only a propensity-score model, and this must be correctly specified. 


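The augmented inverse-probability weighted (AIPW) estimator modifies 
the RA estimator by adding a weighted residual term and, for the linear 
model, estimates the potential outcomes by 


$$ \widehat{POM}_1 = \frac{1}{N}\sum_{i=1}^{N}\left\{\frac{D_i(y_i - \mathbf{x}_i'\widehat{\boldsymbol{\beta}}_1)}{\widehat{p}(\mathbf{z}_i)} + \mathbf{x}_i'\widehat{\boldsymbol{\beta}}_1\right\} $$

$$ \widehat{POM}_0 = \frac{1}{N}\sum_{i=1}^{N}\left\{\frac{(1 - D_i)(y_i - \mathbf{x}_i'\widehat{\boldsymbol{\beta}}_0)}{1 - \widehat{p}(\mathbf{z}_i)} + \mathbf{x}_i'\widehat{\boldsymbol{\beta}}_0\right\} $$


where $\widehat{p}(\mathbf{z}_i)$ is a first-step estimate of 
$\Pr(D_i = 1|\mathbf{z}_i)$ such as from a logit model and 
$\widehat{\boldsymbol{\beta}}_1$ and $\widehat{\boldsymbol{\beta}}_0$ are the 
RA estimates given in section 24.6.1. The AIPW estimate of the ATE equals 
$\widehat{POM}_1 - \widehat{POM}_0$. The teffects aipw command 
implements this estimator. 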


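The IPW-RA estimator, presented in Wooldridge (2010, 930), modifies the 
RA estimator by obtaining predictions from regressions that weight 
observations by the inverse probabilities. For linear models, we obtain 
estimates $\widehat{\boldsymbol{\beta}}_1$ and 
$\widehat{\boldsymbol{\beta}}_0$ by the separate weighted least-squares 
estimations that minimize 


$$ \sum_{i:D_i=1} w_i (y_{1i} - \mathbf{x}_i'\boldsymbol{\beta}_1)^2, \qquad w_i = 1/\widehat{p}(\mathbf{z}_i) \text{ if } D_i = 1 $$

$$ \sum_{i:D_i=0} w_i (y_{0i} - \mathbf{x}_i'\boldsymbol{\beta}_0)^2, \qquad w_i = 1/\{1 - \widehat{p}(\mathbf{z}_i)\} \text{ if } D_i = 0 $$


where $\widehat{p}(\mathbf{z}_i)$ is a first-step estimate of 
$\Pr(D_i = 1|\mathbf{z}_i)$ such as from a logit model. The IPW-RA estimate 
of the ATE is then 
$(1/N)\sum_i \mathbf{x}_i'\widehat{\boldsymbol{\beta}}_1 - (1/N)\sum_i \mathbf{x}_i'\widehat{\boldsymbol{\beta}}_0$, 
where the two sums are over the entire sample. The teffects ipwra command 
implements this estimator. 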


While both the AIPW and IPW-RA estimators are doubly robust, the term 
“doubly robust estimate of ATE” usually refers to the AIPW estimator. 


The AIPW estimator satisfies an orthogonality condition that enables use 
of the lasso to select a subset of potential control variables; see 
section 24.6.10. 
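The two doubly robust estimators can be compared with RA in a setting where the outcome model is deliberately misspecified but the propensity-score model is correct. This is a minimal sketch on simulated data. The DGP, with a quadratic term omitted from the fitted outcome model, is an assumption chosen for illustration, and results from a single simulated sample should be read with caution.

```stata
* Hypothetical DGP: outcome is quadratic in x; true ATE is 2
clear
set seed 10101
set obs 5000
generate x = rnormal(0, 1)
generate D = runiform() < invlogit(0.5*x)
generate y = 1 + x + x^2 + 2*D + rnormal(0, 1)

* RA with a linear-in-x outcome model (misspecified: omits x^2)
teffects ra (y x) (D), ate

* Doubly robust estimators: same outcome model, correct logit treatment model
teffects aipw  (y x) (D x, logit), ate
teffects ipwra (y x) (D x, logit), ate
```

Because the logit treatment model matches the assumed DGP, both doubly robust estimators remain consistent for the ATE here even though the outcome model omits the quadratic term.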


24.6.6 Matching methods 


The essential idea underlying matching methods is that one wants to 
construct a cell whose occupants constitute a matched group based on 
control variables. Given a matched set, the average cell difference in treated 
and untreated outcomes can be computed. An exact method would be to 
have one-to-one matching, in which every treated subject has an untreated 
counterpart with identical control variables. To implement an exact match, 
one must use discrete regressors. In practice, one also has continuous 
regressors for which exact matching is not feasible. Hence, exact matching is 
too stringent a condition when there are many control variables or even just 
a few control variables if they take many different values. 


In general, the following issues have to be addressed: regressors include 
both discrete and continuous variables; some cells may be sparse or even 
empty; and matching may be one to one or one to many. The larger the 
number of regressors in the model, the more compelling these issues 
become. 


A successful match means that there is at least one untreated subject who 
matches a treated subject, that is, a counterfactual exists. Whether there is a 
match is determined by applying a matching criterion that is a measure of 
the distance between the treated and untreated subjects. Given many control 
variables, we face a dimensionality problem. 


One solution to the dimensionality problem is to replace the regressors 
by a one-dimensional function of the regressors and use the value of the 
function to define a match. The propensity score, considered in the next 
section, is one such matching criterion. 


Alternatively, one can use a combination of exact matching for some 
discrete regressors and closeness in distance, often Euclidean distance, for 
other regressors. This nearest-neighbor matching (NNM) estimator is 
presented in section 24.6.9. 


For any given subject, a specified matching criterion may not be 
satisfied. This means that there is no counterfactual available for estimating 
the TE. This could happen if there is insufficient overlap of the distribution of 
regressor values between the treatment groups. In the interests of obtaining 
better balance, or sufficient overlap, such an observation may be dropped. 
The resulting trimmed (smaller) sample is expected to yield a less biased 
estimate of the TE. Smaller bias is traded off against a wider confidence 
interval resulting from higher variance due to shrinkage in sample size. 


Before presenting matching methods, we note that in preceding sections, 
there was reason to distinguish between control variables x used in RA and 
control variables z used in inverse-probability weighting because x and z 
need not coincide. For matching methods, any control variables are used 
solely for matching, and we use x to denote the variables used in matching. 


24.6.7 Propensity-score matching 


An inexact matching method is based on the propensity score estimated for 
each subject in the sample. The appeal of propensity-score matching (PSM) 
derives from the established result of Rosenbaum and Rubin (1983), who 
showed that if treated and untreated units have the same propensity score, 
then this is equivalent to them having the same distribution of the regressors 
in the conditional probability model that generates the propensity scores. 


The propensity score, the conditional probability of receiving treatment, 
is presented in sections 24.6.2 and 24.6.3. As already noted, here we use x to 
denote the matching variables, so the propensity score is 
$p(\mathbf{x}_i) = \Pr(D_i = 1|\mathbf{x}_i)$ and is estimated by binary 
outcome regression of D on x using the full sample. The fitted values 
$\widehat{p}(\mathbf{x}_i)$ are then partitioned into subsets that define 
matched groups. 


Stata implements PSM-based treatment evaluation using the teffects 
psmatch command. The specific procedure follows a two-step method of 
Abadie and Imbens (2006). All treated and control units are matched with 
replacement, so different treated units may share a common matched 
untreated unit. The match may be one to one or one to many. The Euclidean 
metric is used to find the closest match or matches. The N-sized matched 
sample can then be used to estimate the ATE and ATET. Because PSM is based 
on an estimated measure, the variance estimate of the TE needs to be adjusted 
for this additional source of variation; see [TE] Stata Treatment-Effects 
Reference Manual. 
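A minimal sketch of PSM on simulated data follows. The DGP is an assumption for illustration, and the choice of three neighbors in the second command is arbitrary.

```stata
* Hypothetical DGP: treatment probability depends on x; true ATE is 2
clear
set seed 10101
set obs 1000
generate x = rnormal(0, 1)
generate D = runiform() < invlogit(0.5*x)
generate y = 1 + x + 2*D + rnormal(0, 1)

* ATE using one nearest match on the logit propensity score (the default)
teffects psmatch (y) (D x, logit), ate

* ATET using three matches per observation
teffects psmatch (y) (D x, logit), atet nneighbor(3)
```

Using more matches per observation would be expected to reduce variance at the cost of poorer matches, a tradeoff that has to be judged case by case.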


24.6.8 Blocking and stratification 


Given the propensity scores, the next step is to generate matches such that a 
balanced sample with satisfactory overlap is obtained. This requires some 
form of bracketing (or smoothing) to create matched pairs or matched sets. 
Stratification or interval matching is based on the idea of dividing the range 
of variation of the propensity score in intervals such that within each 
interval, the treated and control units have, on the average, the same 
propensity score. The ATE is the weighted average of these differences. 


Denote by b the blocks defined over intervals of propensity score. Then 
the TE within the bth block is defined as 


$$ ATE_b = \frac{1}{N_{1b}}\sum_{i \in I(b)} D_i y_i - \frac{1}{N_{0b}}\sum_{i \in I(b)} (1 - D_i) y_i $$


where I(b) is the set of units in block b, $N_{1b}$ is the number of treated 
units in the bth block, and $N_{0b}$ is the number of control units in the 
bth block. Then the TE based on stratification is defined as 

$$ ATE^S = \sum_{b=1}^{B} ATE_b \times \left\{\sum_{i \in I(b)} D_i \Big/ \sum_{\forall i} D_i\right\} $$

where the weight for each block is given by the corresponding fraction of 
treated units and where B is the total number of blocks. The overall TE is a 
weighted average of block-specific TEs. The same approach is used to estimate 
ATET, but the averaging is over the subset of treated individuals only. 


An alternative estimation method is regression based. Given the blocked 
data structure, a linear regression of $y_i$ on an intercept, $D_i$, and $\mathbf{x}_i$ is 
estimated for $i \in I(b)$, and the block-specific ATE is given by the coefficient 
of D. The overall TE, ATE, is the weighted sum $\sum_{b=1}^{B} w_b \mathrm{ATE}_b$, where the 
weights $w_b$ are either the proportion of units in the block or the proportion of 
treated units in the block. 
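
The stratification estimator can be sketched by hand as follows. This is an illustrative sketch on simulated data: the data-generating process, the choice of B = 5 quintile blocks, and all variable names (phat, block, ate_b, and so on) are assumptions, not part of any built-in command.

```stata
* Simulated data (hypothetical DGP, for illustration only)
clear
set seed 10101
set obs 2000
generate x1 = rnormal()
generate x2 = rnormal()
generate D  = runiform() < invlogit(0.5*x1 - 0.5*x2)
generate y  = 1 + 2*D + x1 + x2 + rnormal()

* Step 1: estimate the propensity score by logit
quietly logit D x1 x2
predict double phat, pr

* Step 2: form B = 5 blocks from quintiles of the estimated propensity score
xtile block = phat, nquantiles(5)

* Step 3: block-specific difference in mean outcomes, ATE_b
bysort block: egen double ybar1 = mean(cond(D == 1, y, .))
bysort block: egen double ybar0 = mean(cond(D == 0, y, .))
generate double ate_b = ybar1 - ybar0

* Step 4: weight each block by its share of all treated units;
* averaging ate_b over the treated observations applies exactly these weights
summarize ate_b if D == 1, meanonly
display "Stratification estimate = " r(mean)
```

With only five blocks, some within-block imbalance in x1 and x2 remains, so the estimate need not equal the true effect exactly; using more blocks trades this bias against variance.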


The number of blocks B and the boundary points of each block, that is, 
the interval width, are choice parameters. To generate a matching set, one 
may specify a minimum number m of required matches, so that for every 
observation $i = 1, \ldots, N$, whether receiving treatment 1 or 0, 
there are $h_i \geq m$ matching observations in the block. The matched set 
consists of the propensity score $p(\mathbf{x}_i)$ and the $h_i$ propensity scores of subjects 
who received the other treatment. 


Specifying a small value of m means that the matched set is closer to the 
ith observation. This implies a smaller bias in the estimate of ATE but a larger 
variance. Increasing m implies reducing the variance of the estimate 
potentially at the cost of increasing the bias. 


24.6.9 Nearest neighbors matching 


A second alternative to exact matching is NNM. This can also be interpreted 
as a variant of PSM. The NNM is obtained by creating a matched set based on 
the closeness of the vector of regressors $\mathbf{x}_i$ to the vector $\mathbf{x}_j$, where the 
subject j received the other treatment. To make this operational, we need to 
choose a metric of closeness of two vectors. 


A standard choice is the scaled or weighted Euclidean ("Mahalanobis") 
distance metric, defined as $(\mathbf{x}_i - \mathbf{x}_j)'\mathbf{S}^{-1}(\mathbf{x}_i - \mathbf{x}_j)$, where $(\mathbf{x}_i - \mathbf{x}_j)$ is the 
difference in the vector of K regressors and $\mathbf{S}$ is the $K \times K$ matrix of 
variances and covariances of elements of x. As in the case of PSM, a 
minimum number of nearest-neighbor matches may be specified. 


A matching criterion is based on a measure of closeness of two 
observations. A more stringent criterion defines a smaller catchment area for 
an acceptable match. In the terminology of nonparametric statistics, a 
criterion is characterized by a bandwidth parameter; see section 2.6.6. The 
choice of bandwidth, termed a “caliper” in matching applications, is the 
most important component in selecting the acceptable matching 
observations. Choosing m nearest-neighbor observations as matches also 
implicitly uses a bandwidth. 


Another variant of NNM is radius matching in which 
$A_i\{p(\mathbf{x})\} = \{p_j : (p_i - p_j)^2 < r\}$ defines the neighborhood based on 
estimated propensity scores. This means that all control cases with estimated 
propensity scores falling within radius r are matched to the ith treated case. 


The counterfactual is generated by taking a weighted average of the 
outcomes in the matched set of subjects who received the other treatment. 
The TE for subject i who received treatment 1 is $\hat{\delta}_{i,\mathrm{NNM}} = (y_i - \bar{y}_{A_i})$, where 
$\bar{y}_{A_i}$ is the weighted average of outcomes in the reference group consisting of 
the nearest-neighboring subjects who received the treatment 0. As in the case 
of PSM, both ATE and ATET can be computed using the estimates $\hat{\delta}_{i,\mathrm{NNM}}$. 


This estimator of ATE (and ATET) is potentially biased if matching is based 
on more than one continuous variable. A second step makes a regression- 
based bias correction to the estimated potential outcome. The bias-corrected 
ATE is estimated using bias-adjusted estimates of the potential outcome. 


The teffects nnmatch command implements NNM-based treatment 
evaluation. 
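
A minimal sketch of NNM with regression-based bias correction follows. The simulated data and the variable names are assumptions for illustration only; the options shown (nneighbor(), biasadj(), and metric()) are those of teffects nnmatch.

```stata
* Simulated data (hypothetical DGP, for illustration only)
clear
set seed 10101
set obs 2000
generate x1 = rnormal()
generate x2 = rnormal()
generate D  = runiform() < invlogit(0.5*x1 - 0.5*x2)
generate y  = 1 + 2*D + x1 + x2 + rnormal()

* ATE: one nearest neighbor, Mahalanobis metric (the default),
* with bias adjustment because matching is on two continuous variables
teffects nnmatch (y x1 x2) (D), ate nneighbor(1) biasadj(x1 x2)

* ATET using the unscaled Euclidean metric instead
teffects nnmatch (y x1 x2) (D), atet metric(euclidean)
```

Comparing the estimates with and without biasadj() gives a rough sense of how much the bias correction matters in a given application.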


24.6.10 Machine-learning methods 


The methods of this chapter rely on the assumption of unconfoundedness. 
This assumption becomes more plausible the better the set of control variables 
used. 


Section 28.8 presents in detail the use of machine-learning methods to 
accomplish this. These methods overcome the overfitting inherent in data 
mining by basing estimation on moment conditions with an 
orthogonalization property (see section 28.8.8) and using sample splitting; 
see section 28.8.9. 


Section 28.8.3 presents machine-learning methods for estimating $\alpha$ in the 
simple linear model (24.4); in that case, the TEs are homogeneous. 
Section 28.9.4 presents machine-learning methods for estimating ATE in the 
heterogeneous effects model using the doubly robust AIPW estimator. Many 
more applications will be developed over time. 


24.6.11 Assessing unconfoundedness 


TEs analysis with observational data relies on the assumption of 
unconfoundedness. This is a nontestable assumption, meaning that one 
cannot confirm that this assumption is valid. Imbens and Rubin (2015, 
chap. 21) present several tests whose failure rejects the plausibility of 
unconfoundedness, even though their passing does not confirm 
unconfoundedness. The ability to perform such checks is very dependent on 
the problem at hand. 


Some tests rely on changing the analysis in such a way that the estimated 
TE should be zero. Such changes should not be pure noise, so $y_0, y_1 \perp D$, but 
should instead be ones that require controlling for the covariates used in the 
original analysis. 


One method introduces pseudo-outcome variables that have a nonzero 
ATE without controls but should have zero ATE after controlling for 
observables using the same variables and methods as used in the original 
analysis. For example, if we had data on pretreatment levels of y for the 
treated and control individuals, then we could repeat the analysis using one 
or more of these pretreatment levels as the outcome variable. Because these 
outcomes occurred before the time of the treatment, there should be no TE. 
(And if in this case $y_{-1}$, say, is already included in the control variables, then 
we could use $y_{-2}$ as the pseudo-outcome.) 
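
The pseudo-outcome check can be sketched as follows. The simulated data, and in particular the pretreatment variable ylag, are hypothetical; ylag is constructed to depend on the confounder x1 but not on the treatment.

```stata
* Simulated data (hypothetical DGP, for illustration only)
clear
set seed 10101
set obs 2000
generate x1   = rnormal()
generate D    = runiform() < invlogit(0.8*x1)
generate ylag = 1 + x1 + rnormal()          // pretreatment outcome: no TE by construction
generate y    = 1 + 2*D + x1 + rnormal()    // actual outcome

* Without controls, the pseudo-outcome differs by treatment status
regress ylag D, vce(robust)

* With the same controls and method as in the main analysis,
* the estimated pseudo-ATE should be close to zero
teffects ipwra (ylag x1) (D x1, logit), ate
```

A pseudo-ATE that is clearly different from zero after adjustment would cast doubt on unconfoundedness given the chosen controls; a value near zero does not prove it.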


A second method requires being able to partition the control individuals 
into two groups, one of which has characteristics reasonably similar to the 
treated group. Then we expect a zero ATE when we treat one of the two 
control groups as the treated group. 


Other tests are ones of the robustness of results to changes in control 
variables that should lead to little change in the estimated ATE. For example, 
if data on y are available for several pretreatment periods and are used as 
control variates, then adding or dropping one of these pretreatment values 
should make little difference. 


24.6.12 Robust confidence intervals for ATE 


Many of the TE estimates are obtained by joint estimation of several 
equations, where each equation involves an m-estimator. The teffects 
commands obtain standard errors by stacking the equations and applying 
standard asymptotic results to the system. 


For example, consider estimating the ATE for a continuous outcome with 
a linear RA model for the outcome variable and a logit model for a binary 
treatment. Suppose this model is fit using the command teffects ipwra (y 
$x) (D $z, logit), where $x and $z are global macros defining relevant 
variable lists. 


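The model parameters are estimated using the following equations: 

$$\begin{aligned}
\text{ATE:} \quad & \frac{1}{N} \sum_{i=1}^{N} \left\{ \hat{\alpha} - (\mathbf{x}_i'\hat{\boldsymbol{\beta}}_1 - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_0) \right\} = 0 \\
\text{Outcome } D_i = 1\text{:} \quad & \sum_{i=1}^{N} \frac{N_1}{N} \frac{D_i}{\Lambda(\mathbf{z}_i'\hat{\boldsymbol{\gamma}})} \, \mathbf{x}_i (y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_1) = \mathbf{0} \\
\text{Outcome } D_i = 0\text{:} \quad & \sum_{i=1}^{N} \frac{N_0}{N} \frac{1 - D_i}{1 - \Lambda(\mathbf{z}_i'\hat{\boldsymbol{\gamma}})} \, \mathbf{x}_i (y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_0) = \mathbf{0} \\
\text{Treatment:} \quad & \sum_{i=1}^{N} \left\{ D_i - \Lambda(\mathbf{z}_i'\hat{\boldsymbol{\gamma}}) \right\} \mathbf{z}_i = \mathbf{0}
\end{aligned}$$

where $\Lambda(\cdot)$ denotes the logistic cumulative distribution function. 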
The bottom equation gives the first-order conditions for logit regression of 
treatment $D_i$ on regressors $\mathbf{z}_i$. The outcome $D_i = 1$ equation gives the 
first-order conditions following weighted least-squares regression of the outcome 
$y_i$ on regressors $\mathbf{x}_i$, where the weights are the inverse of the predicted 
probability (propensity score) from the logit model that $D_i = 1$, with 
additional weight term $N_1/N$. The outcome $D_i = 0$ equation is a similar 
equation for untreated observations. The top equation is simply a rewriting 
of $\widehat{\mathrm{ATE}} = (1/N) \sum_i \mathbf{x}_i'\hat{\boldsymbol{\beta}}_1 - (1/N) \sum_i \mathbf{x}_i'\hat{\boldsymbol{\beta}}_0$, and $\hat{\alpha}$ is the ATE estimate. 


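Stacking all four equations yields the system of first-order conditions 

$$\sum_{i=1}^{N} \mathbf{s}_i(y_i, D_i, \mathbf{x}_i, \mathbf{z}_i, \hat{\alpha}, \hat{\boldsymbol{\beta}}_1, \hat{\boldsymbol{\beta}}_0, \hat{\boldsymbol{\gamma}}) = \mathbf{0}$$

or, more simply, $(1/N) \sum_i \mathbf{s}_i(\hat{\boldsymbol{\theta}}) = \mathbf{0}$, where $\hat{\boldsymbol{\theta}} = (\hat{\alpha}, \hat{\boldsymbol{\beta}}_1', \hat{\boldsymbol{\beta}}_0', \hat{\boldsymbol{\gamma}}')'$. 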


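This system is just identified, so the estimator is an estimating equations 
estimator, or, equivalently, a just-identified generalized method of moments 
estimator. Assuming that $E\{\mathbf{s}(y_i, D_i, \mathbf{x}_i, \mathbf{z}_i, \boldsymbol{\theta})\} = \mathbf{0}$, the estimator $\hat{\boldsymbol{\theta}}$ is 
consistent for $\boldsymbol{\theta}$ and is asymptotically normally distributed. For independent 
observations, the heteroskedastic-robust variance-covariance matrix 
estimate takes the usual sandwich form 

$$\hat{V}(\hat{\boldsymbol{\theta}}) = \left\{ \sum_{i=1}^{N} \left. \frac{\partial \mathbf{s}_i(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}'} \right|_{\hat{\boldsymbol{\theta}}} \right\}^{-1} \left\{ \sum_{i=1}^{N} \mathbf{s}_i(\hat{\boldsymbol{\theta}}) \mathbf{s}_i(\hat{\boldsymbol{\theta}})' \right\} \left\{ \sum_{i=1}^{N} \left. \frac{\partial \mathbf{s}_i(\boldsymbol{\theta})'}{\partial \boldsymbol{\theta}} \right|_{\hat{\boldsymbol{\theta}}} \right\}^{-1}$$

A cluster-robust version replaces the middle matrix with 
$\sum_{i=1}^{N} \sum_{j=1}^{N} \delta_{ij} \mathbf{s}_i(\hat{\boldsymbol{\theta}}) \mathbf{s}_j(\hat{\boldsymbol{\theta}})'$, where $\delta_{ij}$ is an indicator variable equal to 1 if 
observations i and j are in the same cluster. 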


Note that while we need $E\{\mathbf{s}(y, D, \mathbf{x}, \mathbf{z}, \boldsymbol{\theta})\} = \mathbf{0}$ for all parameters to be 
consistently estimated, the IPW-RA estimator of ATE is consistent even if 
either the RA equation or the treatment probability equation (but not both) 
is misspecified. 
Variations of this example include changing the first equation appropriately 
to obtain estimates of the ATET and POMs. For a multivalued treatment, say, 
with m treatments, the first equation will become (m - 1) equations (in the 
case of ATE and ATET), the second and third equations will become m 
equations, and the treatment probability model may be a multinomial logit 
(MNL) model. This is the method used by the teffects commands; see 
Methods and Formulas of [TE] teffects aipw for a very general treatment. 


For models not covered by the teffects command, for example, using a 
complementary log—log model for the treatment probability, one can 
estimate each model sequentially and obtain standard errors by a two-step 
bootstrap, as in section 12.4. An alternative method to obtain the standard 
errors is to again use a “stacked” moment-based approach and use the gmm 
command; section 13.3.11 provides an example. 
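
The sequential approach with a two-step bootstrap can be sketched as follows for an IPW estimate of the ATE with a complementary log-log treatment model. The simulated data, the program name ipwcll, and the use of normalized IPW weights are assumptions made for illustration; this is not a built-in teffects estimator.

```stata
* Simulated data (hypothetical DGP, for illustration only)
clear all
set seed 10101
set obs 2000
generate z = rnormal()
generate D = runiform() < 1 - exp(-exp(-0.3 + 0.5*z))  // cloglog treatment model
generate y = 1 + 2*D + z + rnormal()

* Both steps are wrapped in one program so that the bootstrap
* re-estimates the treatment model in every resample
program define ipwcll, rclass
    tempvar p w
    quietly cloglog D z                       // step 1: treatment probability
    quietly predict double `p', pr
    generate double `w' = cond(D == 1, 1/`p', 1/(1 - `p'))
    summarize y [aweight = `w'] if D == 1, meanonly
    local mu1 = r(mean)                       // weighted mean for treated
    summarize y [aweight = `w'] if D == 0, meanonly
    local mu0 = r(mean)                       // weighted mean for controls
    return scalar ate = `mu1' - `mu0'         // step 2: IPW estimate of ATE
end

bootstrap ate = r(ate), reps(200) nodots: ipwcll
```

Placing the first-step cloglog fit inside the bootstrapped program is what makes the standard error account for estimation of the propensity score.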


The preceding analysis assumes that estimators are standard root-N 
consistent and asymptotically normal. For matching estimators, the 
asymptotics are nonstandard, and alternative methods are used to obtain the 
standard errors. 


24.7 Stata commands for treatment evaluation 


Stata has several commands for treatment evaluation. The starting point for 
models that satisfy unconfoundedness is the teffects commands. We 
additionally briefly summarize other commands for treatment evaluation. 


24.7.1 The teffects commands and telasso command 


Table 24.2 displays the standard commands for basic TE analysis 
when the treatment is exogenous. Additional details of implementation of each 
command follow in the next section along with examples. 


Table 24.2. Key basic TEs methods and commands 


Class                 Method                           Command 

Regression            RA                               teffects ra 
IPW                   Inverse-probability weighting    teffects ipw 
                      AIPW                             teffects aipw, telasso 
                      IPW-RA                           teffects ipwra 
Matching              NNM                              teffects nnmatch 
                      PSM                              teffects psmatch 
Multiple treatments                                    teffects multivalued 
Diagnostics           Overlap plots                    teoverlap 
                      Covariate balance                tebalance 


In this chapter, we focus on a continuous outcome, though the outcome can 
also be a count, a binary outcome, or an ordered discrete outcome. For binary 
or ordered discrete outcomes, the TE is measured as the change in probability 
due to treatment. 


An example of the teffects commands is the teffects ipwra command, 
which includes both RA and inverse-probability weighting. The command 
syntax is 


teffects ipwra (ovar omvarlist [, omodel noconstant]) 
    (tvar tmvarlist [, tmodel noconstant]) [if] [in] [weight] [, stat options] 


Here ovar is the outcome variable, and omodel is the outcome model that can 
be linear, logit, probit, hetprobit(), poisson, flogit, fprobit, or 
fhetprobit(). The treatment variable is tvar, and tmodel can be logit, 
probit, or hetprobit(). The statistic defined in stat can be ate (the default), 
atet, or pomeans. The vce(vcetype) may be robust, cluster clustvar, 
bootstrap, or jackknife. The option aequations displays all equations 
estimated. Option pstolerance(#) sets the tolerance for satisfying the overlap 
assumption, with default value $10^{-5}$, and option osample() identifies 
observations that violate the overlap assumption. 
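
The syntax can be illustrated with a short sketch on simulated data. The data-generating process and the names y, D, x1, x2, and id (a hypothetical household identifier with two members per household) are assumptions for illustration only.

```stata
* Simulated data (hypothetical DGP, for illustration only)
clear
set seed 10101
set obs 2000
generate id = ceil(_n/2)                     // hypothetical cluster identifier
generate x1 = rnormal()
generate x2 = rnormal()
generate D  = runiform() < invlogit(0.5*x1 - 0.5*x2)
generate y  = 1 + 2*D + x1 + x2 + rnormal()

* ATE with linear outcome model, logit treatment model,
* cluster-robust standard errors, and all estimated equations displayed
teffects ipwra (y x1 x2, linear) (D x1 x2, logit), ate vce(cluster id) aequations

* POMs with a probit treatment model
teffects ipwra (y x1 x2) (D x1 x2, probit), pomeans

* ATET with a stricter overlap tolerance than the default
teffects ipwra (y x1 x2) (D x1 x2), atet pstolerance(1e-4)
```

The aequations output shows the outcome equations for each treatment level and the treatment equation, which correspond to the stacked estimating equations of section 24.6.12.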


The teffects ra command has similar syntax, except that the treatment 
model is no longer relevant, and the teffects ipw command has similar 
syntax, except that the outcome model is no longer relevant. The teffects 
aipw command has similar syntax, except that the ATET is not identified in this 
case, and options nls and wnls provide alternative methods other than the 
default of maximum likelihood for estimating the conditional means. 


The telasso command uses the lasso to select a subset of regressors from 
a wide set of potential regressors in obtaining AIPW estimates; see 
section 24.6.10. 


For the teffects psmatch command, the outcome model is not relevant. 
The treatment variable model can be logit, probit, or hetprobit(), and the 
statistic defined in stat can be ate (the default) or atet. The nneighbor(#) 
option specifies the number of matches per observation; the default is one. The 
caliper(#) option defines the maximum distance used to determine potential 
neighbors, and options pstolerance(#) and osample() are the same as for the 
teffects ipwra command. The default standard errors (option vce(robust)) 
are heteroskedastic-robust standard errors that allow $\mathrm{Var}(y_{0i} \mid \mathbf{x}_i)$ and $\mathrm{Var}(y_{1i} \mid \mathbf{x}_i)$ 
to vary with covariates and treatment level. These are based on two nearest 
neighbors; this can be changed using option vce(robust [, nn(#)]). The 
option vce(iid) gives homoskedastic standard errors. 


The teffects nnmatch command has options similar to teffects 
psmatch with the following important variations. The treatment variable model 
is no longer relevant. Instead, the metric for matching is specified in the 
metric() option as mahalanobis (the default), ivariance, euclidean, or 
matrix matname. The ematch(varlist) option matches exactly on specified 
variables. The biasadj(varlist) option adjusts for bias that arises if matching is on more 
than one continuous variable. 


Regardless of the estimation method used, it is important to additionally 
use the tebalance and teoverlap commands where available to check for 
covariate balance and propensity-score overlap. We focus on the most common 
case of a binary treatment. A multivalued treatment example is given in 
section 24.10. In that case, the teffects commands aside from the matching 
estimator commands are available, and an MNL model is used for the propensity 
scores (option logit). 
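
A minimal sketch of the balance and overlap checks after an IPW fit follows. The simulated data and variable names are assumptions for illustration only.

```stata
* Simulated data (hypothetical DGP, for illustration only)
clear
set seed 10101
set obs 2000
generate x1 = rnormal()
generate x2 = rnormal()
generate D  = runiform() < invlogit(0.5*x1 - 0.5*x2)
generate y  = 1 + 2*D + x1 + x2 + rnormal()

* Fit the TE model first; the diagnostics use its results
teffects ipw (y) (D x1 x2, logit)

* Covariate balance: raw versus weighted standardized differences
tebalance summarize

* Overidentification test of covariate balance
tebalance overid

* Overlap: densities of the estimated propensity score by treatment status
teoverlap
```

After weighting, standardized differences near zero and variance ratios near one suggest balance, and propensity-score densities that share a common support are consistent with the overlap assumption.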


Additional TE commands are eteffects and etregress for linear 
regression with endogenous treatment, etpoisson for Poisson regression with 
endogenous treatment, and stteffects for survival data with exogenous 
treatment. The last of these commands uses parametric survival models to 
control for censoring. 


24.7.2 ERM commands when treatment is exogenous 


Stata's extended regression model (ERM) commands, summarized in 
section 23.7, are focused on parametric models with endogenous regressors and 
joint normal model errors; see section 25.3. These commands cover linear 
regression (eregress), probit regression (eprobit), ordered probit (eoprobit), 
and interval regression (eintreg). 


For a discrete-valued exogenous treatment, the extreat() 
option is used, and the estat teffects postestimation command provides the 
ATE estimate as default; options provide the estimated ATET and the POMs. If the 
model is fit with option vce(robust), then estat teffects gives 
unconditional standard errors that additionally allow for variation in the 
regressors; see section 13.7.9. 


Table 24.3 displays alternative ways of doing the same analysis. The left 
column gives the syntax for the teffects ra command, and the right column, 
for the corresponding ERM command, extreat() option, and subsequent estat 
teffects postestimation command. An endogenous treatment application 
using the eregress command with the entreat() option is given in section 25.3.3. 


Table 24.3. teffects ra and ERM commands when treatment is exogenous 


teffects command                     ERM command 

Linear regression with exogenous treatment 
teffects ra (y x) (D)                eregress y x, extreat(D) 
                                     estat teffects 

Binary probit with exogenous treatment 
teffects ra (y x, probit) (D)        eprobit y x, extreat(D) 
                                     estat teffects 
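
The linear regression row of the table can be tried out with the following sketch. The simulated data and variable names are assumptions for illustration only.

```stata
* Simulated data (hypothetical DGP, for illustration only)
clear
set seed 10101
set obs 2000
generate x1 = rnormal()
generate x2 = rnormal()
generate D  = runiform() < invlogit(0.5*x1 - 0.5*x2)
generate y  = 1 + 2*D + x1 + x2 + rnormal()

* RA estimate of the ATE using teffects
teffects ra (y x1 x2) (D)

* The corresponding ERM command with exogenous treatment;
* vce(robust) is specified so that estat teffects reports
* unconditional standard errors
eregress y x1 x2, extreat(D) vce(robust)
estat teffects
```

Both approaches fit separate linear outcome regressions for each treatment level, so the two ATE estimates should be directly comparable.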


24.8 Oregon Health Insurance Experiment example 


We use a sample extract from the public use files of the OHIE to illustrate the 
concepts and methods presented in preceding sections using the relevant 
Stata commands. The experiment was a lottery whose winners were granted 
the option of applying for enrollment in the state Medicaid program, a state- 
run health insurance program that would increase access to healthcare. 


In this section, we summarize the experiment and the data analyzed. In 
the subsequent section, we analyze how the out-of-pocket cost of medical 
services received varied with whether the individual was a lottery winner. 


We compute the ATE of winning the lottery, easily done here because a 
simple framework of exogenous treatment assignment is appropriate for this 
RCT. The only complication is that winning the lottery varied with household 
size and the time of the lottery and the survey, so analysis should control for 
these variables. 


If all lottery winners enrolled in Medicaid, then this ATE would also 
measure the ATE of being in Medicaid. Not all lottery winners enrolled in 
Medicaid, however, so the ATE of winning the lottery needs to be viewed as 
an intention-to-treat effect if the ultimate treatment of interest is enrolling in 
Medicaid. The more difficult estimation of the ATE of enrolling in Medicaid 
is deferred to section 25.5. 


24.8.1 The OHIE 


The OHIE is an important modern example of an RCT or a social experiment. 
The background to this experiment has been covered in detail in NBER (n.d.) 
and Baicker et al. (2013). Here we provide only the essential details for 
interpreting the application that follows. 


The U.S. Medicaid program, which provides health insurance for low-income 
people, is run separately by each state. At the time of the experiment, 
the state of Oregon's Medicaid program, the Oregon Health Program (OHP), 
was separated into two components. OHP Plus served the categorically 
eligible Medicaid population, while OHP Standard was an expansion program 
targeting low-income uninsured adults who were otherwise ineligible for 
OHP Plus. Because of budgetary constraints, OHP Standard was wound back 
and closed to new applicants in 2004, leading to significant attrition over the 
following four years. Facing an 80% decline in enrollments, in January of 
2008, the state determined to expand the program by an additional 10,000 
positions. 


Importantly for our purposes, the OHP anticipated demand for the 
program in excess of the available new positions, so it sought and received 
permission to assign selection by lottery. Such random assignment provides 
a rare opportunity to assess the impact of expanded health insurance 
coverage on a variety of health and financial outcomes within an RCT design 
framework. 


Indeed, from February to March 2008, some 90,000 individuals 
registered for the lottery list. Over the next 6 months, the government 
conducted 8 waves of lottery draws, resulting in some 35,000 individuals 
being offered the opportunity to apply for OHP coverage. Note that the 
opportunity to apply was extended to all members of the selected 
individual’s household; thus, selection was random conditional on the 
number of household members in the lottery list. Approximately 35,000 
individuals from 30,000 unique households were selected, and of those 
approximately 30% were eligible and enrolled in Medicaid by the given 
deadlines. 


Following the treatment, researchers tracked lottery participant outcomes 
over the next 12 months with 3 mail surveys. We analyze data from the third 
of these mail surveys, which was undertaken in 7 mail-out waves 
approximately 12 months after the lottery (July and August 2009). Nearly all 
individuals selected in the lottery, as well as an approximately equal number 
of nonselected individuals, were mailed questionnaires regarding healthcare 
needs, experiences, and costs over the previous six months. Following an 
intensive follow-up protocol undertaken on a subset of nonresponders, the 
researchers achieved an estimated response rate of approximately 50%. 


In this chapter, we view the treatment as being selected by the lottery to 
be eligible for Medicaid. We include indicator variables capturing household 
size and survey wave to control for potential correlation with the probability 
of treatment. We also include a set of relevant covariates to improve 
efficiency, including variables relating to smoking status, income as a 
percentage of the federal poverty line, education level, and employment. 


Not all individuals selected by the lottery actually enrolled in Medicaid. 
In section 25.5, we view the treatment to be actual enrollment in Medicaid 
and obtain a local average TE estimate that uses selection by the lottery as an 
instrument for actual enrollment in Medicaid. In that context, the ATE that we 
estimate in the analysis below is called an intention-to-treat effect. 


24.8.2 Data summary 


The size and complexity of the OHIE dataset are reflected in the data files in 
NBER (n.d.). Given our limited objective of illustrating the methods in this 
chapter, rather than carrying out a full-scale empirical study, we focus on a 
single outcome variable and limit the number and type of exogenous 
regressors that are used. 


The key data can be grouped into outcomes, treatments, and regressors, 
with the global zlist referring to household variables and wave variables 
that may affect lottery participation and the global xlist referring to 
individual-specific variables that may affect the outcome variable. 


. * Variables: (1) outcomes (2) z: treatment related (3) x: outcome related 
. qui use mus224ohiesmallrecode, clear 


. global outcomes oop dowe ervisits // Outcome variables 

. global y oop // Outcome variable for this chapter 

. global hh dhhsize2 dhhsize3 // Household size dummies 

. global wave dlotdraw* dsurvdraw* // Lottery and survey draws 

. global zlist $hh $wave // z variables for the treatment 

. global xlist dsmoke hhinc deduc2-deduc4 demploy2-demploy4 // x for outcome 


A brief description of the variables used in the following analysis is as 
follows. 


. * Variable descriptions 
. describe $outcomes lottery medicaid household_id $zlist $xlist 


Variable       Storage   Value 
name           type      label      Variable label 

oop            float                Out of pocket costs (cost_tot_oop_mod_12m) 
dowe           byte      yesno      Owe any money for health (cost_any_owe_12m owe_any) 
ervisits       byte                 Emergency room visits (er_num_mod_12m) 
lottery        byte      lottery    Selected in the lottery 
medicaid       byte      enrolled   Ever enrolled in Medicaid from 1st not. date (10mar2008) 
                                      to 30sep2009 (ohp_all_e 
household_id   float                Scrambled household identifier 
dhhsize2       byte                 2 in hh (dddnumhh_li_2) 
dhhsize3       byte                 3 in hh (dddnumhh_li_3) 
dlotdraw2      byte                 draw_lottery==2 (llldraw_lot_2) 
dlotdraw3      byte                 draw_lottery==3 (llldraw_lot_3) 
dlotdraw4      byte                 draw_lottery==4 (llldraw_lot_4) 
dlotdraw5      byte                 draw_lottery==5 (llldraw_lot_5) 
dlotdraw6      byte                 draw_lottery==6 (llldraw_lot_6) 
dlotdraw7      byte                 draw_lottery==7 (llldraw_lot_7) 
dlotdraw8      byte                 draw_lottery==8 (llldraw_lot_8) 
dsurvdraw2     byte                 draw_survey==2 (ddddraw_sur_2) 
dsurvdraw3     byte                 draw_survey==3 (ddddraw_sur_3) 
dsurvdraw4     byte                 draw_survey==4 (ddddraw_sur_4) 
dsurvdraw5     byte                 draw_survey==5 (ddddraw_sur_5) 
dsurvdraw6     byte                 draw_survey==6 (ddddraw_sur_6) 
dsurvdraw7     byte                 draw_survey==7 (ddddraw_sur_7) 
dsmoke         byte      smk_lbl    Currently smoke cigs (smk_curr_12m) 
hhinc          float                Household income as % of federal poverty line 
                                      (hhinc_pctfpl_12m) 
deduc2         byte                 HS diploma or GED (edu_12m_2) 
deduc3         byte                 Voc or 2yr degree (edu_12m_3) 
deduc4         byte                 Four year degree (edu_12m_4) 
demploy2       byte                 Work < 20 hrs/wk (employ_hrs_12m_2) 
demploy3       byte                 Work 20-29 hrs/wk (employ_hrs_12m_3) 
demploy4       byte                 Work 30+ hrs/wk (employ_hrs_12m_4) 


The names in parentheses are the original names of the variables in the OHIE 
dataset. The suffix 12m refers to tracking of lottery participant outcomes 
over the next 12 months; the survey questions in fact covered healthcare 
needs, experiences, and costs over the previous 6 months. 


The continuous outcome variable analyzed is out-of-pocket expenditures 
(oop) in the past six months. Additional outcomes that we do not analyze, 
reserving them for exercises, are a binary outcome of whether one owes any 
money (dowe) and a count outcome of emergency room visits (ervisits). 


The treatment considered in this chapter is whether one wins the lottery 
(lottery), while the next chapter analyzes enrollment in Medicaid 
(medicaid). 


. * Summary statistics: Outcomes, treatments, and household size 
. summarize $outcomes lottery medicaid $hh 


    Variable        Obs        Mean   Std. dev.       Min        Max 

         oop     22,679    269.0062    733.0821         0       9400 
        dowe     22,476    .5542801     .497056         0          1 
    ervisits     22,491    .4372416    .9812417         0         10 
     lottery     22,679    .4972001    .5000032         0          1 
    medicaid     22,679    .2812293    .4496091         0          1 
    dhhsize2     22,679    .2963094    .4566392         0          1 
    dhhsize3     22,679    .0025133    .0500713         0          1 


While 49.7% of lottery participants won the lottery, the treatment of this 
chapter, only 28.1% of the sample actually enrolled in Medicaid, a treatment 
analyzed in chapter 25. 


Although the state randomly sampled from individuals on the list, the 
entire household of any selected individual was considered selected and 
eligible to apply for OHP Standard. Thus, selected (treatment) individuals are 
disproportionately drawn from households of larger household size. 
Additionally, for the sample at hand, winning the lottery varied with the time 
of the lottery and the survey, as the following linear probability model 
indicates. 


. * Lottery dummy varies with household size and lottery and survey waves 
. regress lottery $zlist, vce(cluster household_id) 

Linear regression                               Number of obs   =   22,679 
                                                F(15, 20147)    =    93.29 
                                                Prob > F        =   0.0000 
                                                R-squared       =   0.0677 
                                                Root MSE        =   .48294 

                   (Std. err. adjusted for 20,148 clusters in household_id) 

                             Robust 
     lottery   Coefficient  std. err.       t    P>|t|    [95% conf. interval] 

    dhhsize2     .0651882   .0088219     7.39   0.000     .0478965   .0824798 
    dhhsize3     .2807319   .0618745     4.54   0.000     .1594529    .402011 
   dlotdraw2      .002456   .0168842     0.15   0.884    -.0306384   .0355504 
   dlotdraw3     .0006732    .016864     0.04   0.968    -.0323816   .0337279 
   dlotdraw4     .0541089   .0164205     3.30   0.001     .0219234   .0862944 
   dlotdraw5     .0747288    .016741     4.46   0.000     .0419149   .1075426 
   dlotdraw6     .0850831   .0144081     5.91   0.000      .056842   .1133243 
   dlotdraw7     .0959815   .0144576     6.64   0.000     .0676434   .1243197 
   dlotdraw8     .0855691   .0165617     5.17   0.000     .0531068   .1180315 
  dsurvdraw2     .0122423   .0188461     0.65   0.516    -.0246976   .0491822 
  dsurvdraw3     .0205117   .0188443     1.09   0.276    -.0164247   .0574481 
  dsurvdraw4    -.1936876   .0163184   -11.87   0.000     -.225673  -.1617021 
  dsurvdraw5    -.2042628   .0162951   -12.54   0.000    -.2362026  -.1723231 
  dsurvdraw6    -.2686614   .0149923   -17.92   0.000    -.2980475  -.2392753 
  dsurvdraw7    -.3149375   .0148982   -21.14   0.000    -.3441391  -.2857359 
       _cons     .5820885    .011587    50.24   0.000      .559377   .6048001 


Separate analysis of each variable shows that lottery success by household 
size ranged from 0.57 to 0.89, by lottery draw ranged from 0.48 to 0.53, and 
by survey ranged from 0.34 to 0.66. Given this, we expect fitted propensity 
scores to be approximately in the range 0.3 to 0.9. We use the variables in 
zlist as controls in the analysis below. Finkelstein et al. (2012) used a 
richer list that added interactions of the wave variables with the household 
size indicator variables. 


To illustrate the use of control variables that may increase the efficiency 
of estimators of an outcome model, the analysis below uses an indicator for 
smoker (dsmoke), household income (hhinc), and education and employment 
indicator variables. 


. * Summary statistics: xlist variables that may improve outcome model fit 
. summarize $xlist 


    Variable        Obs        Mean   Std. dev.       Min        Max 

      dsmoke     22,154    2.262661    .9171565         1          3 
       hhinc     20,478    76.97273    69.16905         0   461.6898 
      deduc2     21,986    .4982716    .5000084         0          1 
      deduc3     21,986    .2204585    .4145653         0          1 
      deduc4     21,986    .1137997     .317575         0          1 
    demploy2     22,411    .0912052    .2879071         0          1 
    demploy3     22,411    .0999509    .2999412         0          1 
    demploy4     22,411    .2638436    .4407254         0          1 


24.8.3 Initial regression analysis 


Before demonstrating the various teffects commands, we perform some 
initial analysis. 


The following quantile plots truncate oop at $2,000 (the 97th percentile) 
for readability. 


. * Quantile plot of outcome variable by treatment status 
. qui qplot $y if $y < 2000, over(lottery) clpattern(l _) recast(line) 
>     ytitle("Out-of-pocket spending") legend(pos(11) ring(0) col(1)) 
>     title("Quantile plots for lottery") lwidth(medthick thick) 
>     xtitle("Fraction of the data") 
>     lpattern(solid dash) 

. qui qplot $y if $y < 2000, over(medicaid) clpattern(l _) recast(line) 
>     ytitle("Out-of-pocket spending") legend(pos(11) ring(0) col(1)) 
>     title("Quantile plots for Medicaid") lwidth(medthick thick) 
>     xtitle("Fraction of the data") 
>     lpattern(solid dash) 


The first panel of figure 24.1 is very similar to the plot given in 
Finkelstein et al. (2012, 1093). About half the sample had zero out-of-pocket 
expenditures. In principle, the outcome could be modeled using a two-part 
model; instead, an RA uses the linear model. The out-of-pocket expenditures 
are lower for lottery winners. The difference is even greater for those who 
enroll in Medicaid; this is studied in chapter 25. 
Figure 24.1. Quantile plots of TE 


A starting point is OLS regression of the outcome on a treatment dummy 
and some additional control variables, where in the most general case, 

$$y_{ih} = \beta_0 + \beta_1 \mathrm{treat}_{ih} + \mathbf{z}_{ih}'\boldsymbol{\beta}_2 + \mathbf{x}_{ih}'\boldsymbol{\beta}_3 + \varepsilon_{ih}$$

where i is an individual subscript, h is the household subscript, z are control 
variables for the treatment assignment, necessary to ensure that assumption 1 
in section 24.5 is satisfied, and x are control variables related to the outcome 
that may improve the precision of estimates. 


We fit OLS regressions of oop on lottery with increasing sets of controls. 


. * Regression of outcome on treatment with various controls 
. qui regress $y lottery, vce(robust) 
. estimates store Diff_rob 
. qui regress $y lottery, vce(cluster household_id) 
. estimates store Diff_clu 
. qui regress $y lottery $zlist, vce(cluster household_id) 
. estimates store zlist 
. qui regress $y lottery $xlist, vce(cluster household_id) 
. estimates store xlist 
. qui regress $y lottery $zlist $xlist, vce(cluster household_id) 
. estimates store Both 
. estimates table Diff_rob Diff_clu zlist xlist Both, keep(lottery) 
>     b(%8.4f) se stat(N r2) 

    Variable    Diff_rob    Diff_clu      zlist       xlist       Both 

     lottery    -44.6627    -44.6627    -40.9243    -45.7420    -40.1693 
                  9.7294      9.9216     10.1302     10.6484     10.8169 
           N       22679       22679       22679       19393       19393 
          r2      0.0009      0.0009      0.0018      0.0147      0.0154 

                                                          Legend: b/se 


The results are consistent across the regressions, with a highly statistically 
significant reduction of $40-$45 on a base of $270. As already explained, 
the preferred estimates include the z variables as controls. 


The first estimates are essentially the same as those from ttest oop,
by(lottery) unequal. The second estimates use cluster-robust standard
errors that cluster on household, something done in all the subsequent
analysis, aside from matching estimators, leading to slightly higher standard
errors. The third estimates add various lottery and survey variables as
controls. The fourth estimates include additional variables likely to be
correlated with the outcome, leading to a loss of 3,286 observations due to
missing data. The fifth combines both sets of controls. The control variables
have very low explanatory power.
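
The link between the unequal-variance t test and OLS on a treatment dummy can be checked on any dataset. The following is a minimal sketch that uses Stata's shipped auto dataset rather than the OHIE data (an assumption made only so that the example is self-contained); price and foreign stand in for the outcome and the treatment dummy.

```stata
* Sketch: difference in means via ttest and via OLS on a dummy
* (auto.dta is a stand-in for the OHIE data)
sysuse auto, clear

* Unequal-variance t test; Stata reports mean(group 0) - mean(group 1)
ttest price, by(foreign) unequal
scalar diff_tt = r(mu_2) - r(mu_1)

* OLS on the dummy gives the same point estimate; the robust standard
* error is close to, though not identical to, the one from ttest
regress price foreign, vce(robust)
display "ttest difference = " diff_tt "   OLS coefficient = " _b[foreign]
```

The point estimates coincide by construction; only the standard errors can differ slightly, which is why the text describes the two sets of results as essentially the same.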


24.9 Treatment-effect estimates using the OHIE data 


The article by Finkelstein et al. (2012, 1093) and the supplemental material 
provide an excellent benchmark for evaluation of an RCT. The impact of 
lottery on various outcomes was estimated by regression of the outcome on 
lottery, and an expanded list of the lottery variables given in zlist. 
Because this was a well-designed RCT, more complex analysis was 
unnecessary. 


For pedagogical purposes, we demonstrate the more complex analysis. 
This becomes essential when data are from an observational study rather
than an RCT, though it is valid in an observational study only if sufficient
controls are included to make the selection-on-observables-only assumption
reasonable.


We consider in turn regression imputation methods, IPW methods, and
matching methods. For brevity, we generally estimate only the ATE, the
default, and use default computer output that simply provides the ATE. The
options atet and pomeans provide estimates of the ATET and POMs, and the
aequations option lists the estimates of the underlying regression models.


For this RCT, we expect the various methods to lead to an estimated ATE 
similar to that from simple OLS regression on lottery and the z variables. In 
other applications with observational data, we expect bigger departures from 
simple OLS regression, though in an ideal world, we expect similar answers 
across the various treatment evaluation methods. 


24.9.1 RA estimates 


We perform RA analysis (see section 24.6.1), using as controls both the
necessary treatment assignment controls (zlist) and additional controls that
may increase estimator precision (xlist). The default teffects ra
command provides the estimated ATE. The option aequations additionally
provides the estimates of the two underlying OLS regressions.


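. * Regression-adjusted ATE using $zlist and $xlist
. teffects ra ($y $xlist $zlist) (lottery), nolog vce(cluster household_id)

Treatment-effects estimation                   Number of obs      =     19,393
Estimator      : regression adjustment
Outcome model  : linear
Treatment model: none
                        (Std. err. adjusted for 17,348 clusters in household_id)
-------------------------------------------------------------------------------
              |               Robust
          oop | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
ATE           |
      lottery |
    (Selected |
          vs  |
Not selected) |  -40.44588   10.94579    -3.70   0.000    -61.89924   -18.99252
--------------+----------------------------------------------------------------
POmean        |
      lottery |
Not selected  |   292.1394   7.693716    37.97   0.000       277.06    307.2188
-------------------------------------------------------------------------------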


The ATE estimate and its standard error are very similar to the estimates of
-40.17 and 10.82 obtained earlier by the simpler single OLS regression.


For completeness, we also obtain the estimated POMs.


. * Regression-adjusted POMs using $zlist and $xlist
. teffects ra ($y $xlist $zlist) (lottery), pomeans nolog vce(clu household_id)

Treatment-effects estimation                   Number of obs      =     19,393
Estimator      : regression adjustment
Outcome model  : linear
Treatment model: none
                        (Std. err. adjusted for 17,348 clusters in household_id)
-------------------------------------------------------------------------------
              |               Robust
          oop | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
POmeans       |
      lottery |
Not selected  |   292.1394   7.693716    37.97   0.000       277.06    307.2188
    Selected  |   251.6935   7.831481    32.14   0.000     236.3441    267.0429
-------------------------------------------------------------------------------


The difference 251.6935 - 292.1394 = -40.4459 is the preceding ATE
estimate. By comparison, the raw means of the data by lottery status for this
sample with N = 19,393 equal 291.2125 and 246.5498.
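
The mechanics behind these numbers can be reproduced by hand. The sketch below uses simulated data (the variables x, d, and y and the data-generating process are hypothetical, not the OHIE data): it fits a separate OLS regression for each treatment group, predicts both potential outcomes for every observation, and averages them.

```stata
* Sketch: RA by hand on simulated data (all names are hypothetical)
clear
set seed 10101
set obs 2000
generate x = rnormal()
generate d = runiform() < invlogit(0.5*x)       // treatment depends on x
generate y = 1 + 2*d + x + 0.5*d*x + rnormal()  // outcome

* Separate OLS fits by treatment status, with predictions for everyone
quietly regress y x if d==0
predict double y0hat
quietly regress y x if d==1
predict double y1hat

* POMs are the sample averages of the predictions; the ATE is their difference
quietly summarize y0hat, meanonly
scalar pom0 = r(mean)
quietly summarize y1hat, meanonly
scalar pom1 = r(mean)
display "POM0 = " pom0 "   POM1 = " pom1 "   ATE = " pom1 - pom0

* Same point estimates from the built-in command
teffects ra (y x) (d), pomeans
```

The hand calculation gives the point estimates only; teffects ra additionally delivers standard errors that account for the estimation of both regressions.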


We next obtain the ATET. 


. * Regression-adjusted ATET using $zlist and $xlist 
. teffects ra ($y $xlist $zlist) (lottery), atet nolog vce(cluster household_id) 


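Treatment-effects estimation                   Number of obs      =     19,393
Estimator      : regression adjustment
Outcome model  : linear
Treatment model: none
                        (Std. err. adjusted for 17,348 clusters in household_id)
-------------------------------------------------------------------------------
              |               Robust
          oop | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
ATET          |
      lottery |
    (Selected |
          vs  |
Not selected) |  -35.98451   11.10953    -3.24   0.001    -57.75878   -14.21024
--------------+----------------------------------------------------------------
POmean        |
      lottery |
Not selected  |   288.2787   8.271055    34.85   0.000     272.0678    304.4897
-------------------------------------------------------------------------------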


The estimated ATET is -35.98 compared with the ATE estimate of -40.45.
The difference between the two reflects in part the particular composition of
the lottery winner group compared with all lottery participants. Significant
differences in the impact of control variables on the respective outcomes of
the treated and untreated may also partially account for the difference
between the two TEs.


24.9.2 IPW estimates 


The IPW method is presented in section 24.6.2. The default teffects ipw
command obtains weights based on a logit model. Here the regressors are
zlist variables associated with the randomization design, so the sample size
is larger than in the preceding example where xlist variables were also used.


. * IPW ATE using $zlist 
. teffects ipw ($y) (lottery $zlist), nolog vce(cluster household_id) 


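Treatment-effects estimation                   Number of obs      =     22,679
Estimator      : inverse-probability weights
Outcome model  : weighted mean
Treatment model: logit
                        (Std. err. adjusted for 20,148 clusters in household_id)
-------------------------------------------------------------------------------
              |               Robust
          oop | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
ATE           |
      lottery |
    (Selected |
          vs  |
Not selected) |  -39.56951   10.12338    -3.91   0.000    -59.41098   -19.72804
--------------+----------------------------------------------------------------
POmean        |
      lottery |
Not selected  |   286.0328   7.275387    39.32   0.000     271.7733    300.2923
-------------------------------------------------------------------------------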


The ATE estimate of -39.57 is similar to the estimates from the
regression-adjusted model, despite a difference in sample size, and the
confidence interval [-59.4, -19.7] is slightly narrower because of a lower
standard error.
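
As a rough illustration of what the IPW estimator does, the sketch below computes it by hand on simulated data (x, d, and y are hypothetical names, and the data-generating process is an assumption chosen for illustration). It assumes the weighted-mean form of the estimator, in which the weights are normalized within each treatment group.

```stata
* Sketch: IPW ATE by hand on simulated data (all names are hypothetical)
clear
set seed 10101
set obs 2000
generate x = rnormal()
generate d = runiform() < invlogit(0.5*x)
generate y = 1 + 2*d + x + rnormal()

* Step 1: logit propensity score Pr(d = 1 | x)
quietly logit d x
predict double ps, pr

* Step 2: inverse-probability weights, 1/ps if treated and 1/(1-ps) if not
generate double w = cond(d==1, 1/ps, 1/(1-ps))

* Step 3: weighted means of y by treatment status
quietly summarize y [aweight=w] if d==1
scalar pom1 = r(mean)
quietly summarize y [aweight=w] if d==0
scalar pom0 = r(mean)
display "IPW ATE by hand = " pom1 - pom0

* Compare with the built-in estimator
teffects ipw (y) (d x)
```

An observation with a propensity score near 0 or 1 receives a very large weight in step 2, which is the motivation for inspecting the propensity-score distribution before relying on IPW.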


24.9.3 Propensity-score overlap 


The propensity score should be evaluated for two reasons. First, an 
important assumption is the overlap assumption that for each value of the 
control variables, there are both treated and untreated cases. Second, if the 
propensity score for an observation is very close to 0 or very close to 1, then 
IPW methods can give that observation a very high weight.


The postestimation command teoverlap provides a kernel density 
estimate plot for the propensity score at the different values of the treatment 
variable. Ideally, the two densities are very similar. The default is to 
calculate predicted probabilities for the lowest treatment level, here D = 0. 
We add the option ptlevel(1) so that the predicted probabilities are instead
those of the propensity score, Pr(D = 1|z). We have 


. * Overlap assumption - graph using teoverlap
. teoverlap, ptlevel(1) kernel(triangle) bw(0.04)
>     title("Kernel density overlap") xtitle("Propensity score")

Figure 24.2. Propensity-score overlap


The first panel of figure 24.2 plots the two propensity-score densities. 
The propensity scores appear to be mostly in the range 0.25 to 0.75, so they 
are away from the boundaries of 0 and 1. For lottery winners, the lighter 
line, the lowest propensity score is only 0.3, yet many lottery losers, the 
darker line, have a lower propensity score. And there is a slight bump for 
lottery winners around 0.9. The plots suggest that the overlap is essentially 
in the range 0.3 to 0.75. Analysis might best be restricted to that range, 
leading to a loss of 872 observations or 3.8% of the sample, or at least 
restricted to that range in a robustness test. 


The predict postestimation command following the teffects ipw 
command computes the predicted probability that D = 0, so the propensity 
score can be computed as one minus this estimate. Equivalently, we predict 
following the logit command. 


. * Overlap assumption - manual graph using logit prediction and histograms
. qui logit lottery $zlist, vce(cluster household_id)
. qui predict lothat
. summarize lothat

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      lothat |     22,679    .4972001    .1301477   .2716641   .9257978

. twoway (hist lothat if lottery==0, fcol(white)) (hist lothat if lottery==1),
>     title("Histogram overlap") xtitle("Propensity score")
>     legend(label(1 "lottery=Not selected") label(2 "lottery=Selected"))


The IPW weights will not take extreme values, even if all the sample is used,
because the propensity scores range from 0.272 to 0.926. This range does
not really change if a more flexible model is used by fully interacting the
variables in zlist. As explained in section 24.8, for this RCT example,
propensity scores are expected to be a considerable amount away from the
boundary values of 0 and 1. For an observational study with a very flexible
binary outcome model, extreme values may be more likely to occur.


The second panel of figure 24.2 plots histograms of the propensity scores
by lottery status and leads to the same conclusions as the first panel. If
propensity scores are felt to not overlap or are felt to be too close to 0 or 1,
leading to concerns about IPW estimator instability, one can trim using the
estimated propensity scores and the if qualifier. For example, give the
command teffects ipw ($y) (lottery $zlist) if (lothat>0.3) &
(lothat<0.75).
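
A minimal sketch of this trimming workflow follows, again on simulated data. The names x, d, y, and pshat and the cutoffs 0.1 and 0.9 are hypothetical choices for illustration; they are not the values used for the OHIE data.

```stata
* Sketch: trimming on the estimated propensity score
* (simulated data; names and cutoffs are hypothetical)
clear
set seed 10101
set obs 2000
generate x = rnormal()
generate d = runiform() < invlogit(1.5*x)
generate y = 1 + 2*d + x + rnormal()

* Estimated propensity score from a logit model
quietly logit d x
predict double pshat, pr

* Number of observations that trimming would drop
count if pshat <= 0.1 | pshat >= 0.9
display "Observations dropped by trimming: " r(N)

* IPW on the full sample and on the trimmed sample
teffects ipw (y) (d x)
teffects ipw (y) (d x) if pshat > 0.1 & pshat < 0.9
```

Note that the second teffects ipw call reestimates the propensity-score model on the trimmed sample, so the trimmed estimate refers to a different population than the full-sample estimate.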


24.9.4 Regressor balance 


Regressor balance is essential. The tebalance postestimation commands are 
available after all teffects commands that use propensity scores or 
matching but are not available after the teffects ra command. 


Here we check balance following the teffects ipw command, beginning 
with the tebalance summarize command. 


. * Covariate balance summary
. qui teffects ipw ($y) (lottery $zlist), nolog vce(cluster household_id)
. tebalance summarize

Covariate balance summary
                                              Raw      Weighted
                      -----------------------------------------
                      Number of obs =      22,679      22,679.0
                      Treated obs   =      11,276      11,151.4
                      Control obs   =      11,403      11,527.6
                      -----------------------------------------

-----------------------------------------------------------------
             | Standardized differences         Variance ratio
             |        Raw    Weighted          Raw     Weighted
-------------+---------------------------------------------------
    dhhsize2 |   .1939117    .0060577      1.19006     1.005543
    dhhsize3 |   .0797051    .0021402      8.56137      1.04358
   dlotdraw2 |   .0331758    .0015502     1.089294     1.004479
   dlotdraw3 |   .0365961    .0046157     1.100489     1.013669
   dlotdraw4 |   .0257401    .0069954     .9342648     .9817981
   dlotdraw5 |   .0003948    .0069064     1.001076     .9816374
   dlotdraw6 |   .0083996    .0028377     .9873158     .9960122
   dlotdraw7 |   .0192988    .0054252      .970481     1.007623
   dlotdraw8 |   .0254451    .0079989     .9341439     1.020275
  dsurvdraw2 |   .2260162    .0055742     1.772475     .9868448
  dsurvdraw3 |   .2397501    .0094081     1.838261     .9780227
  dsurvdraw4 |   .0404115    .0106734     .9196795     1.022566
  dsurvdraw5 |   .0458372    .0091592     .9098577     1.019404
  dsurvdraw6 |   .1878871    .0052308     .7566851     1.007905
  dsurvdraw7 |   -.290524    .0054474     .5950349     .9903239
-----------------------------------------------------------------


The first part of the output shows that after weighting, the effective number 
of treated and control observations is similar to that in the raw data. In other 
settings, especially with observational data, there can be much bigger 
differences. 


The output then shows both the raw and weighted differences that were 
defined in section 24.6.4. The raw difference measure, calculated variable by 
variable, refers to differences before applying the inverse-probability 
weights, while weighted differences refer to the differences after applying 
inverse-probability weighting. The weighted differences are small, and most 
of the variance ratios are close to 1, indicating that weighting has corrected 
the lack of balance in the raw data, especially that in variables dhhsize3,
dsurvdraw6, and dsurvdraw7.
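
For reference, the two diagnostics are usually defined as follows. This is a sketch of the standard definitions, stated here as an assumption because section 24.6.4 is not reproduced in this passage; $\bar{x}_1, \bar{x}_0$ and $s_1^2, s_0^2$ denote the treated and control sample means and variances of a covariate, computed either without weights (raw) or with the inverse-probability weights (weighted).

$$ \delta = \frac{\bar{x}_1 - \bar{x}_0}{\sqrt{(s_1^2 + s_0^2)/2}}, \qquad \rho = \frac{s_1^2}{s_0^2} $$

Perfect balance corresponds to $\delta = 0$ and $\rho = 1$. For dsurvdraw7, for instance, weighting moves $\delta$ from -0.291 to 0.005 and $\rho$ from 0.595 to 0.990.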


The tebalance density command provides diagnostic plots for 
individual variables that contrast a kernel density plot for the raw variable 
with the kernel density plot for the weighted variable. 


Following the teffects nnmatch and teffects psmatch commands, the 
tebalance box command provides diagnostic box plots for individual 
variables. 


Following the teffects ipw, teffects aipw, and teffects ipwra 
commands, the tebalance overid command provides a formal test of 
restrictions imposed by the balance requirement. For the current example, 
we obtain 


. * Formal test of balancing 
. tebalance overid 


Iteration 0: criterion = .00024449 
(output omitted ) 
Iteration 192: criterion = .00278518 (backed up) 


Overidentification test for covariate balance 
H0: Covariates are balanced:


chi2(16) = 358.044 
Prob > chi2 = 0.0000 


The null hypothesis of balance is rejected, though in general, with a very 
large sample size, almost any specification test is likely to reject at a fixed 
level of significance such as 0.05. The reasonable balance after weighting 
suggested by the tebalance summarize command may be a better guide. We
revisit balance in section 24.9.7. 


24.9.5 AIPW and IPW-RA estimates 


We next present the two doubly robust hybrid methods detailed in 
section 24.6.5 that combine RA and inverse-probability weighting. 


For the AIPW estimator, we add the aequations option to the teffects
aipw command to obtain output that in addition to the estimated ATE includes
the estimates for the two regressions for the POM and the logit regression for
the propensity score.


. * Augmented IPW ATE 
. teffects aipw ($y $xlist) (lottery $zlist), aequations nolog
>     vce(cluster household_id)

Treatment-effects estimation                   Number of obs      =     19,393
Estimator      : augmented IPW
Outcome model  : linear by ML
Treatment model: logit
                        (Std. err. adjusted for 17,348 clusters in household_id)
-------------------------------------------------------------------------------
              |               Robust
          oop | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
ATE           |
      lottery |
    (Selected |
          vs  |
Not selected) |  -38.66964   10.79616    -3.58   0.000    -59.82973   -17.50956
--------------+----------------------------------------------------------------
POmean        |
      lottery |
Not selected  |   290.2338   7.720553    37.59   0.000     275.1018    305.3658
--------------+----------------------------------------------------------------
OME0          |
       dsmoke |   .9205123   8.662252     0.11   0.915    -16.05719    17.89821
        hhinc |   1.005393   .1398412     7.19   0.000      .731309    1.279477
       deduc2 |   8.705563   21.02138     0.41   0.679    -32.49559    49.90671
       deduc3 |    78.9117   25.07941     3.15   0.002     29.75696    128.0664
       deduc4 |   61.82869   29.22027     2.12   0.034     4.558018    119.0994
     demploy2 |  -16.27391   26.17013    -0.62   0.534    -67.56643    35.01861
     demploy3 |  -27.16429   24.70588    -1.10   0.272    -75.58692    21.25834
     demploy4 |  -27.29717    20.0837    -1.36   0.174    -66.66049    12.06615
        _cons |   199.0383    25.9641     7.67   0.000     148.1496     249.927
--------------+----------------------------------------------------------------
OME1          |
       dsmoke |   23.37641   8.211131     2.85   0.004     7.282887    39.46993
        hhinc |     1.1747   .1409721     8.33   0.000     .8983998       1.451
       deduc2 |  -26.81169   22.21343    -1.21   0.227    -70.34921    16.72582
       deduc3 |   38.89857   26.51898     1.47   0.142    -13.07768    90.87481
       deduc4 |   17.02745    33.1716     0.51   0.608    -47.98769    82.04259
     demploy2 |  -59.06455   21.31782    -2.77   0.006    -100.8467    -17.2824
     demploy3 |  -74.72525   21.30281    -3.51   0.000     -116.478   -32.97251
     demploy4 |  -18.49857   20.74892    -0.89   0.373     -59.1657    22.16856
        _cons |   127.8238   25.06627     5.10   0.000     78.69479    176.9528
--------------+----------------------------------------------------------------
TME1          |   (std. err., z, and P>|z| columns omitted)
     dhhsize2 |   .2712528                                 .1933239    .3491817
     dhhsize3 |   1.671956                                  .300013      3.0439
    dlotdraw2 |   .0329293                                -.1273357    .1931943
    dlotdraw3 |  -.0134305                                -.1703366    .1434756
    dlotdraw4 |   .2419665                                 .0907886    .3931443
    dlotdraw5 |   .3180178                                 .1649021    .4711335
    dlotdraw6 |   .3624306                                 .2289176    .4959436
    dlotdraw7 |    .437767                                 .3030318    .5725023
    dlotdraw8 |   .3686761                                  .216376    .5209762
   dsurvdraw2 |   .0720634                                 -.110376    .2545028
   dsurvdraw3 |   .0894928                                -.0915468    .2705323
   dsurvdraw4 |  -.8057285                                -.9554703   -.6559866
   dsurvdraw5 |  -.8464725                                -.9963147   -.6966302
   dsurvdraw6 |  -1.117297                                -1.257971   -.9766228
   dsurvdraw7 |  -1.319487                                -1.461543   -1.177431
        _cons |   .3340067                                 .2291829    .4388306
-------------------------------------------------------------------------------

The ATE estimate of -38.67 is slightly smaller in absolute value than the -39.57 from the ipw output.


IPW-RA using defaults for the teffects ipwra command yields the 
following result: 


. * IPW with RA ATE 
. teffects ipwra ($y $xlist $zlist) (lottery $zlist), nolog 
> vce(cluster household_id) 


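Treatment-effects estimation                   Number of obs      =     19,393
Estimator      : IPW regression adjustment
Outcome model  : linear
Treatment model: logit
                        (Std. err. adjusted for 17,348 clusters in household_id)
-------------------------------------------------------------------------------
              |               Robust
          oop | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
ATE           |
      lottery |
    (Selected |
          vs  |
Not selected) |  -39.93113   10.93502    -3.65   0.000    -61.36337   -18.49888
--------------+----------------------------------------------------------------
POmean        |
      lottery |
Not selected  |   291.8993   7.686289    37.98   0.000     276.8344    306.9641
-------------------------------------------------------------------------------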


The ATE estimate of -39.93 is very similar to -38.67 obtained using AIPW.


The AIPW can also be obtained using the telasso command, which uses
the lasso to select a subset of many potential control variables; see
section 28.9.4.


24.9.6 PSM estimates 


The teffects psmatch command implements PSM; see section 24.6.7. The
matches are created using Stata’s default option. Default
heteroskedastic-robust standard errors are used because obtaining standard
errors for matching estimators is a nonstandard inference problem, and the
vce(cluster clustvar) option is unavailable for matching estimators.


. * PSM ATE 
. teffects psmatch ($y) (lottery $zlist) 
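
Treatment-effects estimation                   Number of obs      =     22,679
Estimator      : propensity-score matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Treatment model: logit                                        max =       1018
-------------------------------------------------------------------------------
              |              AI robust
          oop | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
ATE           |
      lottery |
    (Selected |
          vs  |
Not selected) |    -48.935   14.08062    -3.48   0.001    -76.53251   -21.33749
-------------------------------------------------------------------------------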


The resulting ATE estimate of -48.94 is about 20% larger in absolute value
than the preceding estimates.


24.9.7 Blocking and stratified estimate 


As already noted, the teoverlap graph clearly suggests that overlap is not
uniformly good over the range of propensity scores. This provides added
incentive to use blocking, presented in section 24.6.5. We use the
community-contributed pscore command (Becker and Ichino 2002) because
the teffects commands do not provide an option for blocking. The pscore
command provides a detailed balancing test by first blocking the
observations by ranges of the propensity score and then testing for regressor
balance within each block. The Stata tebalance command instead uses all
observations, though one could manually form blocks and apply the
tebalance command to each block.


The pscore command algorithm proceeds as follows. The propensity 
score is estimated using a probit (the default) or logit model, and the fitted 
propensity-score values are sorted in ascending order. The sample is split 
into k equally spaced intervals (blocks) of the ordered propensity score, 
where k takes default value 5 or a user-specified value (here 8). Within each
interval, we test that the average propensity score of treated and control units 
does not differ and split the interval into two if the test fails in that interval 
and test again. The procedure stops when in all intervals the average 
propensity score of treated and control units does not differ. Separate 
balancing tests of equality between treated and control units are performed 
for each regressor within each block. 


We use the blockid() and pscore() options to save results for 
subsequent estimation of the ATE. We suppress the lengthy output, which 
includes 


. * Blocking and subsequent balancing tests using community-contributed pscore command
. pscore lottery $zlist, logit blockid(myblock) pscore(pshat) numblo(8)


(output omitted ) 
The final number of blocks is 43 


This number of blocks ensures that the mean propensity score 
is not different for treated and controls in each blocks 


(output omitted ) 
The balancing property is not satisfied 
Try a different specification of the propensity score 


(output omitted ) 


From output not reproduced here, there are 38 blocks that range in size from 
57 to 2,830 observations. Of the 550 distinct block-regressor combinations, 
25 are unbalanced, leading to the warning message. 


The pscore package includes commands that implement various
matching methods; some are not covered by the teffects matching
commands. It also includes the atts command, which obtains a stratified
estimate of ATE that is a weighted average of the ATEs in each block. We
obtain


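. * PSM ATE using blocking and community-contributed atts command
. atts oop lottery, blockid(myblock) pscore(pshat)

ATT estimation with the Stratification method
Analytical standard errors

---------------------------------------------------------
n. treat.   n. contr.         ATT   Std. Err.           t
---------------------------------------------------------
    11276       11403     -34.860      13.552      -2.572
---------------------------------------------------------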


The estimated ATE is -34.86, a little lower in absolute value than the
preceding estimates, which ranged from -38.67 to -48.94.
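
The idea of a stratified estimate can be illustrated without the pscore package. The sketch below is a simplified stand-in, not the pscore or atts algorithm: it uses simulated data with hypothetical names, forms five blocks from quintiles of the estimated propensity score rather than by iterative splitting, and weights each block by its total size. The atts output above is labeled ATT, so the weighting used there may differ from the one in this sketch.

```stata
* Sketch: stratified estimate on propensity-score blocks
* (simulated data; five quantile blocks, not the pscore algorithm)
clear
set seed 10101
set obs 5000
generate x = rnormal()
generate d = runiform() < invlogit(0.5*x)
generate y = 1 + 2*d + x + rnormal()

* Estimated propensity score and five equal-sized blocks
quietly logit d x
predict double pshat, pr
xtile block = pshat, nquantiles(5)

* Within-block difference in mean outcomes, treated minus control
generate double te_block = .
forvalues b = 1/5 {
    quietly summarize y if d==1 & block==`b', meanonly
    scalar m1 = r(mean)
    quietly summarize y if d==0 & block==`b', meanonly
    quietly replace te_block = m1 - r(mean) if block==`b'
}

* Averaging over all observations weights each block by its size
summarize te_block, meanonly
display "Stratified estimate = " r(mean)
```

With only five blocks, some imbalance in x remains within each block, so the estimate is close to, but not exactly, the effect of 2 built into the simulation.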


24.9.8 NNM estimates 


For NNM, detailed in section 24.6.9, we split the variables in zlist into the
wave variables, for which matching is based on Mahalanobis distance, and
household size, for which matching is exact. Because only a few households
had more than two members, we combine those households with
two-member households. The teffects nnmatch command yields


. * NNM ATE 
. generate dhhbig = dhhsize2 + dhhsize3 


. teffects nnmatch ($y $wave) (lottery), ematch(dhhbig) metric(mahalanobis) 


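Treatment-effects estimation                   Number of obs      =     22,679
Estimator      : nearest-neighbor matching     Matches: requested =          1
Outcome model  : matching                                     min =          5
Distance metric: Mahalanobis                                  max =       1018
-------------------------------------------------------------------------------
              |              AI robust
          oop | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
ATE           |
      lottery |
    (Selected |
          vs  |
Not selected) |  -48.68898   13.83561    -3.52   0.000    -75.80629   -21.57168
-------------------------------------------------------------------------------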


The estimated ATE is -48.69, compared with the preceding estimates, which
ranged from -38.67 to -48.94.


24.10 Multilevel treatment effects 


The preceding discussion has focused on a binary setup with just one level 
of treatment that is either received or not received. Forcing treatment 
evaluation into a binary framework may lead to aggregation of a collection 
of heterogeneous treatments. For example, contrasting the uninsured with 
the insured population would be an oversimplification if there are many 
types of insurance plans that are heterogeneous with respect to coverage, 
premiums, deductibles, benefits, and so forth. A multilevel treatment (MLT) 
model allows greater homogeneity of treatment within each level of 
treatment. 


In a multilevel model, each subject receives one of several mutually 
exclusive treatments. The treatment levels may be ordered according to 
some criterion. The treatments may be exogenous or self-selected. For 
example, in a double-blind drug trial, there may be several alternative 
dosages, and each level of intensity can be considered a separate treatment. 
In a job training setting, the participants may be subjected to or choose 
between training periods of different lengths and intensity. TEs then refer to 
the impact of the treatment relative to not receiving any treatment or 
receiving one of the other treatments in the available set. 


24.10.1 Multivariate TEs methods 


We extend the RA approach of binary treatment to the multilevel case. 
Assume that there are m mutually exclusive treatments and one of the 
alternatives is the base level. Regardless of whether treatments are ordered 
or unordered, there can be pairwise comparisons between adjacent levels of 
treatment or even between any other pair; in each case, the base level can 
change. 


An RA regression model for the outcome y is 


$$ y_{ij} = \alpha_j D_{ij} + x_i'\beta_j + \varepsilon_{ij}, \quad i = 1,\ldots,N_j, \quad j = 1,\ldots,m $$


where $D_{ij}$ denotes the binary indicator for the jth treatment and x does not
include an intercept.


The assumptions regarding treatment assignment are those that were 
used in the binary case. We assume individualistic treatment, probabilistic 
treatment, and conditional independence. Hence, we again make the 
selection-on-observables assumption that $E(\varepsilon_{ij} \mid x_i, D_{ij}) = 0$.


The objective is the estimation of TEs based on a paired comparison of
the potential outcomes of those who are assigned treatment j, the treated
group, with those assigned to treatment k, the control group. That is, the ATE
for the (j, k)th pair is


$$ \begin{aligned} \text{ATE}_{(j,k)} &= E(y_{ij} \mid x_i, D_{ij} = 1) - E(y_{ik} \mid x_i, D_{ik} = 1) \\ &= x_i'(\beta_j - \beta_k) + (\alpha_j - \alpha_k) \end{aligned} $$


An immediate implication is that MLT involves more parameters, more 
counterfactuals, and more computations to support pairwise comparisons. 
The binary treatment methodology can still be used for specified pairwise 
comparisons. However, in general, the MLT framework will deliver greater 
efficiency; see Cattaneo (2010), who establishes consistency and normality 
of a class of ATE estimators. 


Stata’s teffects commands can be used to model MLT data. The 
commands teffects ra, teffects ipw, teffects ipwra, and teffects aipw 
extend to MLTs and are illustrated in the next section. Currently, the 
teffects psmatch and teffects nnmatch commands are not available for 
MLT. 


24.10.2 Multivalued TEs application 


For illustration, we use data on annual prescription drug expenditures of the 
elderly Medicare population in the United States in 2003 and 2004 derived 
from the Medicare Current Beneficiary Survey. At that time, Medicare did 
not cover prescription drug expenditures, unless one was in the hospital. 


The data used here are a subset of the data used in Li and Trivedi (2016).
Prescription drug coverage can be obtained through various privately
obtained sources, including Medicare supplemental insurance plans
(Medigap), Medicare managed care plans (MMC), and employer-sponsored
plans (ESI) that some retirees can still access. Our sample includes
individuals with prescription drug insurance from these three sources, as
well as a control group of Medicare-only (Medicare) elderly with no
coverage for prescription drugs. The objective is to estimate TEs of the three
levels of insurance (inslevel) that cover prescription drugs compared with
Medicare-only (Medicare) coverage.


The dependent variable drugexp is total annual drug expenditure. 


. * Outcome - drug expenditure
. qui use mus224mcbs, clear
. generate drugexp = aamttot
. summarize drugexp

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     drugexp |      7,664    1689.965    1890.913          0   54933.82


The distribution is very right skewed, leading to a standard deviation greater 
than the mean, and 94% of the sample has positive drug expenditures. 


The treatment variable clevel is the type of insurance. We order this by
generosity to ease interpretability of results, though this makes no difference 
to the empirical results because the methods used in this section treat the 
insurance categories as unordered. 


. * Create treatment variable clevel - insurance ordered by increasing generosity
. qui generate clevel = 0 if coverage == "Medicare" // Medicare
. qui replace clevel = 1 if coverage == "MMC"       // MMC (Medicare managed plan)
. qui replace clevel = 2 if coverage == "Medigap"   // Medigap (Medicare suppl ins)
. qui replace clevel = 3 if coverage == "ESI"       // ESI (employer-sponsored)
. tabulate clevel

     clevel |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,054       13.75       13.75
          1 |      1,170       15.27       29.02
          2 |      2,354       30.72       59.73
          3 |      3,086       40.27      100.00
------------+-----------------------------------
      Total |      7,664      100.00


We contrast mean expenditure across insurance by running an OLS 
regression of expenditure on insurance level. 


. * Variation in mean drug expenditure by type of insurance
. generate dmedicare = clevel == 0
. generate dmmc = clevel == 1
. generate dmedigap = clevel == 2
. generate desi = clevel == 3
. regress drugexp dmmc dmedigap desi, noheader vce(robust)

------------------------------------------------------------------------------
             |               Robust
     drugexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        dmmc |   185.7296   55.95032     3.32   0.001     76.05164    295.4075
    dmedigap |   441.2021   47.57151     9.27   0.000      347.949    534.4553
        desi |   1283.898     56.569    22.70   0.000     1173.007    1394.789
       _cons |   1009.119   37.82362    26.68   0.000     934.9748    1083.264
------------------------------------------------------------------------------


Average drug expenditures for those with only Medicare are $1,009. 
Compared with this base category, the average drug expenditures are $186 
higher for Medicare managed plans, $441 higher for Medigap plans, and 
$1,284 higher for employer-sponsored plans. 


24.10.3 RA estimates 


We use age, gender, ethnicity, income, and health status (measured on a 1 to 
5 scale) as regressors. 


. * Variation in mean drug expenditure by type of insurance
. global xlist h_age h_male h_white income_c genhelth
. summarize $xlist

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       h_age |      7,664    77.22599    7.239356         65        104
      h_male |      7,664    .4579854    .4982642          0          1
     h_white |      7,664    .9060543    .2917722          0          1
    income_c |      7,664     29855.9    45852.22          0    2000000
    genhelth |      7,664    2.624478    1.060683          1          5


Because the data are so right skewed, we view a better model for the
conditional mean of drug expenditures to be an exponential model rather
than a linear model. So we use the poisson option of the teffects ra
command rather than the default of OLS. This does not require that the data
be Poisson distributed (see section 20.2.4) and has the advantage compared
with a log-linear model of including observations with zero drug
expenditures.
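
The point about zeros can be seen in a small simulation. The sketch below is illustrative only: the variables x and y and the data-generating process are hypothetical. The outcome is continuous and right skewed, has conditional mean exp(1 + 0.5x), and is zero for about 6% of observations.

```stata
* Sketch: exponential conditional mean fit by Poisson quasi-ML
* (simulated data; all names are hypothetical)
clear
set seed 10101
set obs 5000
generate x = rnormal()

* Continuous, skewed outcome with E(y|x) = exp(1 + 0.5*x) and some zeros
generate y = exp(1 + 0.5*x) * rgamma(0.5, 2) * (runiform() > 0.06) / 0.94

* Poisson regression with robust standard errors uses all observations
poisson y x, vce(robust)

* A log-linear regression drops the zeros because ln(0) is missing
generate lny = ln(y)
regress lny x, vce(robust)
```

The Poisson fit uses all 5,000 observations, whereas the log-linear regression silently loses those with y equal to zero, and its coefficients describe the mean of ln(y) rather than the mean of y.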


. * RA ATE 

. teffects ra (drugexp $xlist, poisson) (clevel), nolog 

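Treatment-effects estimation                   Number of obs      =      7,664
Estimator      : regression adjustment
Outcome model  : Poisson
Treatment model: none
------------------------------------------------------------------------------
             |               Robust
     drugexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
ATE          |
      clevel |
   (1 vs 0)  |   157.6892   56.31752     2.80   0.005     47.30889    268.0695
   (2 vs 0)  |   365.7971   48.27608     7.58   0.000     271.1777    460.4165
   (3 vs 0)  |   1200.963   54.23313    22.14   0.000     1094.668    1307.258
-------------+----------------------------------------------------------------
POmean       |
      clevel |
          0  |   1065.266   39.99379    26.64   0.000     986.8801    1143.653
------------------------------------------------------------------------------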


The ATE is estimated for each insured group relative to the Medicare (no drug
insurance) benchmark. As expected, the ATE is the largest for the ESI
category ($1,201) and the smallest for the MMC category ($158). The three
estimates are statistically significantly different at level 0.05 from the
benchmark Medicare only, and their respective 95% confidence intervals do
not overlap.


The estimates are similar to those found by the simpler OLS on dummies 
regression given earlier, with $1,009 for Medicare-only and, respectively, 
$186, $441, and $1,284 for the three types of prescription drug coverage. 


To generate the ATET (ATE conditional on receiving treatment), we add the 
atet option to the previous command. 


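. * RA ATET
. teffects ra (drugexp $xlist, poisson) (clevel), atet nolog

Treatment-effects estimation                   Number of obs      =      7,664
Estimator      : regression adjustment
Outcome model  : Poisson
Treatment model: none
------------------------------------------------------------------------------
             |               Robust
     drugexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
ATET         |
      clevel |
   (1 vs 0)  |   181.5681   54.92571     3.31   0.001     73.91567    289.2205
   (2 vs 0)  |   379.1871   47.39422     8.00   0.000     286.2961    472.0781
   (3 vs 0)  |   1187.521   53.12349    22.35   0.000     1083.401    1291.641
-------------+----------------------------------------------------------------
POmean       |
      clevel |
          0  |   1013.281   40.03075    25.31   0.000     934.8222     1091.74
------------------------------------------------------------------------------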


The resulting point and interval estimates are not substantially different from 
the ATE estimates. A significant difference between ATE and ATET estimates 
would suggest that self-selection into treatment may be an important factor 
at work. 


To generate the POM for each category, we add the pomeans option to the
main command. 


. * RA POMs 
. teffects ra (drugexp $xlist, poisson) (clevel), pomeans nolog 
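
Treatment-effects estimation                   Number of obs      =      7,664
Estimator      : regression adjustment
Outcome model  : Poisson
Treatment model: none
------------------------------------------------------------------------------
             |               Robust
     drugexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
POmeans      |
      clevel |
          0  |   1065.266   39.98537    26.64   0.000     986.8966    1143.636
          1  |   1222.956   40.26935    30.37   0.000     1144.029    1301.882
          2  |   1431.064    28.2025    50.74   0.000     1375.788    1486.339
          3  |    2266.23   38.39453    59.02   0.000     2190.978    2341.482
------------------------------------------------------------------------------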


The pairwise comparisons reported above can also be obtained by 
generating the contrasts between levels of treatment, using the contrast 
command. 


The command contrast r.clevel generates a table of ATEs relative to 
the base-level category that can be either the default category or a category 
specified using the base() option. 


. * Contrasts of TEs after RA compared with base category
. contrast r.clevel, nowald
warning: cannot perform check for estimable functions.

Contrasts of marginal linear predictions

Margins: asbalanced

             |   Contrast   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
POmeans      |
      clevel |
   (1 vs 0)  |   157.6892   56.28316      47.37623    268.0022
   (2 vs 0)  |   365.7971   48.27266      271.1844    460.4097
   (3 vs 0)  |   1200.963     54.231      1094.672    1307.254


The alternative command contrast ar.clevel compares the ATE for 
each level of treatment with the adjacent level, which can be useful if the 
categories are ordered. 


. * Contrasts of TEs after RA compared with adjacent category
. contrast ar.clevel, nowald
warning: cannot perform check for estimable functions.

Contrasts of marginal linear predictions

Margins: asbalanced

             |   Contrast   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
POmeans      |
      clevel |
   (1 vs 0)  |   157.6892   56.28316      47.37623    268.0022
   (2 vs 1)  |   208.1079   48.70919      112.6396    303.5762
   (3 vs 2)  |   835.1662   46.80351       743.433    926.8994


The largest impact comes from moving from clevel 2 to 3, a move from 
Medigap to ESI insurance. 
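

As a quick arithmetic check, each contrast is simply the difference between two POMs. The sketch below uses only the POM estimates printed earlier in this section and reproduces the reported contrasts up to rounding of the printed values; it is an illustration, not part of the original analysis.

```stata
* Reference-category contrasts: POM(level k) - POM(level 0)
* (1 vs 0), reported as 157.6892
display 1222.956 - 1065.266
* (2 vs 0), reported as 365.7971
display 1431.064 - 1065.266
* (3 vs 0), reported as 1200.963
display 2266.23 - 1065.266

* Adjacent-category contrasts: POM(level k) - POM(level k-1)
* (2 vs 1), reported as 208.1079
display 1431.064 - 1222.956
* (3 vs 2), reported as 835.1662
display 2266.23 - 1431.064
```

The standard errors of the contrasts cannot be recovered this way because they depend on the covariances between the POM estimates.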


24.10.4 AIPW estimates 


The rationale for using the AIPW method in the MLT setting is the same as in 
the single-level case; see section 24.6.5. Because the treatment has more than 
two categories, a multinomial model is needed to predict probabilities for 
each category for each individual. AIPW requires a probability weight for each 
sample observation, and these predicted probabilities supply it. The teffects 
aipw command uses an MNL specification, which is the simplest model for 
unordered categories. 


For illustrative purposes, we treat insurance choice as exogenous, after 
controlling for insurance premia and the degree of market penetration of 
Medicare managed care in the individual’s region. 


. * Insurance regressors used for the MNL propensity scores
. global zlist prem1 prem2 prem3 prem4 penet

. describe $zlist

Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
prem1           double  %12.0g                Prem. of ESI w/ RX
prem2           double  %12.0g                Prem. of ESI w/o RX
prem3           double  %12.0g                Prem. of Medigap w/ RX
prem4           double  %12.0g                Prem. of Medigap w/o RX
penet           double  %12.0g                MMC penetration rate in county from
                                                previous year

. summarize $zlist

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       prem1 |      7,664    67.01624    46.26629          0        489
       prem2 |      7,664     92.2942    54.32717          0        354
       prem3 |      7,664      170.16    67.30359          0        600
       prem4 |      7,664    133.8445     38.1126          0        570
       penet |      7,664    11.42253    13.83415   .0039809      49.51


Before using teffects aipw, we fit the multinomial model using the 
mlogit command to check the predicted probabilities. 


. * Predicted probabilities from the MNL model
. qui mlogit clevel $zlist, base(0)

. predict pshat1 pshat2 pshat3 pshat4
(option pr assumed; predicted probabilities)

. summarize dmedicare dmmc dmedigap desi pshat1 pshat2 pshat3 pshat4, sep(4)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
   dmedicare |      7,664    .1375261    .3444244          0          1
        dmmc |      7,664    .1526618    .3596847          0          1
    dmedigap |      7,664    .3071503    .4613424          0          1
        desi |      7,664    .4026618    .4904658          0          1
-------------+---------------------------------------------------------
      pshat1 |      7,664    .1375261    .0409228   .0210833   .2081109
      pshat2 |      7,664    .1526618    .2060478   .0196888   .8223644
      pshat3 |      7,664    .3071503    .0962368   .0461926   .5731233
      pshat4 |      7,664    .4026618     .080105   .1045894   .5314694


The predicted probabilities on average equal the sample frequencies, a 
property of the MNL model. Some of the predicted probabilities are close 
to 0, giving some observations great weight, though nowhere near as close 
to 0 as the default tolerance of 0.00001 used by the teffects aipw 
command. 
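

One way to inspect overlap more directly is sketched below. It assumes the predicted probabilities pshat1 to pshat4 from the mlogit fit above are in memory and that clevel takes the values 0 to 3 in that order. The variable names pown and ipwgt are hypothetical, and the pstolerance() value simply restates the default; none of this is part of the original analysis.

```stata
* Probability of the treatment level actually received (pown is a hypothetical name)
generate double pown = pshat1*(clevel==0) + pshat2*(clevel==1) ///
    + pshat3*(clevel==2) + pshat4*(clevel==3)

* The inverse-probability weight is 1/pown; inspect how extreme the weights are
generate double ipwgt = 1/pown
summarize pown ipwgt, detail

* Fit the IPW model with the overlap tolerance stated explicitly (1e-5 is the default)
teffects ipw (drugexp) (clevel $zlist), pstolerance(1e-5)

* Plot the estimated densities of the predicted probability of receiving level 1
teffects overlap, ptlevel(1)
```

Raising pstolerance() makes the command stop with an error when any predicted probability falls below the chosen value, which can serve as a simple guard against observations that receive very large weights.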


The teffects aipw command yields the following ATE estimates: 


. * Augmented IPW ATE
. teffects aipw (drugexp $xlist, poisson) (clevel $zlist)

Iteration 0:  EE criterion = 3.550e-19
Iteration 1:  EE criterion = 2.808e-25
convergence not achieved
The Gauss-Newton stopping criterion has been met but missing standard errors
indicate some of the parameters are not identified.

Treatment-effects estimation                    Number of obs     =      7,664
Estimator      : augmented IPW
Outcome model  : Poisson by ML
Treatment model: (multinomial) logit

             |               Robust
     drugexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
ATE          |
      clevel |
   (1 vs 0)  |   246.6138   76.06505     3.24   0.001     97.52905    395.6986
   (2 vs 0)  |   370.0061   51.55025     7.18   0.000     268.9694    471.0427
   (3 vs 0)  |   1209.351     58.194    20.78   0.000     1095.293    1323.409
-------------+----------------------------------------------------------------
POmean       |
      clevel |
          0  |   1050.274   41.47196    25.32   0.000       968.99    1131.557

Warning: Convergence not achieved.


We check covariate balance following use of the inverse-probability 
weights. 


. * Check balance following use of the inverse-probability weights
. tebalance summarize

Covariate balance summary

                        Observations
       Treatment          Raw    Weighted
      0bn.clevel =      1,054     2,032.0
        1.clevel =      1,170     1,550.2
        2.clevel =      2,354     2,057.1
        3.clevel =      3,086     2,024.7

           Total =      7,664     7,664.0

               | Standardized differences         Variance ratio
               |        Raw    Weighted           Raw    Weighted
---------------+------------------------------------------------
1.clevel       |
         prem1 |   .0699667    .0278279      1.319427    1.146251
         prem2 |  -.0242137   -.0696167      1.351879    1.063149
         prem3 |   .2159224    .1015362       1.03429    1.250435
         prem4 |   .2360121    .0104144      1.510058    .8858702
         penet |   1.871311     .336323      1.228896    .7590858
---------------+------------------------------------------------
2.clevel       |
         prem1 |   .1023513   -.0015931      1.139751     .978487
         prem2 |   .0879926    .0065958      1.115772    1.093536
         prem3 |  -.0226479   -.0013236      .8979095    .9102739
         prem4 |   .0444557    .0283568      .8151792    .8676917
         penet |  -.0290098    .0342481      .9459218    1.078047
---------------+------------------------------------------------
3.clevel       |
         prem1 |    .020812    .0136956      1.068847    .9948594
         prem2 |   .1175974   -.0218133      1.274272    1.197675
         prem3 |  -.0082211   -.0063766      .9894477    .9738561
         prem4 |  -.0010937    .0082907      1.034581    1.043341
         penet |   .1328474   -.0105397      1.162845    .9723018


The first set of output indicates that the Medicare and MMC plan observations 
are substantially upweighted and the Medigap and ESI observations are 
substantially downweighted. The remaining output indicates that the 
weighted variance ratios differ from one, so a better model for the propensity 
scores is warranted. 


The output from the teffects aipw command indicates convergence 
problems. This problem disappears if we fit the simpler IPW model using 
command teffects ipw (drugexp) (clevel $zlist). We do not pursue this 
further. 


An extension that controls for potential endogeneity of the level of 
insurance in this application is given in section 25.3. 


24.11 Conditional quantile TEs 


The chapter has focused on estimating mean TEs. Quantile TEs were 
introduced in section 15.5. As noted there, to interpret the quantile TE as 
measuring the TE for a given individual, one needs to make the strong 
assumption of rank invariance or rank preservation that is discussed further 
in section 25.8.4. 


The community-contributed poparms command (Cattaneo, Drukker, and 
Holland 2013) calculates conditional quantile TEs, under the assumption of 
conditional independence, using inverse propensity-score weighting or using 
an efficient-influence-function estimator that includes nonparametric 
estimation of potential outcomes and is a doubly robust estimator 
qualitatively similar to AIPW. 


We apply this command to the current multilevel example; it also applies 
to simpler binary treatment examples. The presentation is brief, and we use 
program defaults; for details, see Cattaneo, Drukker, and Holland (2013). 


The following poparms command with the ipw option gives the same IPW 
estimates and standard errors of the ATEs as the teffects ipw command. For 
brevity, the output is omitted. 


. * IPW using community-contributed poparms gives same results as teffects ipw 
. poparms (clevel $zlist) (drugexp), ipw 


. contrast r.clevel, nowald 


. teffects ipw (drugexp) (clevel $zlist) 


Quantile TEs estimated by IPW can be obtained using the quantiles() 
option. We consider just the median, though simultaneous estimation for 
several quantiles is possible. The authors find that for the quantile estimates, 
bootstrap standard errors are more robust than analytical standard errors, 
though computational time can be considerable. We obtain 


. * IPW estimates at selected quantiles using community-contributed command poparms
. set seed 10101

. poparms (clevel $zlist) (drugexp $xlist), ipw quantiles(.50)
>     vce(bootstrap, reps(400))

Treatment Mean and Quantiles Estimation         Number of obs     =      7,664
(inverse probability weighting)

             |             bootstrap
     drugexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
mean         |
      clevel |
          0  |   983.4526   48.26087    20.38   0.000      888.863    1078.042
          1  |   1273.651   61.46666    20.72   0.000     1153.178    1394.123
          2  |   1439.942   35.29639    40.80   0.000     1370.762    1509.121
          3  |   2284.622    39.7769    57.44   0.000      2206.66    2362.583
-------------+----------------------------------------------------------------
q50          |
      clevel |
          0  |     575.99   48.78364    11.81   0.000     480.3758    671.6042
          1  |     833.01   49.42015    16.86   0.000     736.1483    929.8717
          2  |    1113.86   37.59761    29.63   0.000      1040.17     1187.55
          3  |    1753.56   35.89961    48.85   0.000     1683.198    1823.922


The output provides potential outcomes for the means automatically, in 
addition to the requested median potential outcomes. As expected, for right- 
skewed medical expenditure data, the median estimates are less than those 
for the mean. 


To then obtain TEs at the median, we apply the margins, pwcompare 
command to the results in the second reported equation. 


. * TEs at the median
. margins i.clevel, pwcompare predict(equation(#2))
warning: cannot perform check for estimable functions.

Pairwise comparisons of adjusted predictions             Number of obs = 7,664
Model VCE: bootstrap

Expression: Linear prediction, predict(equation(#2))

             |            Delta-method         Unadjusted
             |   Contrast   std. err.     [95% conf. interval]
-------------+------------------------------------------------
      clevel |
     1 vs 0  |     257.02   65.22931      129.1729    384.8671
     2 vs 0  |     537.87   62.16622      416.0264    659.7135
     3 vs 0  |    1177.57    61.5219      1056.989    1298.151
     2 vs 1  |     280.85   63.10141      157.1735    404.5265
     3 vs 1  |     920.55   59.07649      804.7623    1036.338
     3 vs 2  |   639.7001   51.60011      538.5657    740.8344


To instead obtain the efficient-influence-function doubly robust 
estimates, omit the ipw option from the poparms command. 


24.12 Additional resources 


The main command for TE estimation in linear and standard nonlinear 
models under the conditional independence assumption is teffects. Its 
variant eteffects and the ERM commands can handle endogenous 
regressors and tobit-type selection that is relevant when the data are 
censored. For matching estimators, the community-contributed programs 
pscore and att by Becker and Ichino (2002) include some features that are 
not available with the teffects commands. The community-contributed 
kmatch command (Jann 2017) supports multivariate-distance and PSM, 
including coarsened exact matching. The community-contributed poparms 
command (Cattaneo, Drukker, and Holland 2013) provides quantile TEs. 


Imbens and Rubin (2015) give a particularly comprehensive coverage of 
the methods presented in this chapter. Treatment evaluation methods are 
also covered in Cameron and Trivedi (2005, chap. 25), Cunningham (2021), 
and Glewwe and Todd (2022). Abadie and Cattaneo (2019) and Huber (2019) provide 
excellent surveys. Especially with observational data, one should not 
blindly use the teffects commands. Appropriate diagnostic checks such as 
those for covariate balance, propensity-score overlap, assessing the 
unconfoundedness assumption, and robustness of results should be 
performed. 


24.13 Exercises 


1. Consider the examples of size and power calculations given in 
section 24.3.4. Example 1 shows the sample sizes required when the 
significance level is fixed at 5% and required power is 0.80. 
Reproduce the table associated with this example using $\alpha = 0.10$ and 
power = 0.90. Explain and interpret the results. 


2. Suppose that a particular RCT is restricted by financial considerations to 
treatment and control groups of size 150. The control group’s 
pretreatment average outcome is 16, with standard deviation of 4. 
Required power is 0.9. What is the minimum-size TE consistent with 
this design? 


3. Consider the generated dataset and example of section 24.4.4. 
Manually implement the RA method of section 24.6.1 by two OLS 
regressions, compute predictions from these two models, and use these 
predictions to compute the POMs, ATE, and ATET. Verify that you obtain 
the same estimates as those from the command teffects ra (y x) (D), 
using appropriate options of that command. 


4. Inspect the covariate balance summary output generated by the Stata 
command tebalance summarize given in section 24.9.4. The last two 
lines of the output suggest that the samples may not be balanced with 
respect to those two regressors. Use Stata’s tebalance density 
postestimation diagnostic command to explore whether this is a 
potential problem before or after the IPW estimator is applied. Repeat 
the exercise after applying the IPW-RA estimator. 


5. The literature suggests that estimates of TE obtained using PSM may not 
be robust when the distribution of propensity scores in the treated and 
untreated groups significantly differs. Taking the specification of the 
model in section 24.9.6, repeat the TE calculation given there with the 
modification that all observations with propensity scores above 0.70 
or below 0.10 are excluded. Compare your results with those based 
on a recommended trimming design that would drop only the 
observations with propensity scores outside the (0.1, 0.9) range. Which 
TE estimates do you regard as most plausible? 


6. Trimming samples to achieve better balance is recommended as a 
method of obtaining more robust TE estimates. This recommendation 
also applies to ipw and ipwra methods because computational 
instability may result from having a significant number of observations 
with probability weights close to 0 or close to 1. Apply the sample 
trimming suggestions in the immediately preceding question to the ipw 
estimator used in section 24.9.2 and the ipwra estimator used in 
section 24.9.5. 


7. In this chapter, the focus was on the impact of a binary treatment 
variable (lottery) on out-of-pocket expenditures (oop). We will now 
designate enrollment into Medicaid, defined by the binary variable 
medicaid (which is related to lottery), as an alternative treatment 
variable. Enrollment into Medicaid is, strictly speaking, not 
exogenous, but we treat it as such for purposes of this exercise by 
making the assumption of selection on observables only. In the next 
chapter we cover methods (including the local average TE estimator) 
that are appropriate when the treatment variable is endogenous. Using 
the Stata commands from section 24.7 and the controls defined 
through the same macros $xlist $zlist, estimate the POM of 
Medicaid enrollees and of nonenrollees. For the RA model, compare the 
ATE estimate with the ATET estimate. What would you infer from the 
difference between the two estimates? 


8. Again estimate the ATE, but this time use IPW regression with control 
variables $zlist only. Is this estimate significantly different from the 
previous one obtained without applying IPW? How would you 
rationalize the difference (or lack of it) with the ATET estimate? What 
would you infer from the difference between the two estimates? 


9. Again estimate the ATE of medicaid, but this time use IPW regression 
and control variables $zlist only. Is this estimate significantly 
different from the previous one obtained without applying IPW? How 
would you rationalize the difference (or lack of it)? 


Chapter 25 
Endogenous treatment effects 


25.1 Introduction 


Chapter 24 presented standard methods for estimation of treatment effects 
(TEs) when the treatment can be viewed as exogenous after inclusion of any 
control variables. These methods rely on the conditional independence 
assumption of section 24.5 that treatment status is independent of potential 
outcomes after inclusion of control variables, an assumption also referred to 
as unconfoundedness, ignorability, or selection on observables. 


Many observational data applications are for settings where this 
assumption is not reasonable, in which case treatment status is called 
endogenous or nonignorable or selected on unobservables. For example, an 
individual may enter a training program only if it is expected to increase 
earnings sufficiently to cover the cost of training, but these individual 
expectations are not observed by the researcher and may not be fully 
captured by observable characteristics of the individual. Then the methods 
of chapter 24 lead to inconsistent estimates. 


This chapter presents several methods for endogenous treatment (ET) that 
vary in the strength of model assumptions and the types of data required. 
Many of the methods, though not all, assume the existence of an instrument 
that is correlated with treatment but is uncorrelated with potential outcomes 
after inclusion of regressors. 


We begin with a highly parametric approach in which normal 
distributions are specified for model errors. The TE is the same for all 
individuals (homogeneous effects) after controlling for regressors, but 
treatment is endogenous with a random component that is correlated with 
the random component of the outcome. This approach has been presented in 
several places in this book, including the linear simultaneous equation 
model in chapter 7, the selection model in chapter 19, and endogeneity in 
panel models and several nonlinear models. Despite this overlap, we 
provide here a unified treatment with emphasis on estimation of the effect 
of treatment. This highly parametric approach has the attraction of being 
applicable to models with outcome models that are either linear or nonlinear 
and treatments that are either discrete or continuous. Furthermore, Stata’s 
extended regression model (ERM) and ET commands implement this 
parametric approach for many models. 


The remainder of the chapter presents various less parametric “quasi- 
experimental” methods—namely, local average treatment effects (LATEs), 
the difference-in-differences (DID) estimator, and regression discontinuity 
(RD) design. The chapter concludes with some treatment extensions to 
quantile regression. All of these methods allow TEs to differ across 
individuals (heterogeneous effects). 


LATE provides a reinterpretation of instrumental variables (IV) and two- 
stage least-squares (2SLS) given heterogeneous TEs. The DID estimator, 
introduced in section 4.8, overcomes endogeneity in treatment by using 
richer panel data or repeated cross-sectional data and making the additional 
assumption of parallel trends so that the conditional independence 
assumption holds. RD design is applicable to a special type of endogeneity 
where treatment occurs when an observable continuous measure, one that 
directly affects the outcome, crosses a threshold. For example, a high 
enough test score may lead to treatment, yet the outcome of interest is also 
determined in part by the test score. Then treatment in the neighborhood of 
the threshold can be viewed as similar to a randomized controlled trial 
(RCT). 


Estimation using these methods to control for both endogeneity and 
heterogeneity is in many cases straightforward. LATE uses 2SLS, DID uses 
ordinary least squares (OLS), and RD design uses local polynomial regression 
or OLS. The difficulty is in understanding the underlying assumptions and 
the subtleties of the various methods and their extensions. Furthermore, this 
is an active area of research, the presentation here is introductory, and 
refinements to these methods are still being established. 


The methods of this chapter presume that the inclusion of observable 
variables is not enough to control for selection into treatment. Chapter 28 
presents machine learning methods that can lead to TE estimates that better 
control for selection into treatment, making the conditional independence 
assumption of chapter 24 more tenable. 


25.2 Parametric methods for endogenous treatment 


If treatment assignment is not ignorable, is the potential-outcome 
framework still appropriate? Is it meaningful to base causal inference on the 
framework of counterfactual causality? Extension of the causal parameter 
concept to nonrandomized observational settings is due to 
Rubin (1974, 1978). Endogeneity is a significant complication in 
estimation; however, once it has been handled using a suitable instrument, 
the treatment parameter is identifiable. Then, it is argued, one may continue 
to generate the hypothetical potential outcomes based on consistent 
estimation of the model parameters. Thus, the potential-outcome mean 
(POM) framework remains relevant even when treatment is endogenously 
chosen. 


Many commonly used models with ET consist of two equations, a linear 
outcome equation with treatment as an argument and a linear or nonlinear 
reduced-form treatment participation (or assignment) equation for the 
treatment variable that includes one or more instruments. Examples of this 
model structure appeared in chapters 7 and 19. If the random components of 
the two equations are correlated, then the treatment is said to be 
endogenous. IV estimation for the linear model was extensively dealt with in 
chapter 7. 


In this section, we focus on an empirically important variant of IV in 
which the ET variable D is discrete, rather than continuous, and the reduced 
form for D may be nonlinear. 


25.2.1 Canonical model 


We specify the outcome $y_i$ for the $i$th subject to be a linear function of 
observable variables $x_i$ and an endogenous binary treatment participation 
decision indicator $D_i$, with 


$$ y_i = \alpha D_i + x_i'\beta + u_i \tag{25.1} $$


Participation depends on the instrumental variable $z_i$ (which may be 
binary), 


$$ D_i^* = \gamma_1 z_i + x_i'\gamma_2 + v_i \tag{25.2} $$


$D_i^*$ is a latent variable that has observable counterpart $D_i$ generated by the 
observability condition 


$$ D_i = \begin{cases} 1 & \text{if } D_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \tag{25.3} $$


Richer models, also considered below, allow for nonlinear outcome 
models, interactions between treatment and the regressors in (25.1), an 
endogenous multivalued treatment, and more than one instrument. 


25.2.2 Assumptions 


We make the following assumptions in sections 25.2–25.4. 


1. The instrument $z$ appears in the $D$-equation and does not appear in the 
$y$-equation. This is referred to as an exclusion restriction. The 
instrument may be continuous or discrete and in a special case is 
binary. 

2. Conditional on $(x, z)$, $\mathrm{Cov}(z, v) = \mathrm{Cov}(u, z) = \mathrm{Cov}(x, u) = 0$, but 
$\mathrm{Cov}(D, u) \neq 0$ because $u$ and $v$ are correlated. Furthermore, treatment 
$D$ depends upon the instrument $z$ in a nontrivial fashion; to emphasize 
dependence of $D$ on $z$, we use the notation $D(z)$. 

3. Responses are homogeneous across subjects; that is, there is no 
randomness of the coefficient $\alpha$. 


These assumptions can be summarized by stating that the model has a 
recursive or triangular structure that does not permit simultaneous 
dependence between y and D. D impacts y, but there is no reverse 
feedback to D. Thus, causation runs from the instrument to the treatment 
and then to outcome. However, the causal connection from the instrument 
to the treatment has no role in the interpretation of TEs. Whereas some 
models would regard the instruments as the ultimate causal variable, and 
econometricians may carefully study the mechanism that connects it to the 
treatment variable, this connection is essentially ignored in the POM 
framework. 


We assume a linear probability model for $D$ or that the distribution of 
the random error $v_i$ is standard normal, which implies a probit model for the 
binary treatment $D$. Under these assumptions, $(\alpha, \beta)$ are identified, and $\alpha$ 
estimates both the average treatment effect (ATE) and the average treatment 
effect on the treated (ATET). These methods that control for $\mathrm{Cov}(D, u) \neq 0$ 
yield consistent estimates, whereas OLS estimates of (25.1) are biased. 


The key features of this model are maintained if the linear outcome 
(25.1) is replaced by a nonlinear equation such as a probit model for a 
binary outcome or by an ordered probit model for an ordered discrete 
outcome. In those cases, the errors in the extension to (25.1) and in (25.3) 
are assumed to be normally distributed. 


The outcome (25.1) can be generalized to a potential-outcomes or 
regression-adjustment model (see section 24.6.1) that interacts the ET $D$ 
with all regressors. Then, 


$$ y_i = \begin{cases} x_i'\beta_1 + u_{1i} & \text{if } D_i = 1 \\ x_i'\beta_0 + u_{0i} & \text{if } D_i = 0 \end{cases} $$


This specifies a restricted form of heterogeneous TEs. 


In extension of (25.3) to a multivalued treatment model, we check for 
block triangularity, the treatment equations being regarded as a single 
block. If the outcome model (25.1) is linear and there are additional 
nontreatment endogenous regressors, then it may be possible in some cases 
to achieve triangularity either by reordering of equations or by substituting 
out some endogenous regressors using their reduced form. 


25.2.3 Estimation methods 


Given the assumption of joint normality of model errors, the models can be 
fit by (full-information) maximum likelihood (ML). For more complex 
models, this can require numerical integration methods. 


In some cases, assumptions can be relaxed, typically by assuming 
normality of individual errors but not necessarily requiring joint normality. 
Given a probit model for binary treatment, a two-step method obtains an 
estimate of the inverse Mills ratio that is added as an auxiliary regressor in 
the outcome equation, similar to the two-step procedure used in selection 
models; see section 19.6.4. Alternatively, a control function approach 
calculates a residual from the fitted probit model and includes this as an 
additional regressor in the outcome equation. 
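

The estimators just described can be compared on simulated data. The following is a minimal sketch, not the book's application: the data-generating process, sample size, seed, and all parameter values are assumptions chosen for illustration (true $\alpha = 1$, error correlation 0.5), and the variable names xbhat and hazard are hypothetical. It uses the etregress command with its twostep and cfunction options and then reproduces the two-step point estimates by hand.

```stata
clear
set seed 10101
set obs 5000

* Regressor x, instrument z, and errors (u, v) with Corr(u, v) = 0.5
generate double x = rnormal()
generate double z = rnormal()
generate double v = rnormal()
generate double u = 0.5*v + sqrt(1 - 0.5^2)*rnormal()

* Treatment from (25.2)-(25.3) and outcome from (25.1) with alpha = 1
generate byte D = (0.5*z + 0.5*x + v > 0)
generate double y = 1*D + 1*x + u

* OLS ignores the correlation between D and u
regress y D x

* Full-information ML under joint normality
etregress y x, treat(D = z x)

* Two-step estimator and control-function estimator
etregress y x, treat(D = z x) twostep
etregress y x, treat(D = z x) cfunction

* Manual two-step: probit first stage, then add the hazard term to the outcome model
probit D z x
predict double xbhat, xb
generate double hazard = cond(D==1, normalden(xbhat)/normal(xbhat), ///
    -normalden(xbhat)/normal(-xbhat))
regress y D x hazard
```

In this design, the OLS coefficient of D should be pushed away from 1 because the errors are positively correlated, while the other estimators target the true value. The manual regression should match the twostep coefficients, but its reported standard errors ignore the first-stage estimation and should not be used for inference.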


25.2.4 Computing TEs 


In simple outcome models such as (25.1), the TE equals the estimated 
coefficient of the treatment variable. 


In more complex models that model separately each potential outcome 
or specify a nonlinear outcome model, the ATE or ATET can be computed by 
applying the margins command after estimation of the endogenous 
regressor model; see sections 13.7 and 17.6. This was illustrated in the 
context of binary outcome models (ivprobit), selection models (heckman), 
and event count models (nonlinear IV or gmm) in their respective chapters; 
see sections 17.6, 19.6, and 20.7. And the margins command provides 
additional flexibility in defining different counterfactuals that go beyond the 
standard ATE calculation; see section 13.7.5 for an example. 


For some nonlinear models, the margins command may not be 
available. Provided, however, that predict or predictnl commands, or 
both, can be used postestimation, the TEs can be estimated by processing 
output generated by the predict command; see section 25.4.5 for an 
example. Because the predict command can be conditioned on specific 
combinations of variables for specific subpopulations if so desired, it is a 
particularly convenient and flexible method for generating ATEs as well as 
distributions of TEs in both linear and nonlinear models. 


The next two sections present Stata ERM and ET commands that fit 
various models with ET and conveniently compute the subsequent ATE and 
ATET. 


25.3 ERM commands for endogenous treatment 


Estimates of TEs in models with endogenous (or exogenous) treatments can 
be obtained using Stata’s ERM family of commands described in section 23.7. 
These commands are applicable to the linear model for outcomes, using the 
eregress command, and to the probit, ordered probit, and interval regression 
models that are based on a latent variable model that is linear with a 
normally distributed error. 


25.3.1 The ERM commands 


The ERM commands are based on the assumption that the joint distribution of 
errors in outcome and treatment equations is a multivariate normal 
distribution. A unique feature of the multivariate normal distribution is that it 
can be expressed as a product of the conditional normal (for the outcome) 
and marginal normal (for the treatment). Adding triangularity or recursivity 
as an additional feature facilitates efficient computation of the model. 
Estimation is by ML. 


An example of a model that can be handled using eregress is a 
continuous outcome model with a binary treatment. To respect bivariate 
normality, we use the probit specification of the treatment equation and a 
linear specification of the outcome equation. By contrast, a logit 
specification of the treatment equation will not work, because the latent 
variable formulation of a logit model has a logistically distributed error. 


We consider the application of the suite of ERM commands for 
regression-based treatment evaluation in the presence of the following 
additional complications that may occur individually or jointly: outcome y in 
(25.1) depends on another endogenous variable (y2) that may be binary or 
continuous; treatment (t1) is endogenous; treatment may be binary or 
multilevel. Table 25.1 presents a flexible range of models and options, 
including, for example, the presence of ET as well as other endogenous 
variables. In all examples, it is assumed that errors are jointly normal. 


Table 25.1. Selected ERM commands when treatment is endogenous 


Examples of Stata-extended commands and optional subcommands 


Linear regression with a continuous ET 
    eregress y x, endogenous(t1 = z x) 

Linear regression with a continuous endogenous regressor and ET 
    eregress y1 x, endogenous(y2 = x z1) endogenous(t1 = z3 x) 

Linear regression with a continuous endogenous regressor and a binary ET 
    eregress y1 x, endogenous(y2 = x z1) entreat(t1 = z3 x) 

Linear regression with a continuous endogenous regressor and a multivalued treatment 
    eregress y1 x, endogenous(y2 = x z1) entreat(t2 = z3 x) 

Linear regression with a continuous endogenous regressor, multivalued treatment, and selection 
    eregress y1 x, endogenous(y2 = x z1) entreat(t2 = z3 x) select(s1 = w x) 


The first two examples in table 25.1 are a linear model with continuous 
endogenous treatment variable t1 and instrument z that is excluded from 
the outcome equation for y. The second example adds a continuous 
endogenous regressor y2 to the first model. 


The next three examples again have a continuous outcome, but now the 
ET is discrete. In the third example, the treatment is binary and by default is 
modeled by a probit model given the assumption of normal errors. In the 
fourth example, the treatment is multilevel and by default is modeled by an 
ordered probit model given the assumption of normal errors. An example is 
three levels of health insurance calibrated in terms of the generosity of 
coverage. 


The eprobit and eoprobit commands for binary and ordered discrete 
outcomes with endogenous regressors or ET effects, or both, have syntax that 
is essentially the same as for extended linear regressions. The extended 
interval regression (eintreg) command is used for tobit-type selection 
models. 


The estat teffects postestimation command provides the ATE estimate 
as default; options provide the estimated ATET and the POMs. If the model is 
fit with option vce(robust), rather than with default standard errors, then 
estat teffects gives unconditional standard errors that additionally allow 
for variation in the regressors; see section 13.7.9. 
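

The workflow can be sketched on simulated data as follows. The data-generating process, seed, and parameter values are assumptions made purely for illustration; the variable names follow the style of table 25.1. The sketch fits a linear outcome model with a binary ET by eregress and then requests the ATE, ATET, and POMs from estat teffects.

```stata
clear
set seed 10101
set obs 5000

* Simulated regressor x, instrument z, and correlated normal errors (u, v)
generate double x = rnormal()
generate double z = rnormal()
generate double v = rnormal()
generate double u = 0.5*v + sqrt(0.75)*rnormal()

* Binary ET generated by a probit-type rule; linear outcome with true TE of 1
generate byte t1 = (0.5*z + 0.5*x + v > 0)
generate double y = 1*t1 + 1*x + u

* Linear outcome with a binary ET; vce(robust) so that estat teffects
* reports unconditional standard errors
eregress y x, entreat(t1 = z x) vce(robust)

* ATE (the default), then ATET, then POMs
estat teffects
estat teffects, atet
estat teffects, pomean
```

By default, the entreat() specification lets the outcome coefficients differ by treatment level, so the fitted model is more flexible than the homogeneous-effect design used to simulate the data.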


25.3.2 Interpretation of ET effects 


The impact of a change in an endogenous variable, including a treatment 
variable, is subtly different from that of an exogenous variable. For an 
exogenous variable, the ATE is defined as the average change in the outcome 
due to a unit change in the regressor, holding all other regressors constant. In 
the linear model (25.1), the coefficient of the regressor is the ATE. 


When the regressor is endogenous, however, its impact includes the effect
of the individual-specific random-error term and, in that sense, is the
total effect of the regressor and of the idiosyncratic error v_i in (25.2).
Averaging over all impacted observations requires also averaging over the
random component. Such averaging is interpreted as application of the
average structural function; for details, see Blundell and Powell (2003) and
Wooldridge (2010, 24–25). The resulting impact parameter is then called
either the average structural mean or, in a binary outcome model, the 
average structural probability. In a linear outcome model with zero-mean 
errors, such averaging does not change the usual interpretation of the 
coefficient. In a nonlinear model, such as a probit outcome model, the 
average structural function is nonlinear, and averaging involves numerical 
integration. Thus, even though the model specifies the TE to be the same for 
all individuals if treatment was exogenous, the endogeneity leads to 
heterogeneous effects in the case of a nonlinear outcome model. 
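
The averaging described above can be sketched in symbols. This is an illustration only: the notation ASF(d, x) and the probit outcome equation with a unit-variance error are assumptions made here for concreteness, while the linear case uses the outcome equation (25.1).

```latex
\begin{align*}
% Average structural function: hold treatment at d and regressors at x,
% integrate over the marginal distribution of the outcome error u
\mathrm{ASF}(d,\mathbf{x}) &= \int g(d,\mathbf{x},u)\,dF_u(u) \\
% Linear outcome, g = \alpha d + x'\beta + u with E(u) = 0
\text{linear:}\quad \mathrm{ASF}(1,\mathbf{x}) - \mathrm{ASF}(0,\mathbf{x}) &= \alpha \\
% Probit outcome (assumed), g = 1(\alpha d + x'\beta + u > 0), u ~ N(0,1)
\text{probit:}\quad \mathrm{ASF}(1,\mathbf{x}) - \mathrm{ASF}(0,\mathbf{x})
  &= \Phi(\alpha + \mathbf{x}'\boldsymbol{\beta}) - \Phi(\mathbf{x}'\boldsymbol{\beta})
\end{align*}
```

In the linear case the difference is the coefficient itself. In the probit case it depends on where the index lies, so an aggregate effect has to be obtained by averaging the difference over the sample.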


25.3.3 Endogenous multivalued TEs application 


We apply the eregress command to the prescription drugs expenditure with 
the multilevel treatment example previously analyzed in section 24.10, 
where now the multilevel treatment is allowed to be endogenous. ML is used 
to fit the joint model of outcome and treatment. This involves placing 
additional structure on the specified model. 


Because joint normality is required, the outcome variable is log drug
expenditures (ldrugexp), and a log-linear outcome model is specified, rather
than analyzing the level of drug expenditures by Poisson regression as in
section 24.10. The treatment variable is level of health insurance (clevel),
which takes three discrete values (in addition to basic Medicare) that are
now treated as endogenous. An ordered probit model is used to model the
choice of health insurance, where endogeneity of insurance choice is
modeled by allowing for error correlation and by instruments that affect
insurance choice but not drug expenditures.


The objective of the analysis is to estimate the ATE and ATET for each type 
of treatment. The assumption of recursiveness between treatment and 
outcomes is maintained but also extended by allowing for presence of 
unobserved correlated random factors that affect both the choice of the 
treatment clevel and outcome ldrugexp. 


The base level of health insurance is basic Medicare, which offers no
prescription drug coverage and serves as the baseline counterfactual. The
other levels are MMC (Medicare managed care), Medigap (Medicare
supplemental insurance), and ESI (employer-sponsored insurance).


The following code drops 7% of the sample with 0 annual drug
expenditure and creates the insurance variable (clevel) that is ordered by
increasing levels of insurance coverage.


. * Read in data, drop zero drug spending, create ordered multivalued treatment
. qui use mus224mcbs

. generate drugexp = aamttot

. drop if drugexp == 0
(508 observations deleted)

. qui generate ldrugexp = ln(drugexp)

. qui replace income_c = income_c/1000

. qui tabulate coverage, generate(inslevel)

. qui summarize inslevel*

. qui generate clevel = 0

. qui replace clevel=0 if inslevel3==1 // Base Medicare

. qui replace clevel=1 if inslevel2==1 // MMC (Medicare managed plan)

. qui replace clevel=2 if inslevel4==1 // Medigap (Medicare suppl ins)

. qui replace clevel=3 if inslevel1==1 // ESI (Employer sponsored)


We then verify that indeed the outcome increases on average with the
level of treatment, very substantially for levels 2 and 3 compared with levels
0 and 1. The subsequent analysis will control for individual characteristics
and possible endogeneity of insurance choice.


. * eregress example: log drug expenditure by insurance type
. regress ldrugexp ibn.clevel, noheader noconstant

    ldrugexp   Coefficient   Std. err.       t    P>|t|    [95% conf. interval]
      clevel
           0      6.465467    .0402077   160.80   0.000     6.386648    6.544286
           1      6.561296    .0359018   182.76   0.000     6.490918    6.631674
           2      6.890744    .0253979   271.31   0.000     6.840956    6.940531
           3      7.306901    .0219283   333.22   0.000     7.263915    7.349887


There are two sets of regressors. The first set (xlist) consists of
individual socioeconomic characteristics that are included in both the
outcome equation and the treatment choice equation. The second set (zlist)
consists of regressors (instruments) that are excluded from the outcome
equation. These are insurance premiums (prem1–prem4) for available plans
and a variable (penet) reflecting the competition (market penetration) in the
local insurance market in which the plan is purchased.


. * eregress: Summary stats for outcome, exogenous variables, and instruments
. global xlist h_age h_male income_c genhelth // Exogenous in both

. global zlist prem1 prem2 prem3 prem4 penet // Instruments

. summarize ldrugexp $xlist $zlist

    Variable        Obs        Mean    Std. dev.        Min         Max
    ldrugexp      7,156    6.959769    1.236787           0    10.91388
       h_age      7,156    77.30604    7.237295          65         104
      h_male      7,156    .4491336    .4974406           0           1
    income_c      7,156    30.19316    47.08639           0        2000
    genhelth      7,156    2.665735    1.053193           1           5
       prem1      7,156    67.07746    46.14253           0         489
       prem2      7,156    92.14425    54.41507           0         354
       prem3      7,156    169.9314    67.17174           0         600
       prem4      7,156    133.9728    37.32712           0         570
       penet      7,156    11.43595    13.85367    .0039809       49.51


The eregress command with option entreat(clevel) recognizes that
clevel takes multiple discrete values, fits a separate potential-outcome
model for each level of treatment, and uses an ordered probit model for the
treatments, with treatment choice endogenous.


. * eregress: Endogenous multivalued treatment and regression-adjustment model
. eregress ldrugexp $xlist, entreat(clevel = $xlist $zlist) vce(robust) nolog

Extended linear regression                          Number of obs =     7,156
                                                    Wald chi2(20) = 270515.14
Log pseudolikelihood = -20024.711                   Prob > chi2   =    0.0000

                                   Robust
                   Coefficient   std. err.       z    P>|z|    [95% conf. interval]
ldrugexp
  clevel#c.h_age
               0      .0042307    .0057991     0.73   0.466    -.0071353    .0155968
               1      .0035396    .0053761     0.66   0.510    -.0069975    .0140766
               2      .0048934    .0031718     1.54   0.123    -.0013231      .01111
               3      .0006181    .0032107     0.19   0.847    -.0056747    .0069109
 clevel#c.h_male
               0     -.1954782    .0912511    -2.14   0.032    -.3743271   -.0166293
               1      -.160039    .0871775    -1.84   0.066    -.3309038    .0108257
               2      -.077942    .0575325    -1.35   0.175    -.1907035    .0348196
               3     -.1556137    .0428371    -3.63   0.000    -.2395729   -.0716545
clevel#c.income_c
               0      .0022249    .0017465     1.27   0.203    -.0011981    .0056479
               1     -.0003385    .0021949    -0.15   0.877    -.0046405    .0039635
               2       .000125    .0018354     0.07   0.946    -.0034723    .0037222
               3      -.000383    .0006372    -0.60   0.548     -.001632    .0008659
clevel#c.genhelth
               0       .336533    .0383245     8.78   0.000     .2614183    .4116477
               1      .3608828      .03661     9.86   0.000     .2891284    .4326371
               2      .2831774    .0228076    12.42   0.000     .2384752    .3278795
               3      .2948639    .0201427    14.64   0.000      .255385    .3343428
          clevel
               0       4.80542    .7802745     6.16   0.000     3.276111     6.33473
               1      5.252532    .4904076    10.71   0.000      4.29135    6.213713
               2      5.717317    .2771843    20.63   0.000     5.174046    6.260588
               3      6.812011    .3253975    20.93   0.000     6.174244    7.449779

clevel
           h_age     -.0013482    .0018959    -0.71   0.477    -.0050641    .0023677
          h_male     -.0998152    .0287945    -3.47   0.001    -.1562514   -.0433789
        income_c      .0054154    .0015643     3.46   0.001     .0023494    .0084814
        genhelth     -.0078822    .0132423    -0.60   0.552    -.0338366    .0180723
           prem1      .0001451    .0002809     0.52   0.605    -.0004055    .0006957
           prem2      .0009054    .0002899     3.12   0.002     .0003371    .0014737
           prem3      -.000038    .0002012    -0.19   0.850    -.0004324    .0003565
           prem4     -.0006851    .0004436    -1.54   0.123    -.0015546    .0001844
           penet     -.0137505    .0011206   -12.27   0.000    -.0159469   -.0115541

/clevel
            cut1     -1.387702    .1778247                     -1.736232   -1.039172
            cut2     -.7940095    .1783897                     -1.143647   -.4443721
            cut3      .0619699    .1789969                     -.2888576    .4127974

 var(e.ldrugexp)      1.372352    .1577232                      1.095564     1.71907

  corr(e.clevel,
     e.ldrugexp)     -.2350591    .2832739    -0.83   0.407    -.6789779    .3347248


The first set of output gives the fitted coefficients for the outcome model at 
each level of insurance. For example, the coefficient of age in the outcome 
equation fitted for individuals with clevel = 2 is 0.00489, so for those 
individuals, drug expenditures increase by approximately 0.5% with each 
year of aging. The second set of output gives the ordered probit fitted 
regression coefficients and the three cutoffs. 


The last line shows that the correlation between the two equation errors
is negative and is statistically insignificant at the 5% level. Negative
correlation means that unobserved random factors that increase the
probability of a higher level of insurance actually tend to decrease
expenditure. This is not consistent with adverse selection. It is
consistent with, for example, unobserved risk aversion that leads to both
choice of more generous health insurance and healthier habits.


Next we use the estat teffects command to estimate ATE for the three 
insurance levels relative to no insurance beyond basic Medicare. 


. * eregress: ATE for endogenous multivalued treatment with unconditional
> standard errors
. estat teffects, ate

Predictive margins                                  Number of obs = 7,156

                           Unconditional
                   Margin     std. err.      z    P>|z|    [95% conf. interval]
ATE
      clevel
   (1 vs 0)       .397106     .3890177     1.02   0.307    -.3653546    1.159567
   (2 vs 0)      .8102799     .5270496     1.54   0.124    -.2227183    1.843278
   (3 vs 0)      1.555395     .9227738     1.69   0.092    -.2532084    3.363998


For MMC, Medigap, and ESI, the estimated ATEs are 0.397, 0.810, and 1.555
log points. On the usual approximation, drug expenditures are therefore
estimated to be, respectively, 39.7%, 81.0%, and 155.5% higher than for
those in Medicare only. The results for ATET, obtained using option atet,
are within one percent of the ATE estimates.
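
Because the outcome is in logs, reading a difference of b log points as a 100b percent change is an approximation that becomes rough when b is large. The following sketch, which is not part of the original analysis, converts the three ATE estimates reported above into the exact proportional change 100{exp(b) - 1}. It ignores any retransformation issues in moving from the log scale to the level of expenditure.

```stata
* Exact percentage change implied by a difference of b log points
* (values of b copied from the estat teffects output above)
foreach b in .397106 .8102799 1.555395 {
    display "log points = " %9.6f `b' "   exact pct change = " %8.1f 100*(exp(`b') - 1)
}
```

For small b the two readings nearly coincide; for differences as large as those estimated here, the exact proportional change is noticeably larger than 100b.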


The preceding estimates are based on model estimation using robust 
standard errors. If instead we use default standard errors in estimation we 
obtain the same ATE estimates but the standard errors are considerably 
smaller. 


. * eregress: ATE following regression with default standard errors had
> smaller standard errors
. qui eregress ldrugexp $xlist, entreat(clevel = $xlist $zlist) nolog

. estat teffects, ate

Predictive margins                                  Number of obs = 7,156
Model VCE: OIM

                           Delta-method
                   Margin     std. err.      z    P>|z|    [95% conf. interval]
ATE
      clevel
   (1 vs 0)       .397106     .1897812     2.09   0.036     .0251416    .7690704
   (2 vs 0)      .8102799      .254762     3.18   0.001     .3109556    1.309604
   (3 vs 0)      1.555395      .445699     3.49   0.000      .681841    2.428949

Note: Standard errors treat sample covariate values as fixed and
      not a draw from the population. If your interest is in
      population rather than sample effects, refit your model
      using vce(robust).


The output from estat teffects includes a statement that the standard
errors are specific to the sample, which is the usual approach in applied
microeconometrics studies. For this example, however, the larger standard
errors in the initially provided ATE estimates are due to the robust standard
errors for the error variance and covariance being larger, rather than due to
allowing covariate values to be a draw from the population.


The preceding potential-outcomes model specifies that the mean
outcome for those with insurance level $j$ is
$y_j = \alpha_j + \mathbf{x}'\boldsymbol{\beta}_j$, $j = 0, 1, 2, 3$. A
less flexible model that restricts $\boldsymbol{\beta}_j = \boldsymbol{\beta}$
and specifies that
$y_i = \sum_{j=0}^{3} \alpha_j D_{ji} + \mathbf{x}_i'\boldsymbol{\beta}$
can be estimated using the nointeract option of entreat(). In that case, the
fitted coefficients $\widehat{\alpha}_j$ and their standard errors
are equivalent to the estimated ATE and ATET and their standard errors. We
obtain


. * eregress: Endogenous multivalued treatment & simpler no interaction model
. qui eregress ldrugexp $xlist, entreat(clevel = $xlist $zlist, nointeract)
> vce(robust)

. estat teffects, ate

Predictive margins                                  Number of obs = 7,156

                           Unconditional
                   Margin     std. err.      z    P>|z|    [95% conf. interval]
ATE
      clevel
   (1 vs 0)      .5264812     .3014017     1.75   0.081    -.0642552    1.117218
   (2 vs 0)      .9794175     .4106409     2.39   0.017     .1745761    1.784259
   (3 vs 0)      1.833951     .7217249     2.54   0.011     .4193963    3.248506


The TEs are roughly 18% to 33% larger than in the potential-outcomes variant
of this model (0.526 versus 0.397, 0.979 versus 0.810, and 1.834 versus
1.555). The more restricted model leads to greater precision because the
standard errors are approximately 22% smaller.


The reader is reminded that these results are obtained using a dataset that
excludes 508 sample individuals who reported zero expenditure, which
potentially creates a tobit-type selection problem; this can be tackled by
adding the select() option to the eregress command.


The eregress command is flexible. It can simultaneously allow for 
endogenous (nontreatment) regressors, ET, or exogenous treatments. The key 
restrictions are the recursive structure and normal distribution assumption. If 
a specified model is not recursive, it can be made recursive (or triangular) 
using some “tricks” of triangulation; see the [ERM] Stata Extended 
Regression Models Reference Manual, “How to triangularize a system of 
equations.” 


25.4 ET commands for binary endogenous treatment 


The ET commands are an alternative to the ERM commands. These commands
are restricted to binary treatment modeled by a probit model. The three 
commands are etregress for a linear outcome model, etpoisson for an 
exponential conditional mean outcome model, and eteffects for a range of 
outcome models summarized below. 


25.4.1 The etregress command 


The etregress command has the following syntax: 


etregress depvar [indepvars], treat(depvar_t = indepvars_t) [options]

where the subscript t denotes the treatment equation variables.


A simple application is etregress y x, treat(D = x z), which applies
to the canonical model in section 25.2.1. The poutcomes option allows the
outcome model errors to vary with treatment status [see (25.4)], specifying a
trivariate normal distribution for the errors $u_{0i}$, $u_{1i}$, and $v_i$.
To allow interaction between the treatment variable and the outcome
regressors, one needs to explicitly define the interactions, for example,
etregress y i.D#(c.x1 c.x2), treat(D = c.x1 c.x2 z).


Three estimation methods are available. 


The default option is ML estimation based on joint normality of the errors
$u_i$ and $v_i$ in (25.1) and (25.2). This yields exactly the same result as
the eregress command.


The option twostep delivers estimates based on a two-step estimator in
which a probit model is fit first, an estimate of the inverse Mills ratio
(“lambda”) or hazard of receiving treatment is computed for each
observation, and at the second stage, the outcome equation is reestimated
with “lambda”
added as an auxiliary variable. This two-step estimator relaxes the joint 
normality assumption of the errors and parallels the so-called heckit 
procedure used in selection models; see section 19.6.4. 


The option cfunction yields the same estimates as option twostep but
does so by stacking all equations and fitting the model by (just-identified)
generalized method of moments, similar to the linear control function
example in section 13.3.11.
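
The auxiliary variable used by the twostep and cfunction estimators can be written out as follows. This is a sketch rather than the documented formula: $\mathbf{w}_i$ (the treatment-equation regressors), $\boldsymbol{\gamma}$ (their probit coefficients), and $h_i$ (the hazard) are symbols introduced here for illustration. The expression follows from normality of $v_i$ with unit variance and linearity of $E(u_i|v_i)$, both of which joint normality implies.

```latex
\begin{align*}
h_i &=
\begin{cases}
 \phi(\mathbf{w}_i'\widehat{\boldsymbol{\gamma}}) \big/ \Phi(\mathbf{w}_i'\widehat{\boldsymbol{\gamma}})
   & \text{if } D_i = 1 \\[4pt]
 -\phi(\mathbf{w}_i'\widehat{\boldsymbol{\gamma}}) \big/ \{1 - \Phi(\mathbf{w}_i'\widehat{\boldsymbol{\gamma}})\}
   & \text{if } D_i = 0
\end{cases} \\
E(y_i \mid D_i, \mathbf{x}_i, \mathbf{w}_i)
  &= \alpha D_i + \mathbf{x}_i'\boldsymbol{\beta} + \rho\sigma\, h_i
\end{align*}
```

The coefficient on $h_i$ in the second-stage regression is therefore an estimate of $\rho\sigma$, the quantity labeled lambda in the command output.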


Note that if the treatment equation is expressed as a linear probability
model, rather than a probit model, then Stata’s ivregress 2sls command
yields an alternative 2SLS estimate of the TE. The margins postestimation
command yields various TE estimates following estimation of the model.
For the margins command, option predict(cte) gives the ATE, and the
additional option subpop() enables estimation of the ATET. The option
predict(yctrt) gives E(y|treated), option predict(ycntrt) gives
E(y|untreated), and option predict(ptrt) gives the average probability of
treatment. If model estimates are obtained using option vce(robust), rather
than with default standard errors, then margins gives unconditional standard
errors that additionally allow for variation in the regressors; see
section 13.7.9.
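
The three estimation methods and the margins options can be tried on simulated data. The sketch below is not from the book: the data-generating process, the variable names, and the true TE of 2 are all hypothetical choices made for illustration, with the errors drawn as bivariate normal so that the ML assumptions hold.

```stata
* Simulated data (hypothetical DGP): true TE is 2 and corr(u, v) = 0.6
clear
set seed 10101
set obs 5000
generate x = rnormal()
generate z = rnormal()                            // excluded instrument
generate v = rnormal()
generate u = 0.6*v + sqrt(1 - 0.6^2)*rnormal()
generate D = (0.5*x + z + v > 0)                  // endogenous binary treatment
generate y = 1 + 2*D + x + u

* ML estimator, followed by ATE and ATET from margins
etregress y x, treat(D = x z) vce(robust) nolog
margins, predict(cte)
margins, predict(cte) subpop(if D==1)

* Two-step and control-function estimators of the same model
etregress y x, treat(D = x z) twostep
etregress y x, treat(D = x z) cfunction vce(robust)
```

Because treatment enters the simulated outcome equation without interactions, the ATE and ATET reported by margins should both coincide with the fitted coefficient of 1.D.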


25.4.2 Endogenous binary treatment application 


The dataset is a 2003 extract from the Medical Expenditure Panel Survey of
individuals over the age of 65 years analyzed in section 7.4.2. The outcome
equation has the dependent variable ldrugexp, the log of total out-of-pocket
expenditures on prescribed medications. The binary ET variable is an
indicator for whether the individual holds either employer- or
union-sponsored health insurance (hi_empunion). Other regressors are number
of chronic conditions (totchr) and four sociodemographic variables: age in
years (age), dummy variables for female (female) and black or Hispanic
(blhisp), and the natural logarithm of annual household income in
thousands of dollars (linc). As an instrument for insurance (hi_empunion),
we use a measure of affordability: the ratio of social security income to
total income (ssiratio).


Supplementary insurance for drug coverage is a choice variable for the
near universal Medicare insurance for the elderly. Conditional correlation
between hi_empunion and ldrugexp could arise if those who expect high
future drug expenses are more likely to choose a job that provides
supplementary health insurance upon retirement.


The model fit is identical to that in section 25.2.1, with treatment
appearing as a single regressor, rather than as a regression-adjustment model
with interaction with all regressors. The ML estimates, along with the
first-stage probit estimates that indicate that the chosen instrument is
robust, are as follows.


. * etregress: ML for endogenous binary treatment (insurance)
. qui use mus207mepspresdrugs, clear

. global x2list totchr age female blhisp linc

. etregress ldrugexp $x2list, treat(hi_empunion = $x2list ssiratio)
> vce(robust) first nolog

Probit regression                                   Number of obs = 10,068
                                                    LR chi2(6)    = 829.98
                                                    Prob > chi2   = 0.0000
Log likelihood = -6282.5099                         Pseudo R2     = 0.0620

  hi_empunion   Coefficient   Std. err.       z    P>|z|    [95% conf. interval]
       totchr      .0370418    .0101307     3.66   0.000     .0171861    .0568976
          age     -.0240092    .0020265   -11.85   0.000      -.027981   -.0200374
       female     -.2015789    .0264087    -7.63   0.000      -.253339   -.1498189
       blhisp     -.1847738    .0367088    -5.03   0.000     -.2567217   -.1128259
         linc      .1216865    .0157349     7.73   0.000      .0908468    .1525263
     ssiratio     -.6169911     .041986   -14.70   0.000      -.699282   -.5347001
        _cons      1.552571    .1621934     9.57   0.000      1.234678    1.870464

Linear regression with endogenous treatment         Number of obs = 10,068
Estimator: Maximum likelihood                       Wald chi2(6)  = 1876.40
Log pseudolikelihood = -22654.396                   Prob > chi2   = 0.0000

                                Robust
                Coefficient   std. err.       z    P>|z|    [95% conf. interval]
ldrugexp
        totchr     .4550432    .0108265    42.03   0.000     .4338237    .4762627
           age    -.0179835    .0024374    -7.38   0.000    -.0227606   -.0132064
        female      -.06001    .0303896    -1.97   0.048    -.1195725   -.0004475
        blhisp    -.2479348     .039701    -6.25   0.000    -.3257474   -.1701222
          linc     .1267941    .0186249     6.81   0.000       .09029    .1632981
 1.hi_empunion    -1.384199    .1029986   -13.44   0.000    -1.586072   -1.182325
         _cons     7.240257    .2033296    35.61   0.000     6.841738    7.638776
hi_empunion
        totchr     .0405104    .0100773     4.02   0.000     .0207593    .0602616
           age    -.0242543    .0020402   -11.89   0.000    -.0282531   -.0202556
        female    -.1924384    .0262529    -7.33   0.000    -.2438931   -.1409837
        blhisp    -.1928805    .0358713    -5.38   0.000    -.2631871    -.122574
          linc     .1278699    .0161179     7.93   0.000     .0962794    .1594604
      ssiratio    -.5506714    .0400378   -13.75   0.000     -.629144   -.4721987
         _cons     1.508993    .1649894     9.15   0.000      1.18562    1.832367

       /athrho     .7639572    .0612615    12.47   0.000     .6438869    .8840276
      /lnsigma     .3460358    .0197529    17.52   0.000     .3073209    .3847507

           rho     .6434019    .0359013                      .5675403    .7084313
         sigma     1.413453    .0279197                      1.359777    1.469248
        lambda     .9094185    .0672652                      .7775812    1.041256

Wald test of indep. eqns. (rho = 0): chi2(1) = 155.51   Prob > chi2 = 0.0000


The estimated correlation rho between equation errors is 0.643 and is very
precisely estimated; the Wald chi-squared statistic for test of zero error
correlation is 155.51. The output includes sigma, the estimated standard
deviation of the outcome equation error, and a quantity lambda, which is
simply rho times sigma.
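
The reported rho, sigma, and lambda are transformations of the ancillary parameters /athrho and /lnsigma that are actually estimated. As a quick check, the following sketch recomputes them from the values printed in the output above; the scalar names are arbitrary.

```stata
* Recover rho, sigma, and lambda from the ancillary parameters
scalar athrho  = .7639572        // /athrho copied from the etregress output
scalar lnsigma = .3460358        // /lnsigma copied from the etregress output
scalar rho     = tanh(athrho)    // inverse of the atanh transformation
scalar sigma   = exp(lnsigma)
scalar lambda  = rho*sigma
display "rho = " rho "   sigma = " sigma "   lambda = " lambda
```

Estimating the transformed parameters keeps rho inside (-1, 1) and sigma positive without imposing constraints during maximization.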


Because the outcome equation is linear, the ATE and ATET are identical
and equal -1.384, the estimated coefficient of
hi_empunion. For illustrative purposes, we nonetheless use the margins
postestimation command to obtain the average marginal effect (AME).


. * etregress: ATE of insurance status following ML estimation
. margins, predict(cte)

Predictive margins                                  Number of obs = 10,068
Model VCE: Robust

Expression: Conditional treatment effect, predict(cte)

                          Delta-method
                  Margin    std. err.       z    P>|z|    [95% conf. interval]
       _cons   -1.384199    .1029986   -13.44   0.000    -1.586072   -1.182325


If the model specification included interaction terms involving the 
endogenous dummy variable and other regressors, as in (25.4), then both ATE 
and ATET would be different from the fitted slope coefficient. 


The cfunction option yields the following estimates of the ATE and ATET. 


. * etregress: Control function estimator for endog binary treatment (insurance)
. qui etregress ldrugexp $x2list, treat(hi_empunion = $x2list ssiratio)
> cfunction vce(robust)

. margins, predict(cte)

Predictive margins                                  Number of obs = 10,068
Model VCE: Robust

Expression: Conditional treatment effect, predict(cte)

                          Delta-method
                  Margin    std. err.       z    P>|z|    [95% conf. interval]
       _cons   -.8711352    .1824591    -4.77   0.000    -1.228748   -.5135219


The estimated TE is -0.871, substantially smaller in absolute value than
the -1.384 estimated by ML. The estimated confidence interval is also wider.
The estimator uses an estimated value of the inverse Mills ratio or hazard
function from the probit regression as an additional variable to control for
endogeneity. From output not included, the coefficient of this variable is
highly significant, again confirming endogeneity of the insurance variable.


The twostep option leads to the same parameter estimates as the
cfunction option; the twostep option provides only default standard errors,
while cfunction can additionally yield heteroskedasticity-robust standard
errors.


The following example fits a richer model that interacts treatment status
with control variables and uses option poutcomes to allow the outcome
model errors to vary with treatment status. The subsequent margins
command includes the subpop() option to obtain the ATET.


. * etregress: More flexible potential-outcomes model and ATET
. global x3list c.totchr c.age i.female i.blhisp c.linc

. qui etregress ldrugexp i.hi_empunion#($x3list),
> treat(hi_empunion = $x3list ssiratio) vce(robust) nolog poutcomes

. margins, predict(cte) subpop(if hi_empunion==1)

Predictive margins                                  Number of obs   = 10,068
Model VCE: Robust                                   Subpop. no. obs =  3,850

Expression: Conditional treatment effect, predict(cte)

                          Delta-method
                  Margin    std. err.       z    P>|z|    [95% conf. interval]
       _cons   -2.453818    .0502733   -48.81   0.000    -2.552352   -2.355284


The ATET equals -2.454.


25.4.3 The etpoisson command 


The etpoisson command has essentially the same syntax as the etregress 
command. 


The model is the same as the canonical model (25.1)–(25.3) in
section 25.2.1, except that (25.1) is replaced by

$$ y_i = \exp(\alpha D_i + \mathbf{x}_i'\boldsymbol{\beta} + u_i) \qquad (25.5) $$

The error terms $u_i$ in (25.5) and $v_i$ in (25.2) are assumed to be jointly
normal distributed with correlation $\rho$. The model can be fit by ML. Using
properties of the normal distribution, one can find analytical expressions
for the ATE and ATET from this model.


While the command is named etpoisson, it is more precisely an ET
command for an outcome with exponential conditional mean. If y is actually a
count, the model is still suitable because it has the same exponential
conditional mean as a Poisson regression model and controls for
overdispersion because $E\{\exp(u_i)\} > 1$ if $u_i$ is normal with mean
zero, leading to
$\mathrm{Var}(y_i|D_i, \mathbf{x}_i) > E(y_i|D_i, \mathbf{x}_i)$.
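
The overdispersion claim can be verified with a short derivation. The sketch below rests on two assumptions that are made here only for illustration: $y_i$ is Poisson distributed conditional on $(D_i, \mathbf{x}_i, u_i)$, and $u_i \sim N(0, \sigma^2)$ independently of $(D_i, \mathbf{x}_i)$, so the endogeneity of $D_i$ is set aside. The symbol $\mu_i$ is shorthand introduced for the derivation.

```latex
\begin{align*}
\mu_i &\equiv \exp(\alpha D_i + \mathbf{x}_i'\boldsymbol{\beta}), \qquad
E(y_i \mid D_i, \mathbf{x}_i, u_i) = \mathrm{Var}(y_i \mid D_i, \mathbf{x}_i, u_i) = \mu_i e^{u_i} \\
E(e^{u_i}) &= e^{\sigma^2/2} > 1, \qquad
\mathrm{Var}(e^{u_i}) = e^{\sigma^2}\,(e^{\sigma^2} - 1) \\
E(y_i \mid D_i, \mathbf{x}_i) &= \mu_i\, e^{\sigma^2/2} \\
\mathrm{Var}(y_i \mid D_i, \mathbf{x}_i)
 &= E(\mu_i e^{u_i}) + \mathrm{Var}(\mu_i e^{u_i})
  = \mu_i\, e^{\sigma^2/2} + \mu_i^2\, e^{\sigma^2}(e^{\sigma^2} - 1)
  > E(y_i \mid D_i, \mathbf{x}_i)
\end{align*}
```

The last line uses the law of total variance. The second term is strictly positive whenever $\sigma^2 > 0$, which is the source of the overdispersion.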


25.4.4 Application to count data with ET 


We apply the etpoisson command to the number of doctor visits data from 
the U.S. Medical Expenditure Panel Survey 2003, which was analyzed in 
chapter 10. 


The Poisson regression has dependent variable docvis and endogenous 
regressor private (insurance) with exogenous regressors female, chronic 
(number of chronic conditions), and age (in years divided by 10). 


For this illustration, we work with a random sample of 2,000 
observations and truncate the number of doctor visits at 50. The data 
summary follows. 


. * etpoisson sample: Subsample of young MEPS sample
. qui use mus210mepsdocvisyoung, clear

. set seed 10101

. qui sample 2000, count

. keep if docvis < 50
(7 observations deleted)

. regress docvis private, vce(robust) noheader

                             Robust
      docvis   Coefficient  std. err.       t    P>|t|    [95% conf. interval]
     private       2.59169   .2744089     9.44   0.000     2.053531    3.129849
       _cons      1.415042   .2222057     6.37   0.000     .9792618    1.850822


The average number of doctor visits is 1.415 for those without private 
insurance and is 2.592 higher for those with private insurance. Without 
control variables, the AME or ATE is the slope estimate 2.592. 


We first obtain Poisson regression estimates under the assumption that 
private is exogenous; only the marginal effect (ME) is reported. 


. * poisson: AME = ATE of exogenous binary treatment using margins and finite
> diffs
. qui poisson docvis i.private i.chronic i.female age, vce(robust)

. margins, dydx(i.private)

Average marginal effects                            Number of obs = 1,993
Model VCE: Robust

Expression: Predicted number of events, predict()
dy/dx wrt:  1.private

                          Delta-method
                   dy/dx    std. err.       z    P>|z|    [95% conf. interval]
   1.private    2.316279    .2860775     8.10   0.000     1.755577     2.87698

Note: dy/dx for factor levels is the discrete change from the base level.


The AME is 2.316 following Poisson regression with control variables 
included and is highly statistically significant. 


Next we assume endogeneity of private and specify as instrument 
income (annual income in thousands of dollars). This example is purely 
illustrative because in practice income can be expected to have a direct effect 
on the number of doctor visits. 


. * etpoisson: ML for endogenous binary treatment
. etpoisson docvis i.chronic i.female age,
> treat(private = i.chronic i.female age income) vce(robust) nolog

Poisson regression with endogenous treatment        Number of obs = 1,993
(24 quadrature points)                              Wald chi2(4)  = 1066.08
Log pseudolikelihood = -5085.0789                   Prob > chi2   = 0.0000

                             Robust
               Coefficient   std. err.       z    P>|z|    [95% conf. interval]
docvis
   1.chronic      1.360064    .1121328    12.13   0.000     1.140287     1.57984
    1.female      .6409438    .0750489     8.54   0.000     .4938508    .7880369
         age      .1144535    .0526475     2.17   0.030     .0112663    .2176407
   1.private      1.431485     .131891    10.85   0.000     1.172983    1.689986
       _cons     -2.086742    .1580321   -13.20   0.000     -2.39648   -1.777005
private
   1.chronic       .229521    .0794046     2.89   0.004     .0738909    .3851511
    1.female      .3199315    .0729008     4.39   0.000     .1770486    .4628143
         age      .0795119    .0356403     2.23   0.026     .0096581    .1493657
      income      .0319134    .0044206     7.22   0.000     .0232492    .0405775
       _cons     -.4277604    .1745324    -2.45   0.014    -.7698376   -.0856833

     /athrho     -.1764639    .0700389    -2.52   0.012    -.3137376   -.0391903
    /lnsigma      .1740968    .0200664     8.68   0.000     .1347673    .2134264

         rho     -.1746548    .0679024                     -.3038335   -.0391702
       sigma      1.190171    .0238825                      1.144271    1.237912

Wald test of indep. eqns. (rho = 0): chi2(1) = 6.35     Prob > chi2 = 0.0118


The correlation coefficient (rho in the table) for the errors is -0.17 and is
statistically significant at the 5% level, providing evidence that private
should be treated as an endogenous variable.


We can compute the associated ATE using the margins command. 


. * etpoisson: ATE following ML for endogenous binary treatment
. margins, predict(cte) vce(unconditional)

Predictive margins                                  Number of obs = 1,993

Expression: Conditional treatment effect, predict(cte)

                          Unconditional
                  Margin    std. err.       z    P>|z|    [95% conf. interval]
       _cons    3.754155    .3672285    10.22   0.000     3.034401     4.47391


The implied ATE is 3.754 visits compared with 2.316 in the Poisson model 
that treated private insurance as exogenous. 


The ATET is obtained by restricting attention to those with private 
insurance. 


. * etpoisson: ATET following ML for endogenous binary treatment
. margins, predict(cte) subpop(if private==1) vce(unconditional)

Predictive margins                                  Number of obs   = 1,993
                                                    Subpop. no. obs = 1,634

Expression: Conditional treatment effect, predict(cte)

                          Unconditional
                  Margin    std. err.       z    P>|z|    [95% conf. interval]
       _cons    3.678841     .301032    12.22   0.000     3.088829    4.268853


25.4.5 Manual computation of TEs 


When the margins command is not available as a postestimation command, 
the predict command can be used to calculate the TE “manually” as follows. 


Consider an arbitrary variable $x$. Consider two possible values of
interest, say, $x^*$ and $x^* + \delta$. Using the predict command, we
evaluate an in-sample prediction of the outcome variable $y$ at two points
$y(x = x^*)$ and $y(x = x^* + \delta)$. The ME is then defined by
$\{y(x = x^* + \delta) - y(x = x^*)\}/\delta$.
Averaging this over the full (or the treated) sample yields an estimate of ATE
(or ATET).


As an illustration, we apply this method to compute the ATE in the current 
application. 


. * etpoisson: Manual computation of ATE using predict command
. qui etpoisson docvis i.chronic i.female age,
> treat(private = i.chronic i.female age income) vce(robust) nolog

. preserve

. qui replace private=1 if private==0

. qui predict docvis1 if private==1, pomean

. qui replace private=0 if private==1

. qui predict docvis0 if private==0, pomean

. generate ate = docvis1 - docvis0

. summarize ate docvis1 docvis0

    Variable        Obs        Mean    Std. dev.        Min         Max
         ate      1,993    3.751386    3.228703    1.068299    12.34738
     docvis1      1,993    4.929249    4.242453    1.403725    16.22422
     docvis0      1,993    1.177863    1.013751    .3354255    3.876841

. restore


The ATE estimate of 3.751 differs slightly from 3.754 obtained earlier using
the command margins, predict(cte) but is identical to the estimate using
the command margins, r.private. A bootstrap will yield the standard error
of the ATE estimate.


25.4.6 The eteffects command 


The eteffects command specifies separate potential outcomes for the 
treated and untreated, where the model has a nonlinear conditional mean and 
an additive error. The command has the following syntax: 


eteffects (ovar omvarlist [, omodel noconstant])
    (tvar tmvarlist [, noconstant]) [if] [in] [weight] [, stat options]


where ovar is the dependent variable of the outcome model, omvarlist is the 
list of exogenous indepvars in the outcome model, tvar is the binary ET 
variable, and tmvarlist is the list of covariates that predict treatment 
assignment. 


The outcome model can be linear, probit, exponential mean, or fractional
probit. The stat options are ate, atet, or pomeans.


For example, consider an exponential conditional mean. Then
$y_{1i} = \exp(\mathbf{x}_i'\boldsymbol{\beta}_1) + u_{1i}$ and
$y_{0i} = \exp(\mathbf{x}_i'\boldsymbol{\beta}_0) + u_{0i}$. Estimation is
by a control function or residual augmentation approach. A probit model for
the ET is specified, and the fitted residual $\widehat{v}_i$ is computed and
is added as a regressor in the potential-outcome models. We then obtain OLS
estimates of
$y_{1i} = \exp(\mathbf{x}_i'\boldsymbol{\beta}_1 + \gamma_1\widehat{v}_i) + u_{1i}$
and
$y_{0i} = \exp(\mathbf{x}_i'\boldsymbol{\beta}_0 + \gamma_0\widehat{v}_i) + u_{0i}$.


25.5 The LATE estimator for heterogeneous effects 


In a classic RCT, the TE can be heterogeneous, varying from individual to
individual, because estimating the ATE requires computation only of
$\bar{y}_1$ and $\bar{y}_0$. By contrast, the canonical model (25.1)–(25.3)
specifies the same homogeneous TE $\alpha$ for all individuals. The
regression-adjustment or potential-outcomes model (25.4) does allow for a
limited form of heterogeneity. In the
remainder of this chapter, we focus on methods that estimate aggregate TEs 
such as ATE when individual TEs are heterogeneous and only limited 
restrictions are placed on the nature of this heterogeneity. 


In this section, we present a leading example, the LATE estimator. The
basic IV or 2SLS estimate $\widehat{\alpha}$ is usually interpreted as
estimating the same TE for all individuals. LATE provides a reinterpretation
when TEs are heterogeneous.


We provide a very brief summary of LATE. For further details, see Angrist, 
Imbens, and Rubin (1996), a lengthy presentation in Angrist and 
Pischke (2009, chap. 4), and, for example, Cameron and Trivedi (2005, 
chap. 25.7). 


25.5.1 LATE with binary treatment and binary instrument 


We consider a binary outcome y, binary treatment D, binary instrument z,
and no additional control variables. A setting where this is particularly 
appropriate is where z equals one if randomly assigned to treatment in an 
RCT, D is the treatment indicator, but not all individuals assigned to treatment 
actually get the treatment. 


We consider heterogeneous TEs. The effect of treatment on the outcome is
$\Delta y/\Delta D$, which can be decomposed as $(\Delta y/\Delta z)/(\Delta D/\Delta z)$, leading to the
TE

$$\mathrm{TE}_{\mathrm{Wald}} = \frac{E(y|z = 1) - E(y|z = 0)}{E(D|z = 1) - E(D|z = 0)} \tag{25.6}$$


This is the population analog of the Wald estimator. $\mathrm{TE}_{\mathrm{Wald}}$ can be shown to
equal $\mathrm{Cov}(z, y)/\mathrm{Cov}(z, D)$; see, for example, Angrist and Pischke (2009,
127). So $\mathrm{TE}_{\mathrm{Wald}}$ can be consistently estimated by IV regression of y on D
with instrument z. (These results do not restrict D to be binary.)


The IV estimator is usually interpreted as being the estimate of $\alpha$ in the
homogeneous TEs model $y_i = \delta + \alpha D_i + u_i$. In what follows, we simplify
the expression in (25.6) when TEs are heterogeneous.


For simplicity, throughout section 25.5 we consider the case where z = 1 
on average pushes individuals toward treatment (D = 1) rather than toward 
nontreatment. Given binary D and z, the literature distinguishes between four 
types of individuals that exhaust all possible types. A complier may switch 
toward treatment if z = 1, a defier may switch away from treatment if z = 1, 
an always-taker always receives treatment regardless of the value of z, and a 
never-taker is always untreated regardless of the value of z. Table 25.2 
provides formal definitions. 


Table 25.2. LATE: The four types of individuals 


Type                 Definition
Compliers (C)        D = 1 when z = 1 and D = 0 when z = 0
Defiers (D)          D = 0 when z = 1 and D = 1 when z = 0
Always-takers (A)    D = 1 regardless of the value of z
Never-takers (N)     D = 0 regardless of the value of z


Now consider the numerator in (25.6). Given the four possible types of 
individuals defined in table 25.2, we have 


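$$\begin{aligned}
E(y|z = 1) - E(y|z = 0) = {} & \Pr(C)\{E(y|z = 1, C) - E(y|z = 0, C)\} \\
& + \Pr(D)\{E(y|z = 1, D) - E(y|z = 0, D)\} \\
& + \Pr(A)\{E(y|z = 1, A) - E(y|z = 0, A)\} \\
& + \Pr(N)\{E(y|z = 1, N) - E(y|z = 0, N)\}
\end{aligned}$$

where the terms for always-takers and never-takers equal zero by definition.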
So the numerator confounds compliers and defiers. 


A key component of the LATE interpretation of IV is to assume that there
are no defiers. Then the numerator simplifies to a term involving only
compliers,

$$E(y|z = 1) - E(y|z = 0) = \Pr(C)\{E(y|z = 1, C) - E(y|z = 0, C)\} \tag{25.7}$$


Furthermore, given no defiers, only compliers change to treatment status as z 
changes, so for binary D we have that the denominator in (25.6) equals the 
probability of being a complier, 


$$E(D|z = 1) - E(D|z = 0) = \Pr(C) \tag{25.8}$$


Substituting (25.7) and (25.8) into (25.6) and canceling the common term 
Pr(C) yields 


$$\mathrm{TE}_{\mathrm{Wald}} = E(y|z = 1, C) - E(y|z = 0, C)$$


The Wald estimand, which can be consistently estimated by IV, therefore
equals the TE for compliers. 
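
The result can be checked numerically. The sketch below uses a hypothetical DGP (the type shares, the TEs of 5 for always-takers and 2 for compliers, and all variable names are assumptions for illustration) with no defiers. It computes the Wald ratio from four sample means and compares it with the IV estimate; the difference in treatment means across z also estimates the share of compliers, set to 0.5 here.

```stata
* Hypothetical DGP with compliers, always-takers, and never-takers (no defiers)
clear all
set seed 10101
set obs 10000
generate byte   z    = runiform() < 0.5     // randomly assigned instrument
generate double u    = runiform()
* type: 1 = always-taker (20%), 2 = never-taker (30%), 3 = complier (50%)
generate byte   type = cond(u < 0.2, 1, cond(u < 0.5, 2, 3))
generate byte   D    = cond(type == 1, 1, cond(type == 2, 0, z))
* Heterogeneous TEs: 5 for always-takers, 2 for compliers
generate double te   = cond(type == 1, 5, 2)
generate double y    = 1 + te*D + rnormal()

* Wald estimate: ratio of differences in means across z
quietly summarize y if z == 1
scalar ybar1 = r(mean)
quietly summarize y if z == 0
scalar ybar0 = r(mean)
quietly summarize D if z == 1
scalar Dbar1 = r(mean)
quietly summarize D if z == 0
scalar Dbar0 = r(mean)
scalar wald   = (ybar1 - ybar0)/(Dbar1 - Dbar0)
scalar shareC = Dbar1 - Dbar0               // estimate of Pr(C)

* IV with a single binary instrument and no controls gives the same number
quietly ivregress 2sls y (D = z)
display "Wald = " wald "   IV = " _b[D] "   share of compliers = " shareC
```

Both estimates should be close to the complier TE of 2 rather than to the always-taker TE of 5 or to a population average.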


25.5.2 Assumptions for LATE 


The assumptions underlying the LATE estimator are a combination of those for 
the potential-outcomes framework, those for IV estimation, and the
assumption of no defiers. 


Index potential outcomes $y_0$ and $y_1$ by treatment status $D = 0, 1$, and
index potential treatment status $D_0$ and $D_1$ by the instrument $z = 0, 1$. We
make the following assumptions:


1. Independence: $(y_1, y_0, D_1, D_0)$ is jointly independent of $z$.
2. Exclusion: $z$ does not directly determine potential outcomes $y_1$ and $y_0$.
3. First stage: $E(D_1) \neq E(D_0)$.
4. Monotonicity: $\Pr(D_1 \geq D_0) = 1$.


The fourth assumption is replaced by $\Pr(D_0 \geq D_1) = 1$ if instead $z = 1$
on average pushes individuals toward nontreatment ($D = 0$). The assumption
rules out defiers. It is called the monotonicity assumption because for 
nonbinary instrument z, it generalizes to the assumption that the probability 
of treatment is nondecreasing in z for all individuals. 


Assumption three is straightforward to check from data, but the remaining 
assumptions are at best only partly testable. Some justification for these 
assumptions should be provided in any application. 


25.5.3 Further discussion of LATE 


LATE estimates a TE for compliers. One cannot identify whether an individual 
is a complier, so the ability to generalize LATE estimates is limited. However, 
one can identify the fraction of the sample that is compliers. Given 
monotonicity and the binary setup, (25.8) gives the expression for Pr(C), 
which can be estimated by first-stage regression of treatment D on the 
instrument z. 


LATE does not give the ATET, because the treated group includes both 
compliers and always-takers. Similarly, LATE does not give the ATE for the 
untreated, because this group includes never-takers. 


In practice, there may be more than one binary instrument. In that case, IV
with different instruments leads to different LATE estimates because different
instruments lead to different groups of compliers. The 2SLS estimate using all
instruments at once equals a weighted average of the individual LATE
estimates.


The analysis so far does not include covariates. If the instrument is 
randomly assigned with all individuals having the same probability that z = 1,
then there is no need for covariates. In other settings, the assumptions in
section 25.5.2 may be reasonable only after conditioning on additional 
variables x. Adding such variables may also improve estimator efficiency. 


When the regressors x take only a few distinct values, one can estimate 
by IV where the model for outcome y and the first-stage model for treatment
D are fully saturated models with full sets of dummies for each discrete value 
of x. The resulting estimated coefficient of D in the outcome equation is a 
weighted average of covariate-specific LATE estimates. 


Abadie (2003) considers estimation of $E(y_i|D_i, \mathbf{x}_i, D_{1i} > D_{0i})$, the
conditional mean of the outcome given treatment status and control variables
with attention restricted to compliers. Compliers cannot be individually
identified, but they can be in expectation. Using a linear model for the
conditional mean as an approximation, Abadie (2003) proposes the weighted
least-squares estimator that minimizes

$$S(\alpha, \boldsymbol{\beta}) = \sum_{i=1}^{N} \kappa_i (y_i - \alpha D_i - \mathbf{x}_i'\boldsymbol{\beta})^2$$

where the $\kappa_i$ are inverse-probability weights, called kappa weights, that are
consistent estimates of

$$\kappa_i = 1 - \frac{D_i\{1 - E(z_i|D_i, y_i, \mathbf{x}_i)\}}{1 - \Pr(z_i = 1|\mathbf{x}_i)} - \frac{(1 - D_i)E(z_i|D_i, y_i, \mathbf{x}_i)}{\Pr(z_i = 1|\mathbf{x}_i)} \tag{25.9}$$


The rationale for the weights is that given the monotonicity assumption, there
are no defiers, so that we are left with compliers after subtracting away
always-takers, who have $D_{0i} = 1$ so $D_i(1 - z_i) = 1$, and never-takers, who
have $D_{1i} = 0$ so $(1 - D_i)z_i = 1$. The expectation $E(z_i|D_i, y_i, \mathbf{x}_i)$, rather
than the binary $z_i$, is used in (25.9) to ensure that the weights $\kappa_i$ are positive.


Angrist and Imbens (2005) consider generalization to treatment D that 
takes several values. Then the 2SLS estimator is a weighted average of per-unit
average causal effects for compliers. 


25.5.4 Application of LATE 


The LATE framework is especially applicable in RCTs where the offer of
treatment (z) is randomly assigned but actual receipt of treatment (D) is
voluntary. Then z is clearly a valid instrument for D and is a strong
instrument unless exceptionally few people take up the offer of treatment. 


We return to the Oregon Health Insurance Experiment studied in 
sections 24.8–24.9. A simple model regressed out-of-pocket expenditure
(oop) on whether one wins or loses the lottery (lottery) and some lottery
controls (xlist). It was found that winning the lottery was associated with 
lower out-of-pocket expenditure, an estimated effect that is an intention-to- 
treat effect. 


Here we instead consider the effect of actual treatment, where the 
treatment variable medicaid equals one if the person actually enrolls in 
Medicaid. We have 


. * LATE: Treatment D is Medicaid and instrument z is lottery win/lose 
. qui use mus224ohiesmallrecode, clear 


. label variable medicaid "Enrolled in Medicaid" 


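. tabulate medicaid lottery, cell nokey

             |  Selected in the
 Enrolled in |      lottery
    Medicaid | Not selec   Selected |     Total
-------------+----------------------+----------
Not enrolled |     9,873      6,428 |    16,301
             |     43.53      28.34 |     71.88
-------------+----------------------+----------
    Enrolled |     1,530      4,848 |     6,378
             |      6.75      21.38 |     28.12
-------------+----------------------+----------
       Total |    11,403     11,276 |    22,679
             |     50.28      49.72 |    100.00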


Here 49.7% of the sample was selected by the lottery, and 28.1% enrolled in
Medicaid. Note that 6.75% of the sample was in Medicaid, even though they 
were not selected in the lottery. The monotonicity assumption of no defiers 
means that all of these people are assumed to be always-takers. 


More than one lottery occurred, and lotteries varied with some household 
characteristics. These variables are included as controls in the global xlist. 
We therefore regress oop on medicaid and controls xlist, where medicaid
equals one if the person actually enrolls in Medicaid. The LATE estimate is
obtained by 2SLS estimation with lottery as an instrument for medicaid. We
obtain the following estimates. 


. * LATE: Treatment is Medicaid and instrument is lottery win/lose 
. global y oop // Outcome variable for this chapter 


. global xlist dhhsize2 dhhsize3 dlotdraw* dsurvdraw* // x variables for treat 
. qui regress $y lottery $xlist, vce(cluster household_id) 

. estimates store intent 

. qui regress $y medicaid $xlist, vce(cluster household_id) 

. estimates store ols 

. qui regress medicaid lottery $xlist, vce(cluster household_id) 

. estimates store first 

. qui ivregress 2sls $y (medicaid = lottery) $xlist, vce(cluster household_id) 
. estimates store iv 


. estimates table intent ols first iv, keep(medicaid lottery) 
> b(%10.3f) se stat(N r2) 


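    Variable |    intent         ols       first          iv
-------------+----------------------------------------------
    medicaid |              -170.331                -137.074
             |                 9.177                  33.703
     lottery |   -40.924                   0.299
             |    10.130                   0.006
-------------+----------------------------------------------
           N |     22679       22679       22679       22679
          r2 |     0.002       0.012       0.114       0.011
------------------------------------------------------------
                                                Legend: b/se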


The first column of the output gives the intention-to-treat effect, replicating 
the section 24.9 result that being selected by the lottery leads to a $40.92 
reduction in out-of-pocket expenditures. The second column provides the 
potentially inconsistent OLS estimate of a $170.33 reduction in out-of-pocket 
expenditures for those enrolled in Medicaid. The third column gives the first- 
stage regression estimate and implies that the complier group is 29.9% of the 
sample. 


The final column shows the preferred LATE estimate of a $137.07 
reduction in out-of-pocket expenditures, with a standard error of $33.70. This 
is a very large TE because the sample average of oop is $269.01. When the 
control variables are dropped, the LATE estimate becomes −151.01 with a
standard error of 33.33. 


In summary, the causal effect of Medicaid enrollment is to reduce out-of- 
pocket expenditures on average by $137 for the 30% of the sample that is 
compliers. 
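
As a consistency check on the table of estimates: the model is just identified (one instrument for one endogenous regressor, with the same controls in every equation), so the 2SLS coefficient equals the intention-to-treat coefficient divided by the first-stage coefficient. The symbols $\widehat{\pi}_{\mathrm{ITT}}$ and $\widehat{\pi}_{\mathrm{first}}$ are notation introduced here for illustration. Using the rounded values reported above,

```latex
\widehat{\alpha}_{\mathrm{IV}}
  = \frac{\widehat{\pi}_{\mathrm{ITT}}}{\widehat{\pi}_{\mathrm{first}}}
  \approx \frac{-40.924}{0.299}
  \approx -136.87
```

The small gap from the reported −137.074 arises because the first-stage coefficient is displayed to only three decimal places.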


25.5.5 Marginal TEs 


The LATE framework has been extended to define the marginal treatment 
effect (MTE) function; see, for example, Heckman and Vytlacil (2005). 


A treatment selection equation $D = \mathbf{1}(\gamma z + \mathbf{x}'\boldsymbol{\gamma}_x + v > 0)$ can be
rewritten as $D = \mathbf{1}\{p(z, \mathbf{x}) > U_D\}$, where $p(\cdot)$ is the propensity score
$\Pr(D = 1|z, \mathbf{x})$ and $U_D$ is uniformly distributed on (0, 1). $U_D$ is interpreted as an
individual's unobserved (after conditioning on z and x) propensity or
indifference or resistance to treatment.


The MTE(x, u) is then the ATE conditional on x, as usual, and additionally 
on $U_D = u$.


The community-contributed mtefe command (Andresen 2018) allows 
estimation of the distribution of various TEs fitting both parametric and 
semiparametric MTE models. 


The MTE can be decomposed as the difference in two potential-outcome
functions: $E(y_1|\mathbf{x}, u)$ and $E(y_0|\mathbf{x}, u)$. The community-contributed
mtebinary command (Kowalski, Tran, and Ristovska 2016) uses this 
distinction to enable finer distinction between always-takers, compliers, and 
never-takers. 


There is a tension between a structural approach that requires strong 
assumptions and quasi-experimental approaches such as LATE that require 
fewer assumptions but are not as generalizable. Heckman (2010) critiques 
LATE using the MTE framework. Imbens (2010) provides a response that details 
what can be learned from quasi-experimental methods such as LATE. 


25.6 Difference-in-differences and synthetic control 


The fixed-effects (FE) estimator for linear panel-data models controls for 
endogeneity by assuming that the error can be decomposed as $\alpha_i + u_{it}$ and
that the endogenous regressor is correlated only with the time-invariant
component $\alpha_i$. Mean-differencing or first-differencing eliminates $\alpha_i$ and
enables consistent estimation. 


This approach requires repeated measures for each individual over time. 
The DID method does not require repeated measures but does add the 
assumption of parallel trends; see the lengthy introductory presentation in 
section 4.8. In this section, we summarize some extensions. 


25.6.1 DID 


A common application of the DID method is to settings where individual-level
data are available over time but the binary treatment (D) occurs at a
more aggregate level, such as state or village.


Then we consider the two-way fixed-effects model 


$$y_{ist} = \phi_s + \gamma_t + \alpha D_{st} + \mathbf{x}_{ist}'\boldsymbol{\beta} + u_{ist}, \quad i = 1, \dots, N \tag{25.10}$$


where s denotes state, t denotes time, i denotes an individual who, unlike the
panel case, is observed only once (that is, only in one particular state and
time period), and $\mathbf{x}_{ist}$ are exogenous control variables. The parallel trends
assumption is that separate state fixed effects $\phi_s$ and time fixed effects $\gamma_t$ are
adequate, rather than more general state-time fixed effects. Interest lies in
estimating the ATET parameter $\alpha$.


OLS estimation of (25.10) using the regress command is straightforward
with inference based on standard errors clustered at the state level.
Alternatively, the didregress command, introduced in Stata 17, fits this
model and some variants and additionally provides some diagnostics and
better methods to compute standard errors when there are few clusters (here
states).


The didregress command has syntax 


didregress (ovar omvarlist) (tvar[, continuous]) [if] [in] [weight],
    group(groupvars) [time(timevar) options]


Here ovar is the outcome, omvarlist are control variables, tvar is the 
treatment variable that can be binary (the default) or continuous (option 
continuous), groupvars are the grouping variables, and timevar is the time 
variable (should there be one). 


For the model (25.10), there is a single group variable (state) and a time 
variable. An alternative DID example may use contrast across two distinct 
groups, such as state and gender, rather than state and time. Then two 
groupvars are provided, and the time() option is dropped. For difference in
difference in differences, an additional groupvar is provided.
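
As an illustration of the syntax, the following minimal sketch fits model (25.10) on simulated repeated cross-section data. Everything about the DGP is an assumption made for this example: 40 states, 10 years, 25 individuals per state-year cell, states 1 to 20 treated from year 6 onward, and a true ATET of 2. The regress and didregress point estimates of the coefficient of D should agree; the reported standard errors can differ slightly because of degrees-of-freedom adjustments.

```stata
* Hypothetical repeated cross-section: 40 states, 10 years, 25 people per cell
clear all
set seed 10101
set obs 40
generate state = _n
generate double phi = rnormal()             // state fixed effect
expand 10
bysort state: generate year = _n
expand 25
generate double x = rnormal()               // individual-level control
* States 1-20 are treated from year 6 onward
generate byte D = (state <= 20) & (year >= 6)
generate double y = 1 + phi + 0.3*year + 2*D + 0.5*x + rnormal()

* Two-way fixed-effects OLS with state-clustered standard errors
regress y D x i.state i.year, vce(cluster state)
scalar b_ols = _b[D]

* Same model using didregress: outcome and controls first, then treatment
didregress (y x) (D), group(state) time(year)

* Diagnostics: test of linear parallel trends and test for anticipation effects
estat ptrends
estat granger
```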


The related xtdidregress command can be used when panel data are 
available, rather than repeated cross-sections. Then the group-specific
effect $\phi_s$ in (25.10) is replaced by an individual-specific effect $\phi_i$.


The default standard errors are cluster-robust standard errors with 
clustering at the group level. The didregress command controls for group- 
specific effects using the areg command. The xtdidregress command 
controls for group-specific effects using the xtreg, fe command. Thus, 
these two commands use different formulas for degrees-of-freedom 
adjustments in computing cluster-robust standard errors. The areg formula
leads to larger standard errors, and we favor use of the xtreg, fe formula; 
see section 6.6.4 for a detailed discussion. 


A common complication in DID analysis is that there can be few clusters,
or few treated clusters, in which case cluster-robust standard errors lead to
tests with the wrong size and confidence intervals with the wrong coverage.
The wildbootstrap option provides t statistics and confidence intervals based
on the wild cluster bootstrap presented in section 12.6. The option vce(hc2)
bases inference on an alternative bias-corrected variance-covariance matrix of
the estimator, and the aggregate(dlang) option uses the aggregated data of
Donald and Lang (2007).
Developing better inference methods in this setting remains an active area of 
research. 
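
A minimal sketch of these inference options follows, using a hypothetical design with only 8 states, 3 of them treated. The DGP, the variable names, and the seed are assumptions for illustration, and the point is the syntax rather than the particular numbers obtained.

```stata
* Hypothetical design with few clusters: 8 states, 3 treated from year 6 onward
clear all
set seed 10101
set obs 8
generate state = _n
generate double phi = rnormal()             // state fixed effect
expand 10
bysort state: generate year = _n
expand 25
generate byte D = (state <= 3) & (year >= 6)
generate double y = 1 + phi + 0.3*year + 2*D + rnormal()

* Default cluster-robust inference
didregress (y) (D), group(state) time(year)
* Wild cluster bootstrap p-value and confidence interval
didregress (y) (D), group(state) time(year) wildbootstrap(rseed(10101))
* Bias-corrected cluster-robust variance estimate
didregress (y) (D), group(state) time(year) vce(hc2)
* Donald and Lang (2007) aggregation
didregress (y) (D), group(state) time(year) aggregate(dlang)
```

Comparing the confidence intervals across the four fits gives a rough sense of how sensitive inference is to the method when clusters are few.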


A key assumption of the DID method is the parallel trends assumption 
that $\gamma_t$ in (25.10) does not vary by treatment status. The estat trendplots
postestimation command provides a graphical diagnostic of the parallel 
trends assumption, and the estat ptrends postestimation command 
provides a test of linear parallel trends. The estat granger postestimation 
command provides a test of whether a TE occurs before the actual time of 
treatment. The estat grangerplot command plots coefficients from leads 
and lags of the treatment indicator variable. 


Some DID applications have differential timing of when treatment occurs.
Goodman-Bacon (2018) shows that the DID estimator is then a weighted
average of all possible two-group two-period DID estimators. Callaway and
Sant'Anna (2021) propose separate estimation for each distinct treatment,
followed by aggregation to obtain more precise estimates.
Wooldridge (2021) proposes a two-way Mundlak estimator
that is efficient under suitable assumptions and contrasts this estimator with
other recently proposed estimators.


The DID methodology is quite flexible and can be extended beyond the
setting considered here. Some extensions for nonlinear regressions are 
covered in Athey and Imbens (2006). Other extensions from the perspective 
of propensity scores and matching methods are covered in Lechner (2011). 
Huber (2019) provides many references. Miller (2021) reviews the closely 
related subject of event study models. 


25.6.2 Synthetic control 


Synthetic control methods measure the TE for a treated unit as the difference 
between the outcome for the treated observation and a single counterfactual 
(untreated) value of the outcome. The counterfactual, called a synthetic 
control, is a weighted average of the outcomes for untreated observations. 
The weights are selected so that the synthetic control is similar to the treated 
unit in a pretreatment period. 


An advantage of synthetic control compared with DID is that the parallel
trends assumption that changes over time are the same for treated and
untreated individuals, after controlling for group fixed effects and regressors,
can be relaxed. Synthetic control essentially allows for an interactive effect
$\phi_s\gamma_t$ in (25.10). Statistical inference, however, is challenging.


We illustrate synthetic control, with little explanation of the underlying 
theory, using the example of Abadie, Diamond, and Hainmueller (2010) who 
estimated the effect on cigarette consumption of a large-scale tobacco 
control program instituted in California in 1988. 


The analysis is based on the following panel dataset: 


. * Synthetic control: Smoking dataset and summary 
. qui use mus225smoking, clear 


. summarize, sep(7)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       state |      1,209          20    11.25929          1         39
        year |      1,209        1985    8.947973       1970       2000
     cigsale |      1,209    118.8932     32.7674       40.7      296.2
    lnincome |      1,014    9.861634    .1706769   9.397449   10.48662
        beer |        546     23.4304     4.22319        2.5       40.4
   age15to24 |        819     .175472    .0151589   .1294482   .2036753
    retprice |      1,209    108.3419    64.38199       27.3      351.2

The panel is one of 39 states observed over the 31 years from 1970 to 2000.
Some variables are not observed in all years. 


Synthetic control methods compare a treated unit, here California, with a
control unit that is a weighted sum of the untreated units. Let s denote a state
where $s = 1$ for the treated state and $s = 2, \dots, J + 1$ for the $J$ untreated
states. And let $t = 1, \dots, T_0, T_0 + 1, \dots, T$, where treatment occurs
between $T_0$ and $T_0 + 1$.


Time-invariant weights $w_s$, $s = 2, \dots, J + 1$, are selected so that in each
year of the pretreatment period, a weighted combination of untreated states
has similar outcome to that in the treated state:
$y_{1t} \approx \sum_{s=2}^{J+1} w_s y_{st}$, $t = 1, \dots, T_0$. These weights that sum
to one are determined by an algorithm given in Abadie, Diamond, and
Hainmueller (2010), for example, that uses data on the outcome and control
variables in the pretreatment period. Typically, the weights are zero for all
but a few untreated units.


Given these weights, the estimated TE is simply the posttreatment period 
difference between the outcome for California and the weighted average of 
the outcome in the untreated states. 
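
In the notation above, this estimator can be written out explicitly. Here $\widehat{w}_s$ denotes the estimated weights and $\widehat{\alpha}_{1t}$ is a symbol introduced only for this illustration to denote the estimated TE for the treated unit in posttreatment period $t$:

```latex
\widehat{\alpha}_{1t} = y_{1t} - \sum_{s=2}^{J+1} \widehat{w}_s \, y_{st},
\qquad t = T_0 + 1, \dots, T
```

There is one estimate per posttreatment period, so the TE is allowed to vary over time.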

For the current example, the variables used to determine the weights are beer
consumption per capita (beer), log state income per capita (lnincome), retail
price of cigarettes (retprice), and percent of state population aged 15–24
years (age15to24). These variables are used as averages over the entire
pretreatment period or that part of the pretreatment period for which they are
available. Additionally, the outcome cigsale in the selected years 1975,
1980, and 1988 is used.


The community-contributed synth command (Abadie, Diamond, and 
Hainmueller 2014) obtains the weights. The option trunit(3) identifies the
treated state—California appears as the third state in the panel. The option 
trperiod(1989) defines the posttreatment period as beginning in 1989. We 
have 


. * Synthetic control: synth command 
. tsset state year 


Panel variable: state (strongly balanced) 
Time variable: year, 1970 to 2000 
Delta: 1 unit 


. qui synth cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice
> age15to24 cigsale(1988) cigsale(1980) cigsale(1975), 
> trunit(3) trperiod(1989) figure 


The lengthy output from the command is omitted. It includes the weights 
that are 0 for all states except Colorado (0.285), Connecticut (0.101), 
Nevada (0.245), and Utah (0.369). The option figure gives a plot identical 
to the first panel of figure 25.1. 


The community-contributed synth_runner package (Galiani and
Quistorff 2017) is an extension to the synth command that provides useful 
graphs and statistics. We have 


. * Synthetic control: synth_runner command
. synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice
> age15to24 cigsale(1988) cigsale(1980) cigsale(1975),
> trunit(3) trperiod(1989) gen_vars
Estimating the treatment effects
Estimating the possible placebo effects (one set for each of the 1 treatment
> periods)
Total: 38
20.00s elapsed.

Conducting inference: 5 steps, and 38 placebo averages
Step 1... Finished
Step 2... Finished
Step 3... Finished
Step 4... Finished
Step 5... Finished

Post-treatment results: Effects, p-values, standardized p-values

             |  estimates      pvals  pvals_std
-------------+---------------------------------
          c1 |  -7.887098   .1315789          0
          c2 |  -9.693599   .1842105          0
          c3 |   -13.8027   .2105263          0
          c4 |    -13.344   .1315789          0
          c5 |   -17.0624   .1052632          0
          c6 |   -20.8943   .0789474          0
          c7 |   -19.8568   .1315789   .0263158
          c8 |   -21.0405   .1578947          0
          c9 |   -21.4914   .1052632   .0263158
         c10 |   -19.1642   .1842105   .0263158
         c11 |    -24.554   .1052632          0
         c12 |   -24.2687   .1052632   .0263158


The first column of the output gives the estimated TEs in each year. For 
example, the causal effect of the tobacco-control program is estimated to 
have reduced cigarette sales by 24.27 packs 12 years after the intervention. 
This is a very large effect because the sample average was 118.89 packs. The 
remaining columns of the output will be explained below. 


The postestimation command effect_graphs produces two plots that
compare the outcome over time for the treated unit with that for the synthetic 
control. 


. * Synthetic control: synth_runner postestimation command effect_graphs
. effect_graphs, trlinediff(-1) tc_options(ti("Treated & control outcomes"))
> effect_options(title("Difference between treated & control"))

. graph combine tc effect, xcommon ysize(2.5) xsize(6) scale(1.4)

Figure 25.1. Synthetic control: Outcomes and difference in
outcomes 


The first panel of figure 25.1 plots over time the outcome for California 
and for the synthetic control. The two are very similar in the pretreatment 
period, as expected because the weights are selected to ensure this. In the 
posttreatment period, the two diverge. The second panel of figure 25.1 plots 
the difference in the curves. For example, the endpoint in year 2000 is an 
estimated reduction of approximately 25; more precisely, this is the estimate 
of −24.27 reported in the output from the synth_runner command.


The postestimation command single_treatment_graphs produces two
plots that compare the outcome over time for the treated unit with those for 
each of the untreated units. 


. * Synthetic control: synth_runner postestimation command single_treatment_graphs
. single_treatment_graphs, trlinediff(-1) effects_ylabels(-50(10)50)
> do_color(gs11) effects_ymax(50) effects_ymin(-50)
> raw_options(title("Treated & donor outcomes"))
> effects_options(title("Difference between treated & donor outcomes"))
(6 real changes made)
(0 real changes made)

. graph combine raw effects, xcommon ysize(2.5) xsize(6) scale(1.3)

Figure 25.2. Synthetic control: Treated and donor outcomes


The first panel of figure 25.2 plots the outcomes over time for California 
(the solid line) and for all 38 other states. 


The solid line for California in the second panel of figure 25.2 is the 
same as that in the second panel of figure 25.1. The remaining curves are 
placebo curves where the synthetic control method is applied separately to 
each untreated state (with California omitted in forming the synthetic 
control). 


Considering the final year of 2000, for example, from the second panel 
of figure 25.2, one of the 38 placebos has a response less than −24.27, the
California response, and 3 have a response that exceeds 24.27. If
treatment was randomly assigned, then by a two-sided permutation test, the 
p-value is 4/38 = 0.10526316. This is the value given in the last entry in the 
second column of the output from the synth_runner command. The third 
column provides an alternative p-value that divides posttreatment effects for 
all units by pretreatment match quality as measured by root mean squared 
prediction error. 
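
The two-sided permutation p-value is simply the share of placebo effects that are at least as large in absolute value as the effect for the treated unit. The sketch below illustrates the calculation on hypothetical placebo effects drawn at random (the draws and the variable names are assumptions for illustration, not the placebo estimates from the California example).

```stata
* Hypothetical placebo effects for 38 donor units in one posttreatment year
clear all
set seed 10101
set obs 38
generate double placebo = rnormal(0, 15)
scalar treated = -24.27                     // effect for the treated unit

* Two-sided p-value: share of placebos at least as extreme in absolute value
generate byte extreme = abs(placebo) >= abs(scalar(treated))
quietly count if extreme
scalar pval = r(N)/_N
display "Two-sided permutation p-value = " pval
```

In the California example, 1 + 3 = 4 of the 38 placebos are at least as extreme, which gives the reported 4/38.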


In fact, treatment is not randomly assigned, but statistical inference for 
synthetic control under the alternative of nonrandom assignment is 
challenging. For robustness and diagnostic tests, and extensions to more than 
one treated state, see, for example, the recent detailed survey by 
Abadie (2021). 


25.7 Regression discontinuity design 


A natural experiment in econometrics is an observational setting, rather than a 
controlled experiment setting, in which an intervention is observed that is 
strictly exogenous and elicits a causal response in a variable of interest. If the 
intervention can be validly interpreted as “treatment”, then the nontreatment 
observations serve as a counterfactual, making possible inference about the 
causal impact of the intervention. A key feature of a natural experiment is that 
even though the intervention does not follow any experimental design, treated 
and nontreated samples are generated analogously to a randomized trial. Then a 
parameter such as ATE or ATET can be identified and estimated. 


RD design is a leading example of a natural experiment that is generated by 
the existence of a discontinuous treatment assignment based on a 
nonmanipulable threshold. An example is enrollment eligibility into the 
U.S. Medicare public health insurance program, which occurs when an 
individual crosses the threshold age of 65. One can estimate the average impact 
on healthcare use of Medicare eligibility by comparing samples of individuals 
on either side of the age 65 threshold, though this comparison needs to also 
control for the complication that healthcare use itself varies considerably with 
age. 


RD methods were originally proposed in the educational psychology 
literature as an alternative to RCT. Hahn, Todd, and Van der Klaauw (2001) 
provide the underlying theory. A key assumption, which is empirically testable, 
is that close to the point of discontinuity, also called the cutoff, the treated and 
untreated individuals are well matched in their observed characteristics. The 
estimated TE then applies to those who are close to the cutoff point; that is, the 
effect is “local” and possibly not capable of extrapolation to a larger and more 
inclusive population. 


In this section, we focus mainly on sharp regression discontinuity (SRD), the 
simplest and leading type of RD design, and use of the rdrobust package of 
Calonico, Cattaneo, and Titiunik (2014) and Calonico et al. (2017). A detailed 
presentation of RD methods is given in Lee and Lemieux (2010). 


25.7.1 Sharp regression discontinuity design 


In an RD design, treatment status is determined by the value taken by a
so-called running variable or forcing variable $x_i$. Under SRD, there is a known
nonmanipulable cutoff c such that no treatment occurs below the threshold,
while all cases that cross the threshold receive treatment. Thus, the treatment
variable $D_i = \mathbf{1}(x_i \geq c)$, so that $\Pr(D_i = 1|x_i \geq c) = 1$ and
$\Pr(D_i = 1|x_i < c) = 0$. This rule determining treatment is independent of
individual characteristics, and hence by design, there is no idiosyncratic or
selection element in treatment. All of those selected for treatment are
compliers.


Interest lies in the effect of treatment on an outcome variable $y_i$. Using
potential-outcome notation for a binary treatment, we are interested in
estimating the ATE $E(y_{1i} - y_{0i})$. For identification of the TE, additional
assumptions are needed. To rule out the possibility of confounding, we assume 
that there are no other possible sources of discontinuity in response such as a 
discontinuity in the running variable x or a discontinuity in the density function 
of the random-error term on the regression. Both sources of discontinuity must 
be ruled out by assumption. 


If TEs are homogeneous, then the estimated ATE is the estimate of $\alpha$ from
the OLS regression

$$y_i = \alpha D_i + f(x_i) + u_i$$

where $f(x_i)$ is a specified function of the running variable, such as a
polynomial in $x_i$. A richer model adds interaction terms $D_i \times f(x_i)$.


This parametric approach uses all the data but has the limitation of
requiring correct specification of the functional form for $E(y_i|D_i, x_i)$.


An alternative local approach considers only outcomes in the neighborhood 
of the threshold value c. Then, 


$$\mathrm{ATE}_{\mathrm{SRD}} = y^{+} - y^{-}$$


where $y^{+} = \lim_{x \to c^{+}} E(y_i|x_i = x)$ and $y^{-} = \lim_{x \to c^{-}} E(y_i|x_i = x)$. This
local approach allows for heterogeneous TEs.


A simple local approach compares sample means of y on either side of the 
threshold, but this fails to control for the fact that y varies with x in general and 
not just at c. The preferred approach is to instead use a local linear regression 
on either side of the threshold. This requires choice of how local to be, 
essentially choice of a kernel bandwidth, and can mean a loss of efficiency 
because only a subset of the data is being used. 
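
The contrast between the global parametric approach and the local linear approach can be sketched on simulated data. The DGP below is hypothetical (a quadratic in x with a jump of 80 at the cutoff c = 0), and the bandwidth h = 20 is picked by hand purely for illustration; the local fit uses a uniform kernel, that is, unweighted OLS on observations within h of the cutoff.

```stata
* Hypothetical sharp RD DGP: quadratic in x with a jump of 80 at the cutoff x = 0
clear all
set seed 10101
set obs 5000
generate double x = runiform(-70, 70)       // running variable
generate byte   D = x > 0                   // sharp treatment rule
generate double y = -10 + 80*D + 2*x - 0.025*x^2 + rnormal(0, 60)

* Global parametric fit: quadratic in x using all observations
regress y i.D c.x c.x#c.x, vce(robust)
scalar te_global = _b[1.D]

* Local linear fit: separate intercept and slope on each side within h of the cutoff
scalar h = 20                               // bandwidth chosen by hand here
regress y i.D##c.x if abs(x) <= scalar(h), vce(robust)
scalar te_local = _b[1.D]
display "Global quadratic TE = " te_global "   Local linear TE = " te_local
```

The local estimate uses fewer observations and so typically has a larger standard error, which is the efficiency cost noted above. In practice, the bandwidth would be chosen by a data-driven method rather than by hand.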


Under the assumption of local randomization of treatment assignment, 
conditional independence holds so that $y_{1i}, y_{0i} \perp D_i$. It is therefore important
that the threshold c be nonmanipulable. For example, if the running variable is 
the number of employees in a firm and government regulations favor firms 
with at most 20 employees, then manipulation may occur, leading to a 
bunching of firms at or just below 20 employees. 


25.7.2 SRD numerical and graphical illustration 


This subsection provides a numerical illustration of RD concepts and issues 
pertaining to the estimation of TEs. Subsequent subsections use the more 
specialized community-contributed commands rdplot and rdrobust to obtain 
graphs and estimates. 


A sample of size 500 is generated using a regression model for outcome y 
that is a quadratic function of a running variable x. The data-generating process 
(DGP) for y is subject to a single discontinuity at the midpoint of the sample 
where x = 0 because of the treatment assignment variable D that takes the 
value 1 if x > 0 and value 0 if x ≤ 0. For this DGP, the homogeneous TE is 80, 
with $y^{-} = -10$ and $y^{+} = 70$. 


. * SRD DGP: Quadratic in running variable x and TE 80 at x = 0 
. clear all 

. qui set obs 500 

. set seed 10101 

. generate x = 0.25*(_n - 250) + rnormal(0,5) // The running variable 

. generate xsq = x^2 

. generate D = x > 0 // The sharp cutoff 

. generate y = -10 + 80*D + 2*x - 0.025*xsq + rnormal(0,60) // The outcome 

. summarize y x D 

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           y |        500   -2.063917     126.896  -344.0993    250.661
           x |        500    .1132601    36.33044  -71.85318   69.72129
           D |        500          .5    .5005008          0          1


RD methods make extensive use of graphical tools because these provide a 
good initial feel for the presence of a significant discontinuity. Skipping the 
intermediate step of formally testing for discontinuity in x at the cutoff x = 0, 
we next display the data as a scatter diagram that also displays quadratic 
regressions (and associated 95% confidence intervals) fitted separately to the 
treated and nontreated observations. 


. * SRD design example: Scatterplot with separate global quadratic fits 
. twoway (scatter y x, xline(0) yline(-10, lpat(dash)) yline(70, lpat(dash)) 
> msize(vsmall) xtitle("Running variable x") ytitle("Outcome") leg(off)) 
> (qfitci y x if D==1, lcolor(black)) (qfitci y x if D==0, lcolor(black)), 
> title("RD: Scatterplot and global quadratic fits") 
Figure 25.3. SRD: Scatterplot and global or local regression fits 


The resulting graph is given in the first panel of figure 25.3. The two 
horizontal dashed lines plot the DGP values $y^{-} = -10$ and $y^{+} = 70$. The 
vertical line plots the discontinuity at x = 0. The estimated TE is the vertical 
difference between the two fitted curves at x = 0. 


The following regression fits the same two quadratic curves and estimates 
the TE. 


. * SRD design example: ATE estimated by global quadratic 
. generate Dx = D*x 


. generate Dxsq = D*xsq 


. regress y D x xsq Dx Dxsq, vce(robust) 


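Linear regression                               Number of obs     =        500
                                                F(5, 494)         =     338.35
                                                Prob > F          =     0.0000
                                                R-squared         =     0.7738
                                                Root MSE          =     60.656

------------------------------------------------------------------------------
             |               Robust
           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           D |   70.17662   15.70607     4.47   0.000     39.31769    101.0355
           x |   2.415727   .7527344     3.21   0.001     .9367714    3.894683
         xsq |  -.0128546   .0116833    -1.10   0.272    -.0358096    .0101004
          Dx |   .7923694   1.128322     0.70   0.483    -1.424533    3.009272
        Dxsq |   -.030622   .0172868    -1.77   0.077    -.0645867    .0033427
       _cons |  -12.61842   10.84235    -1.16   0.245    -33.92122    8.684382
------------------------------------------------------------------------------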


The estimated TE is 70.18 and is quite precisely estimated. 
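To see why the coefficient on D is the estimated TE, it may help to write out the fitted conditional mean and evaluate it on either side of x = 0. The coefficient labels $\beta_j$ and $\gamma_j$ below are our own shorthand for the coefficients on x, xsq, Dx, and Dxsq; the numbers are taken from the regression output above.

```latex
\begin{align*}
E(y \mid D, x) &= \beta_0 + \alpha D + \beta_1 x + \beta_2 x^2
                  + \gamma_1 D x + \gamma_2 D x^2, \\
y^{-} &= \lim_{x \uparrow 0} E(y \mid D = 0, x) = \beta_0,
  & \widehat{y}^{-} &= -12.618, \\
y^{+} &= \lim_{x \downarrow 0} E(y \mid D = 1, x) = \beta_0 + \alpha,
  & \widehat{y}^{+} &= -12.618 + 70.177 = 57.558, \\
y^{+} - y^{-} &= \alpha,
  & \widehat{\alpha} &= 70.177 .
\end{align*}
```

The interaction terms vanish at x = 0, so they change the fitted curves away from the cutoff but not the formula for the jump. The corresponding DGP values are −10, 70, and 80.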


The first panel of figure 25.3 used global quadratic regression functions. A 
nonparametric regression is the preferred alternative because it avoids 
functional form assumptions that could potentially distort even informal 
inference about the cutoff. The following code instead fits the regression 
curves using local linear nonparametric regression, introduced in section 2.6.6, 
with a triangular kernel and plugin choice of bandwidth. 


. * SRD design example: Scatterplot with separate local linear fits 
. twoway (scatter y x, xline(0) yline(-10, lpat(dash)) yline(70, lpat(dash)) 
> msize(vsmall) xtitle("Running variable x") ytitle("Outcome") leg(off)) 
> (lpolyci y x if D==1, kernel(triangle) bw(20) deg(1) lcolor(black)) 
> (lpolyci y x if D==0, kernel(triangle) bw(20) deg(1) lcolor(black)), 
> title("RD: Scatterplot and local linear fits") 


The second panel of figure 25.3 gives the resulting graph. The local linear 
curves are very similar to the quadratic fits, aside from at the lower and upper 
endpoints. Interest lies in the fitted curves at x = 0. Visually, the nonparametric 
estimates are similar to the quadratic estimates, being slightly below −10 at 
x = 0 for the left curve and slightly below 70 for the right curve. The 
nonparametric estimates at x = 0 are less precise than the parametric quadratic 
estimates because the 95% confidence intervals at x = 0 are wider. 


The ATE estimate is the first fitted value for observations with x > 0 less the 
final fitted value for observations with x < 0, here the 50th fitted value because 
the default for lpoly is to fit at min(N, 50) points. The ATE estimate can be 
obtained as follows: 


. qui lpoly y x if D==0, kern(tri) bw(20) deg(1) gen(xminus yminus) 
. qui lpoly y x if D==1, kern(tri) bw(20) deg(1) gen(xplus yplus) 


. di "ATE = " yplus[1] " - " yminus[50] " = " yplus[1]-yminus[50] 
ATE = 58.223343 - -14.019219 = 72.242562 


Implementation of this nonparametric approach requires choice of kernel 
and choice of bandwidth and a method to calculate the standard error of the 
estimated ATE. The subsequent application uses community-contributed 
commands that automate this process. 


An alternative and popular nonparametric method for viewing the 
relationship between the outcome y and the running variable x is to form bins 
of x and calculate the mean of y in each bin. The following provides a 
scatterplot of the binned estimates, using 20 equal-spaced bins on each side of 
the treatment threshold, as well as the same local linear curves as those already 
given in the second panel of figure 25.3. 


. * SRD design example: Binned data with separate local linear fits 
. sum x 

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           x |        500    .1132601    36.33044  -71.85318   69.72129

. scalar lowbw = (0 - r(min))/20 
. scalar highbw = (0 + r(max))/20 
. generate xbin = lowbw*floor(x/lowbw)+ lowbw/2 


. replace xbin = highbw*floor(x/highbw)+ highbw/2 
(500 real changes made) 


. bysort xbin: egen ybinmean = mean(y) 


. twoway (scatter ybinmean xbin, xline(0) xtitle("Bins of running variable x") 
> msize(vsmall) ytitle("Mean of outcome in each bin of x") legend(off)) 
> (lpoly y x if D==1, kern(tri) bw(20) deg(1) lcol(black) fcol(none)) 
> (lpoly y x if D==0, kern(tri) bw(20) deg(1) lcol(black) fcol(none)), 
> title("RD: Scatterplot of binned data") 


The first panel of figure 25.4 plots the fitted means and the fitted local 
linear curves. The outcome seems fairly continuous in the running variable, 
aside from the clear discontinuity in y at x = 0. An alternative to equal-spaced 
bins is to base the bins on quantiles of x. 
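The binning step can also be written out in a few lines. The sketch below is a plain-Python illustration (not the Stata code used above) that forms a fixed number of equal-width bins on each side of the cutoff and returns, for each nonempty bin, its midpoint, the mean of y, and the number of observations. The function name binned_means and the small dataset are hypothetical.

```python
import math


def binned_means(xs, ys, cutoff, n_bins_per_side):
    """Equal-width bins on each side of the cutoff.

    Returns a list of (bin midpoint, mean of y, count), ordered by x.
    Assumes there are observations strictly on both sides of the cutoff.
    """
    widths = {"left": (cutoff - min(xs)) / n_bins_per_side,
              "right": (max(xs) - cutoff) / n_bins_per_side}
    sums = {}
    for x, y in zip(xs, ys):
        side = "left" if x < cutoff else "right"
        k = math.floor((x - cutoff) / widths[side])
        # Keep the extreme observations inside the outermost bins.
        if side == "left":
            k = max(k, -n_bins_per_side)
        else:
            k = min(k, n_bins_per_side - 1)
        total, count = sums.get((side, k), (0.0, 0))
        sums[(side, k)] = (total + y, count + 1)
    result = []
    for (side, k), (total, count) in sorted(sums.items(),
                                            key=lambda item: item[0][1]):
        midpoint = cutoff + (k + 0.5) * widths[side]
        result.append((midpoint, total / count, count))
    return result


# Hypothetical data with two bins of width 2 on each side of cutoff 0.
xs = [-4.0, -3.5, -1.5, -1.0, 0.5, 1.0, 2.5, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 20.0, 22.0, 30.0, 34.0]
bins = binned_means(xs, ys, cutoff=0.0, n_bins_per_side=2)
for midpoint, mean_y, count in bins:
    print(midpoint, mean_y, count)
```

Defining the bins relative to the cutoff, separately on each side, guarantees that no bin straddles the threshold. The per-bin counts returned here are the same quantities that a visual check of the density of the running variable would plot.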
Figure 25.4. RD: Binned data scatterplot and check of discontinuity 
in the running variable 


An important assumption is that there is no discontinuity in the running 
variable around the threshold value of the running variable. A visual check is to 
again form equal-spaced bins of x, calculate the number of observations in 
each bin, plot these counts and an associated fitted nonparametric curve against 
x, and check for a discontinuity at the threshold value of x. We have 


. * SRD design example: Visual check no discontinuity in x at x = 0 
. bysort xbin: egen xcount = count(x) 


. twoway (scatter xcount xbin, xline(0) xtitle("Bins of running variable x") 
> msize(vsmall) ytitle("Count of x in each bin of x") legend(off)) 
> (lpoly xcount xbin if D==1, kern(triangle) bw(20) deg(1) lcol(black)) 
> (lpoly xcount xbin if D==0, kern(triangle) bw(20) deg(1) lcol(black)), 
> title("RD: Check discontinuity in running variable") 


The second panel of figure 25.4 suggests there is no discontinuity in the 
counts of x at the threshold x = 0. McCrary (2008) provides a formal 
nonparametric test based on this approach. 


25.7.3 rdrobust package and application 


The community-contributed rdrobust package of Calonico, Cattaneo, and 
Titiunik (2014) and Calonico et al. (2017) provides nonparametric estimates for 
various RD designs. There are three components. The rdplot command 
provides scatterplots of y against x with fitted global polynomial curves. The 
rdrobust command provides nonparametric estimates of the ATE using local 
polynomial regression, where bandwidths are data determined, and provides 
heteroskedastic-robust or cluster-robust standard errors for this estimate. A 
key component is determining the bandwidth of the local regression. The 
rdbwselect command provides a range of methods for determining 
bandwidths that involve a tradeoff between variance that decreases as 
bandwidth increases and bias that increases as bandwidth increases. 
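The bias-variance tradeoff in the bandwidth can be made concrete with a small deterministic calculation. The sketch below, in plain Python, is an illustration under assumptions of our own choosing: a hypothetical curved regression function m(x) = x² to the right of a cutoff at 0, an evenly spaced grid of x values, and a triangular kernel. Because a local linear fit at the cutoff is a weighted sum of the outcomes, its bias for a noiseless curve and its variance factor (the sum of squared weights, which multiplies the error variance under homoskedastic errors) can both be computed exactly.

```python
def boundary_weights(us, bandwidth):
    """Weights l_i such that the local linear fit at the cutoff equals
    sum(l_i * y_i), for observations at distances us from the cutoff."""
    ws = [max(0.0, 1.0 - abs(u) / bandwidth) for u in us]  # triangular kernel
    s0 = sum(ws)
    s1 = sum(w * u for w, u in zip(ws, us))
    s2 = sum(w * u * u for w, u in zip(ws, us))
    det = s0 * s2 - s1 * s1
    return [w * (s2 - s1 * u) / det for w, u in zip(ws, us)]


# Hypothetical design: 100 evenly spaced points to the right of the cutoff.
us = [0.05 + 0.1 * k for k in range(100)]
curve = [u * u for u in us]  # m(x) = x^2, so the true value at the cutoff is 0

results = {}
for h in (1.0, 2.0, 4.0):
    ls = boundary_weights(us, h)
    bias = sum(l * m for l, m in zip(ls, curve)) - 0.0
    var_factor = sum(l * l for l in ls)
    results[h] = (bias, var_factor)
    print("h =", h, " bias =", bias, " variance factor =", var_factor)
```

As the bandwidth grows, the absolute bias rises because the linear approximation is stretched over more curvature, while the variance factor falls because more observations receive weight.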


These commands provide a very wide range of options, and users of these 
commands should read at least the two articles cited above. In the remainder of 
this section, we merely provide simple examples of these commands, with 
application to the same dataset as that used by Calonico et al. (2017). 


The application is to U.S. Senate elections from 1914 to 2010. The running 
variable margin is the Democratic Party’s margin of victory in a U.S. Senate 
seat in year t, and the outcome variable vote is the Democratic Party’s share of 
the vote in the subsequent election for the same seat, an election that usually 
occurs in year t + 6. The hypothesis under consideration is that there is an 
incumbent advantage, so that a narrow win (loss) in one election is likely to 
lead to a win (loss) in the subsequent election. This is an SRD design with cutoff 
at margin = 0. 


Unlike many panel-data applications, the estimators used here are pooled 
estimators without fixed effects. Also, heteroskedastic-robust standard errors, 
reported here, are similar to cluster-robust standard errors. See section 25.7.8 
for discussion. 


. * SRD: Cattaneo et al. U.S. Senate elections data 
. qui use mus225rdsenate, clear 


. summarize 

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       state |      1,390    40.01367    21.99304          1         82
        year |      1,390     1964.63    28.05466       1914       2010
        vote |      1,297    52.66627    18.12219          0        100
      margin |      1,390    7.171159    34.32488       -100        100
       class |      1,390    2.023022    .8231983          1          3
  termshouse |      1,108    1.436823    2.357133          0         16
 termssenate |      1,108    4.555957    3.720294          1         20
  population |      1,390     3827919     4436950      78000   3.73e+07


The dataset includes additional variables, some of which are used below in 
validity tests. 


25.7.4 The rdplot command 


The community-contributed rdplot command provides a visual diagnostic that 
plots a variable y against the running variable x. If y is an outcome felt to be 
affected by the RD treatment, then the only discontinuity between y and x 
should be at the threshold value of x. If y is instead a variable unaffected by the 
RD treatment, such as a pretreatment variable, then there should be no 
discontinuity at the threshold value of x. 


The graph produced by rdplot has two components—a scatterplot of bin 
means of y against x and global polynomial regression curves fit separately on 
either side of the threshold value of x. 


The general syntax for the rdplot command is 

rdplot depvar runvar [if] [in] [, options] 


The option c() specifies the cutoff or threshold value of the running variable, 
with default c(0). The option p() specifies the degree of the global 
polynomial, with default a fourth-degree polynomial. The option ci() adds 
pointwise confidence intervals for each bin, and the option shade shades these. 
The remaining options mostly determine the associated scatterplot that depends 
on the number of bins, the spacing of the bins, and the way that bin 
(conditional) means are computed. 


The option binselect() provides eight different data-driven methods for 
determining the number of bins based on three underlying binary choices. The 
bins of x may be either equal spaced or quantile spaced. The bin means may be 
calculated using either spacing estimators or local polynomial estimators with 
the kern() option, allowing uniform (the default), triangle, or Epanechnikov 
kernels. A spacing estimator is based on ordered data within each bin and does 
not require any tuning parameters; see Calonico, Cattaneo, and Titiunik (2015). 
The number of bins is chosen to either minimize integrated mean squared error 
(MSE) or lead to binned sample means that have variability approximately equal 
to the amount of variability of the raw data. The latter method, termed 
“mimicking variance”, leads to undersmoothing and greater variability than 
integrated MSE. The default binselect(esmv) uses spacing estimators of the 
mean in equally spaced bins whose number is chosen by the mimicking-variance 
method. The option nbins() instead directly specifies the number of bins. 


We begin by providing a scatterplot of the outcome vote, the Democratic 
Party’s vote share in an election for a U.S. Senate seat, against the running 
variable margin, the party’s margin of victory in the preceding election (six 
years earlier) for the same seat. 


. * SRD: Scatterplot of the data 

. twoway scatter vote margin, msize(tiny) xline(0) 

> ytitle("Vote share at t+6") xtitle("Margin of victory at t") 
> title("Simple scatterplot of the data") 


The first panel of figure 25.5 shows that a higher margin of victory in one 
election leads on average to a higher vote share in the subsequent election. 
What is not clear, however, is whether there is an incumbency advantage that 
leads to a discontinuity at margin = 0, the threshold for election victory by the 
Democratic candidate. 
Figure 25.5. SRD: Scatterplot and binned plot 


The rdplot command, with command defaults, for the same data yields the 
following results. The associated graph is given in the second panel of 
figure 25.5. 


. * SRD: rdplot command for the same data using command defaults 

. rdplot vote margin, c(0) 

> graph_options(title("rdplot command using command defaults") 

> ytitle("Vote share at t+6") xtitle("Margin of victory at t")) 


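RD Plot with evenly spaced mimicking variance number of bins using spacings 
> estimators. 

         Cutoff c = 0 | Left of c  Right of c        Number of obs  =       1297
----------------------+----------------------        Kernel         =    Uniform
        Number of obs |       595         702
   Eff. Number of obs |       595         702
  Order poly. fit (p) |         4           4
     BW poly. fit (h) |   100.000     100.000
 Number of bins scale |     1.000       1.000

Outcome: vote. Running variable: margin.
---------------------------------------------
                      | Left of c  Right of c
----------------------+----------------------
        Bins selected |        15          35
   Average bin length |     6.667       2.857
    Median bin length |     6.667       2.857
----------------------+----------------------
    IMSE-optimal bins |         8           9
  Mimicking Var. bins |        15          35
----------------------+----------------------
Rel. to IMSE-optimal: |
        Implied scale |     1.875       3.889
    WIMSE var. weight |     0.132       0.017
    WIMSE bias weight |     0.868       0.983
---------------------------------------------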


The first set of output indicates that 595 observations were to the left of the 
cutoff or threshold of margin = 0, that 702 were to the right, and that the two 
curves are fourth-order polynomials that use the entire range of the running 
variable. The second set of output indicates that 15 equal-spaced bins were 
selected below the cutoff and 35 above. From the top of the output, the number 
of bins was selected using the mimicking variance method, and the bin means 
were computed using spacing estimators. The third set of output indicates that 
minimizing integrated MSE would lead to considerably fewer bins. 


The two fitted fourth-degree polynomial curves given in the second panel 
of figure 25.5 show a clear discontinuity at margin = 0, suggesting that there 
is an incumbency advantage. The bin means are consistent with this. Both the 
curves and binned means suggest an increasing relationship between vote and 
margin, aside from the five lowest bins. 


The global polynomials lead to the following estimate of the TE. 


. * SRD: ATE estimate using parametric fourth-order polynomials 
. generate d = margin > 0 


. qui regress vote i.d##c.margin##c.margin##c.margin##c.margin, vce(robust) 


. di "Parametric ATE = " %6.3f _b[1.d] " het-robust st. error = " %6.3f _se[1.d] 
Parametric ATE = 9.407 het-robust st. error = 1.659 


. qui regress vote i.d##c.margin##c.margin##c.margin##c.margin, vce(cluster state) 


. di "Parametric ATE = " %6.3f _b[1.d] " clu-robust st. error = " %6.3f _se[1.d] 
Parametric ATE = 9.407 clu-robust st. error = 1.749 


At the threshold of margin = 0, incumbency leads to an increase in the vote 
share in the subsequent election of 9.41 percentage points. This is a very large 
effect and is quite precisely estimated. The heteroskedastic-robust standard 
errors are similar to the cluster-robust standard errors. 


The limitation of this estimate of the TE is that it relies on correct functional 
form specification, and because all the data are used, it will be influenced by 
observations a long way from those in the neighborhood of margin = 0. 


25.7.5 The rdrobust command 


The community-contributed rdrobust command provides an estimate of ATE 
using only observations in the neighborhood of the cutoff value of the running 
variable. This estimate is a local-polynomial estimate using only data in the 
region of the cutoff and will vary with polynomial degree, kernel, and 
bandwidth. Furthermore, local polynomial estimators are biased, so a bias 
correction may be warranted. The command sets defaults for these various 
factors but allows the user great flexibility in departing from the defaults. 


The general syntax for the rdrobust command is 

rdrobust depvar runvar [if] [in] [, options] 


Table 25.3. Selected options of rdrobust command 

Options               Description
----------------------------------------------------------------------------------
c(cutoff)             the RD cutoff in the running variable
p(pvalue)             order of local polynomial for estimation (default p(1))
q(qvalue)             order of local polynomial for bias correction (default q(2))
kernel(kernelfn)      kernel function used (default kernel(triangular))
bwselect(bwmethod)    bandwidth selection procedure (default bwselect(mserd))
vce(vcemethod)        variance-covariance matrix estimator (default vce(nn 3))
all                   report three different ATE and VCE estimators
covs(covars)          additional covariates for estimation and inference
fuzzy(fuzzyvar)       the treatment variable used in fuzzy regression
                      discontinuity (FRD) estimation
deriv(dvalue)         set dvalue to 1 for kink RD estimation
----------------------------------------------------------------------------------


Table 25.3 provides a partial list of the comprehensive options. 


Heteroskedastic-robust and cluster-robust standard errors can be obtained 
based on either nearest-neighbor methods or residual-based standard errors; 
see Calonico et al. (2017) and articles cited therein for details. The covs() 
option is not required, but including pretreatment variables as additional 
covariates can improve estimator precision. For the moment, we consider only 
SRD and defer discussion of the fuzzy() and deriv() options. 


Using command defaults, aside from adding the all option, we obtain 


. * SRD: rdrobust command using command defaults aside from option all 
. rdrobust vote margin, c(0) all 


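Sharp RD estimates using local polynomial regression. 

      Cutoff c = 0 | Left of c  Right of c            Number of obs =       1297
-------------------+----------------------            BW type       =      mserd
     Number of obs |       595         702            Kernel        = Triangular
Eff. Number of obs |       359         322            VCE method    =         NN
    Order est. (p) |         1           1
    Order bias (q) |         2           2
       BW est. (h) |    17.708      17.708
       BW bias (b) |    27.984      27.984
         rho (h/b) |     0.633       0.633

Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
            Method |   Coef.    Std. Err.    z     P>|z|    [95% Conf. Interval]
-------------------+------------------------------------------------------------
      Conventional |   7.416     1.4604   5.0782   0.000    4.55378      10.2783
    Bias-corrected |  7.5099     1.4604   5.1425   0.000    4.64768      10.3722
            Robust |  7.5099     1.7426   4.3095   0.000    4.09441      10.9255
--------------------------------------------------------------------------------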


The command defaults use local linear regressions with a triangular kernel, 
the method generally favored by the literature. The key choice then is the 
bandwidth; see section 25.7.6 for a summary of the various options. The 
default bwselect(mserd) option leads to a large bandwidth so that 
observations with margin in the range (-17.708, 17.708) are included; this 
uses around one-half of the sample, so the efficiency loss of a local analysis 
should not be great. 


The reported conventional estimate of 7.416 equals $y^{+} - y^{-}$, where, for 
example, $y^{+}$ is the prediction at margin = 0 from local linear regression using 
observations with margin in the range (0, 17.708). The associated standard 
error of 1.4604 is computed using a heteroskedastic-robust estimate that by 
default is based on nearest-neighbor matching of residuals; the default uses 
three nearest neighbors. Alternative options use residual-based White standard 
errors, possibly with small-sample adjustment. The residual-based methods 
require a choice of bandwidth. The bandwidth used is the same as that chosen 
for estimation, a bandwidth that is not optimal for variance estimation and may 
lead to poor finite-sample performance. Cluster-robust versions of these 
methods are also available. Recall that a parametric fourth-order global 
polynomial led to estimate 9.407 and standard error 1.659. 


Nonparametric estimators such as local polynomial are biased because the 
bandwidth is chosen to minimize MSE, which involves a tradeoff between bias 
and variance. The reported bias-corrected estimate of 7.510 involves a 
recentering based on an estimate of the bias that by default uses a second-order 
local polynomial and a bandwidth of 27.984 in this application. The 95% 
confidence interval is then 7.5099 ± 1.960 × 1.4604. 


The bias-corrected estimate can perform poorly in finite samples because of 
noise in estimation of the bias. The reported robust estimate given in the final 
line of the output provides a larger standard error of 1.7426 that accounts for 
this additional source of estimation error. 
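As a check on how the last two rows of the output relate, the two intervals can be reproduced by hand from the displayed coefficient and standard errors. The only difference between them is the standard error that is used.

```latex
\begin{align*}
\text{Bias-corrected: } & 7.5099 \pm 1.960 \times 1.4604
  = 7.5099 \pm 2.8624 \;\Rightarrow\; (4.6475,\ 10.3723), \\
\text{Robust: } & 7.5099 \pm 1.960 \times 1.7426
  = 7.5099 \pm 3.4155 \;\Rightarrow\; (4.0944,\ 10.9254).
\end{align*}
```

These agree with the reported intervals up to rounding of the displayed inputs. Both intervals are centered on the bias-corrected estimate rather than on the conventional estimate of 7.416.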


The preceding local linear estimate of the ATE can be represented 
graphically by using the rdplot command with bandwidths specified to equal 
those selected by the rdrobust command, bandwidths that are stored in e(h_l) 
and e(h_r). We have 


. * SRD: rdplot of the local linear estimate of ATE obtained by rdrobust 
. qui rdrobust vote margin, c(0) 

. qui rdplot vote margin if -e(h_l) <= margin & margin <= e(h_r), 
> binselect(esmv) kernel(triangular) h(`e(h_l)' `e(h_r)') p(1) 
> graph_options(title("RD plot of the default ATE estimate") 
> ytitle("Vote share at t+6") xtitle("Margin of victory at t")) 


The first panel of figure 25.6 shows the two local linear lines fit for 
observations with margin in the ranges (-17.708, 0) and (0, 17.708). The 
estimated TE of 7.416 is the vertical distance between the two fitted lines at 
margin = 0. 
Figure 25.6. SRD: ATE estimate using local linear and local quadratic 
regression 


As a cautionary tale regarding overfitting, the second panel of figure 25.6 
presents an alternative estimate that uses a narrower bandwidth and a local 
quadratic estimate. 


. * SRD: rdplot of the local quadratic estimate with narrower bandwidth 
. qui rdplot vote margin if -10 <= margin & margin <= 10, 


> binselect(esmv) kernel(uniform) h(~-5° `57) p(2) 
> graph_options(title("RD plot with quadratic and narrower bw") 
> ytitle("Vote share at t+6") xtitle("Margin of victory at t")) 


For any given data application, the TE estimate will vary according to many 
estimator choices. There is general agreement that it is best to use local linear 
regression. The bandwidth choice then becomes crucial, and at a minimum, one 
should check the robustness of the estimate to variation in the bandwidth. 


The following illustrates how results change with departures from a 
reference estimate of local linear with a triangular kernel, bandwidths of 10 on 
either side of the cutoff, and standard errors based on nearest-neighbor 
matching of residuals. 


. * SRD: rdrobust estimates with different estimation settings 
. qui rdrobust vote margin, c(0) all h(10 10) rho(0.5) p(1) kernel(triangular) 

. estimates store P1tri 

. qui rdrobust vote margin, c(0) all h(10 10) rho(0.5) p(1) kernel(tri) vce(hc0) 

. estimates store sewhite 

. qui rdrobust vote margin, c(0) all h(10 10) rho(0.5) p(1) kernel(uniform) 

. estimates store P1unif 

. qui rdrobust vote margin, c(0) all h(10 10) rho(0.5) p(2) kernel(triangular) 

. estimates store P2tri 

. qui rdrobust vote margin, c(0) all h(15 15) rho(0.5) p(2) kernel(triangular) 

. estimates store wide 

. qui rdrobust vote margin, c(0) all h(5 5) rho(0.5) p(2) kernel(triangular) 

. estimates store narrow 

. estimates table P1tri sewhite P1unif P2tri wide narrow, b(%6.3f) se 
> stfmt(%6.0f) stats(N_h_l N_h_r) 

--------------------------------------------------------------------------
    Variable |  P1tri    sewhite    P1unif     P2tri      wide    narrow
-------------+------------------------------------------------------------
Conventional |   7.985     7.985     6.899    11.922     9.086    13.133
             |   1.838     1.831     1.722     2.718     2.241     3.594
Bias-corre~d |  11.922    11.922    10.390    14.896    12.659    10.619
             |   1.838     1.831     1.722     2.718     2.241     3.594
      Robust |  11.922    11.922    10.390    14.896    12.659    10.619
             |   2.718     2.660     2.670     3.406     2.939     5.363
-------------+------------------------------------------------------------
       N_h_l |     245       245       245       245       319       128
       N_h_r |     206       206       206       206       288       117
--------------------------------------------------------------------------
                                                              Legend: b/se


We focus on the conventional estimates. White standard errors are quite close 
to the nearest-neighbor standard errors. Changing to a uniform kernel reduces 
the TE from 7.985 to 6.899. The local quadratic estimate of 11.922 is much 
larger than the local linear estimate and has a much larger standard error. The 
wide and narrow estimates, as coded above, are local quadratic estimates with 
p(2), so they are most directly compared with P2tri. Widening the bandwidth 
from 10 to 15 lowers the estimate to 9.086 and the standard error to 2.241, 
while narrowing it to 5 raises the estimate to 13.133 and the standard error to 
3.594. Relative to the local linear reference estimate of 7.985, both estimates 
are larger and have larger standard errors. 


25.7.6 The rdbwselect command 


The community-contributed rdbwselect command selects the bandwidth used 
to estimate the RD TE. It is used within rdrobust and is also a stand-alone 
command. 


The command uses two general approaches to bandwidth selection, both of 
which are plugin methods rather than cross-validation. The first is a plugin 
formula that minimizes MSE and entails estimation of bias and variance. The 
second approach provides optimal coverage rates for confidence intervals and 
entails a rescaling of the first plugin formula. For details on these methods, see 
Calonico et al. (2017). Within each method, the bandwidth may be the same or 
may differ on either side of the cutoff, and the bandwidth may be chosen to be 
optimal for the sum of the regression estimates rather than the difference (the 
TE). 


The following example displays all available bandwidth selection methods. 


. * SRD: rdbwselect command showing all available bandwidth selection methods 
. rdbwselect vote margin, c(0) all 


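Bandwidth estimators for sharp RD local polynomial regression. 

      Cutoff c = 0 | Left of c  Right of c            Number of obs =       1297
-------------------+----------------------            Kernel        = Triangular
     Number of obs |       595         702            VCE method    =         NN
     Min of margin |  -100.000       0.036
     Max of margin |    -0.079     100.000
    Order est. (p) |         1           1
    Order bias (q) |         2           2

Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
                   |        BW est. (h)        |        BW bias (b)
            Method |   Left of c   Right of c  |   Left of c   Right of c
-------------------+---------------------------+--------------------------------
             mserd |      17.708       17.708  |      27.984       27.984
            msetwo |      16.154       18.009  |      27.096       29.205
            msesum |      18.326       18.326  |      31.280       31.280
          msecomb1 |      17.708       17.708  |      27.984       27.984
          msecomb2 |      17.708       18.009  |      27.984       29.205
-------------------+---------------------------+--------------------------------
             cerrd |      12.374       12.374  |      27.984       27.984
            certwo |      11.288       12.585  |      27.096       29.205
            cersum |      12.806       12.806  |      31.280       31.280
          cercomb1 |      12.374       12.374  |      27.984       27.984
          cercomb2 |      12.374       12.585  |      27.984       29.205
--------------------------------------------------------------------------------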


The various mse methods lead to bandwidths similar to those selected by the 
default mserd method. The biggest difference is that the cer methods lead to 
smaller estimation bandwidths, around 12 rather than 18, on both sides of the 
cutoff, while the bias bandwidths are unchanged. 
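The rescaling that links the mserd and cerrd estimation bandwidths can be checked numerically. The sketch below, in plain Python, assumes a rule-of-thumb shrinkage factor of n^(-p/((3+p)(3+2p))), where n is the sample size and p is the order of the local polynomial. This exponent is our understanding of the rule used by the package and is not stated in the text above, so it should be treated as an assumption; the function name cer_rescale is ours.

```python
def cer_rescale(h_mse, n_obs, p=1):
    """Shrink an MSE-optimal bandwidth by n^(-p / ((3 + p) * (3 + 2p))).

    The exponent is an assumed rule of thumb, not taken from the text.
    """
    return h_mse * n_obs ** (-p / ((3 + p) * (3 + 2 * p)))


# Inputs taken from the output above: mserd bandwidth and number of obs.
h_cer = cer_rescale(17.708, 1297, p=1)
print("rescaled bandwidth:", round(h_cer, 3))

# The same factor applied to the left-side msetwo bandwidth.
h_cer_left = cer_rescale(16.154, 1297, p=1)
print("rescaled left bandwidth:", round(h_cer_left, 3))
```

Under this assumption, the rescaled values can be compared with the cerrd and certwo rows of the output. With p = 1 the factor is n^(-1/20), so the shrinkage changes only slowly with the sample size.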


25.7.7 Fuzzy regression discontinuity design 


The SRD design assumes that on one side of the cutoff, all observations are 
untreated, while on the other side of the cutoff, all observations are treated. FRD 
design relaxes this sharp cutoff. If those above the cutoff are on average more 
likely to be treated, for example, then under FRD some of the individuals below 
the cutoff may be treated or some above the cutoff may not be treated, or 
both. 


Consider the local estimation approach. Intuitively, the SRD TE $y^{+} - y^{-}$ 
needs to be scaled up by dividing by the fraction of the sample who move from 
untreated to treated in the neighborhood of the cutoff. This leads to the FRD TE 

$$\mathrm{ATE}_{\mathrm{FRD}} = \frac{y^{+} - y^{-}}{D^{+} - D^{-}},$$

where $D$ is a binary indicator of actual treatment status, 
$D^{+} = \lim_{x \to c^{+}} E(D_i \mid x_i = x)$, and 
$D^{-} = \lim_{x \to c^{-}} E(D_i \mid x_i = x)$. 


The LATE framework of section 25.5 for binary treatment D and a binary 
instrument z is relevant. For FRD, the running variable x is essentially forming 
a binary instrument z that switches from 0 to 1 (or from 1 to 0) on either side 
of the threshold. As for LATE, we potentially have compliers, defiers, 
always-takers, and never-takers. We make the monotonicity assumption that $D_i(x)$ is 
nondecreasing (or nonincreasing) in $x$ at $x = c$ for all $i$ when treatment is more 
likely above (or below) the cutoff. This rules out defiers so that $D^{+} - D^{-}$ 
measures the fraction of compliers. 


It follows that FRD measures the TE for compliers. As for LATE, this 
restrictive interpretation limits the possibility of extrapolating the conclusion 
beyond the complier subpopulation. 


The local FRD estimate can be obtained as the ratio of an estimate of the 
numerator, obtained by local polynomial estimation of y on x on either side of 
the threshold, to an estimate of the denominator, obtained by local polynomial 
estimation of D on x on either side of the threshold. These two components 
can be computed using the separate commands rdrobust y x and rdrobust D x. 
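The ratio construction can be sketched in plain Python on hypothetical data where the answer is known. In the constructed example below, 10% of units are treated below the cutoff and 60% above, and the outcome equals a linear function of x plus 4 for treated units, so the TE for those who switch treatment status is 4. The function names and the data are ours, chosen for illustration only.

```python
def fit_at_cutoff(points, cutoff, bandwidth):
    """Intercept of a triangular-kernel weighted regression of v on
    (x - cutoff), using (x, v) pairs from one side of the cutoff."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, v in points:
        u = x - cutoff
        w = max(0.0, 1.0 - abs(u) / bandwidth)
        s0 += w
        s1 += w * u
        s2 += w * u * u
        t0 += w * v
        t1 += w * u * v
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


def jump_at_cutoff(xs, vs, cutoff, bandwidth):
    """Right-limit fit minus left-limit fit of v at the cutoff."""
    left = [(x, v) for x, v in zip(xs, vs) if x < cutoff]
    right = [(x, v) for x, v in zip(xs, vs) if x > cutoff]
    return (fit_at_cutoff(right, cutoff, bandwidth)
            - fit_at_cutoff(left, cutoff, bandwidth))


# Hypothetical data: 10 units at each x; 1 treated below 0, 6 treated above.
xs, ds, ys = [], [], []
for step in range(-20, 20):
    x = 0.25 + 0.5 * step              # grid from -9.75 to 9.75, never 0
    n_treated = 6 if x > 0 else 1
    for unit in range(10):
        d = 1 if unit < n_treated else 0
        xs.append(x)
        ds.append(d)
        ys.append(1.0 + 0.5 * x + 4.0 * d)

jump_y = jump_at_cutoff(xs, ys, cutoff=0.0, bandwidth=6.0)  # numerator
jump_d = jump_at_cutoff(xs, ds, cutoff=0.0, bandwidth=6.0)  # denominator
frd = jump_y / jump_d
print("jump in outcome:", jump_y)
print("jump in treatment probability:", jump_d)
print("fuzzy RD ratio:", frd)
```

The same bandwidth is used for the numerator and the denominator, in line with using one bandwidth for the outcome and treatment regressions. The sketch gives only the point estimate; a standard error for the ratio requires accounting for estimation error in both components.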


It is more convenient to use the fuzzy() option of the rdrobust command. 
The basic command is rdrobust y x, fuzzy(D), where D is the treatment 
indicator. This provides standard errors of the estimate and by default follows 
the recommended procedure of selecting the optimal bandwidth for the 
outcome local polynomial regression and using the same bandwidth for the 
treatment local polynomial regression. 


As illustration, we continue with the same dataset but change the treatment 
status. In SRD, no observations with margin < 0 are treated, and all 
observations with margin >= 0 are treated. Instead, we suppose 10% of 
observations with margin < 0 do get treated, and only 60% of observations 
with margin >= 0 are treated. We have 


. * FRD DGP: 10% below cutoff are treated and 60% above cutoff are treated 
. gen dtreat = margin > 0 

. set seed 10101 
. qui replace dtreat = 1 if (runiform() < 0.1 & margin < 0) 
. qui replace dtreat = 0 if (runiform() < 0.4 & margin > 0) 


The rdplot command can be used to plot the treatment indicator against 
the running variable using binned data and a fitted fourth-order polynomial 
curve. 


. * FRD: rdplot command for treatment indicator against margin 

. qui rdplot dtreat margin, c(0) 

> graph_options(title("Fuzzy rd: rdplot of treatment on running variable") 
> ytitle("Treatment indicator at t") xtitle("Margin of victory at t")) 


Figure 25.7 shows that around 10% of those below the cutoff were treated, 
and around 60% above the cutoff were treated. So one-half of the sample were 
compliers, switching from no treatment below the cutoff to treatment above the 
cutoff. We earlier estimated $y^{+} - y^{-}$ to equal 7.985, so we expect the FRD 
estimate for compliers to be approximately 7.985/0.5 = 15.97. 
Figure 25.7. SRD: ATE estimate using local linear and local quadratic 
regression 


Using the fuzzy() option of rdrobust yields 


. * Fuzzy RD: rdrobust with fuzzy() option 
. rdrobust vote margin, c(0) fuzzy(dtreat) all 


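Fuzzy RD estimates using local polynomial regression. 

      Cutoff c = 0 | Left of c  Right of c            Number of obs =       1297
-------------------+----------------------            BW type       =      mserd
     Number of obs |       595         702            Kernel        = Triangular
Eff. Number of obs |       380         338            VCE method    =         NN
    Order est. (p) |         1           1
    Order bias (q) |         2           2
       BW est. (h) |    18.831      18.831
       BW bias (b) |    32.219      32.219
         rho (h/b) |     0.584       0.584

First-stage estimates. Outcome: dtreat. Running variable: margin.
--------------------------------------------------------------------------------
            Method |   Coef.    Std. Err.    z     P>|z|    [95% Conf. Interval]
-------------------+------------------------------------------------------------
      Conventional |   .4475     .06318   7.0831   0.000    .323673      .571329
    Bias-corrected |  .44613     .06318   7.0613   0.000    .322298      .569954
            Robust |  .44613     .07393   6.0346   0.000    .301229      .591023
--------------------------------------------------------------------------------

Treatment effect estimates. Outcome: vote. Running variable: margin. Treatment Sta
> tus: dtreat.
--------------------------------------------------------------------------------
            Method |   Coef.    Std. Err.    z     P>|z|    [95% Conf. Interval]
-------------------+------------------------------------------------------------
      Conventional |  16.405     4.0324   4.0682   0.000    8.50133       24.308
    Bias-corrected |  16.569     4.0324   4.1091   0.000    8.66616      24.4728
            Robust |  16.569     4.7527   3.4863   0.000    7.25428      25.8847
--------------------------------------------------------------------------------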


A first-order local polynomial is fit with the same bandwidth for the outcome 
regression and the treatment regression. The first-stage conventional estimate 
of 0.4475 is an estimate of $D^{+} - D^{-}$. The command does not provide 
output on $y^{+} - y^{-}$, but it is 7.3412, slightly different from the 
earlier SRD estimate because the bandwidth is slightly different. This leads 
to the conventional treatment-effect estimate of 7.3412/0.4475 = 16.405, 
which is the estimated TE for compliers. 
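
The ratio logic behind the fuzzy RD estimate can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function name fuzzy_rd_wald is hypothetical, and the inputs are the jumps quoted in the text (7.985 and roughly 0.5 for the back-of-envelope value, 7.3412 and 0.4475 at the rdrobust bandwidth).

```python
def fuzzy_rd_wald(outcome_jump, treatment_jump):
    """Ratio of the outcome jump to the treatment-probability jump at the cutoff."""
    if treatment_jump == 0:
        raise ValueError("no first-stage jump: the ratio is undefined")
    return outcome_jump / treatment_jump


# Back-of-envelope value: earlier SRD jump divided by the designed jump of about 0.5
rough = fuzzy_rd_wald(7.985, 0.5)

# Value implied by the jumps estimated at the bandwidth chosen by rdrobust
reported = fuzzy_rd_wald(7.3412, 0.4475)

print(round(rough, 2), round(reported, 3))
```

The two values differ mainly because the estimated first-stage jump (0.4475) is somewhat smaller than the designed jump of one-half.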


25.7.8 Further discussion 


This application has applied a pooled regression to a state panel dataset. In 
many panel applications, fixed effects are used to identify a causal effect. In an 
RD design, however, it is unnecessary for identification to include unit-specific 
fixed effects. 


Also, with a state-year panel, it is standard to obtain cluster-robust standard 
errors, with clustering on state. Here the cutoff indicator variable DP has little 
within-state correlation, and adding option vce(cluster state) to the 
rdrobust command increases standard errors by around 5%. 


A useful check of the RD design is to perform placebo tests that replace the 
outcome variable with other variables that should not show a jump at the 
cutoff of the running variable. 


Here we do so using the population in the state and using the Senate vote in 
an earlier election. Senate elections are every two years on a rotating 6-year 
cycle, and variable vote is measured 6 years after variable margin; so to 
compute the vote six years before variable margin, we need to lag variable 
vote by 12 years or 6 2-year periods. We have 


. * SRD: Placebo test 
. qui rdplot population margin, c(0) 
> graph_options(title("Placebo rdplot using population") 
> ytitle("Population at t + 6") xtitle("Margin of victory at t")) 


. tsset state year 


Panel variable: state (unbalanced) 
Time variable: year, 1914 to 2010, but with gaps 
Delta: 1 unit 


. generate votelagged = 0 


. replace votelagged = l6.vote // Six two-year periods ago = 12 years 
(1,377 real changes made, 256 to missing) 


. qui rdplot votelagged margin, c(0) 
> graph_options(title("Placebo rdplot using lagged vote") 
> ytitle("Vote share at t - 6") xtitle("Margin of victory at t")) 


Figure 25.8 shows no jump at margin > 0, and the curves have the 
expected shape with little relationship between population and margin and an 
increasing relationship between the lagged vote and future margin of victory. 

Figure 25.8. SRD: Placebo plots 


The SRD and FRD designs consider a vertical jump in the outcome at the cutoff 
of the running variable. An extension, called the kink RD design, instead 
allows for a change in slope in the relationship between the outcome and the 
running variable. For details, see Card et al. (2015). The deriv(1) option of 
the rdrobust command estimates the TE in a kink RD design (up to scale), and 
the additional fuzzy() option estimates a fuzzy kink RD design. Identification 
relies on L'Hôpital's rule and no jump in the first stage, and the data 
demands for kink RD are much greater than those for SRD. 


25.8 Conditional quantile regression with endogenous regressors 


A major attraction of quantile regression (QR), introduced in chapter 15, is 
that it permits responses to changes in a key policy variable to differ across 
individuals. For example, a training program may have greater effect for 
individuals with lower conditional quantile of earnings. 


In many such applications, the policy variable is endogenous. For 
example, individuals may self-select into the training program. The 
extension from linear-model IV to quantile IV is challenging. Stronger 
assumptions are needed, different proposed methods make different 
assumptions, some methods are computationally difficult to implement, and 
this is still an active area of research. Some leading methods are based on the 
TEs methods for binary treatments. The discussion here is very brief; see the 
original articles for complete details. 


25.8.1 Local conditional quantile TE with endogenous binary regressor 


We consider estimation of the quantile treatment effect (QTE) with ET that is 
binary and an instrument that is also binary. 


Let $y$ denote a continuous outcome, $\mathbf{x}$ denote control variables, 
$D$ denote a binary treatment that takes value 0 or 1, and $z$ denote a binary 
instrument that takes value 0 or 1. Interest lies in estimating the 
conditional QTE 

$$\Delta_q(\mathbf{x}) = Q_q(y|D = 1, \mathbf{x}) - Q_q(y|D = 0, \mathbf{x})$$ 


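If $D$ and $\mathbf{x}$ are exogenous, then this is easily estimated as the 
coefficient of $D$ following conditional quantile regression (CQR) of $y$ on 
$\mathbf{x}$ and $D$ that minimizes 

$$Q(\boldsymbol{\beta}_q, \delta_q) = \sum_{i=1}^{N} \rho_q(y_i - D_i\delta_q - \mathbf{x}_i'\boldsymbol{\beta}_q)$$ 

where $\rho_q(\cdot)$ is the check function $\rho_q(u) = u\{q - \mathbf{1}(u < 0)\}$ 
introduced in section 15.2.1. 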
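
To make the role of the check function concrete, the following sketch evaluates the check loss u{q - 1(u < 0)} for a regression with only an intercept and confirms that it is minimized at a sample quantile. The helper names and the five data values are invented for illustration; restricting the search to observed values is a simplification that works because a minimizer always exists at a data point.

```python
import math


def check_loss(u, q):
    """Check function rho_q(u) = u * (q - 1(u < 0))."""
    return u * (q - (1.0 if u < 0 else 0.0))


def total_loss(a, ys, q):
    """Sum of check losses when every observation is fit by the constant a."""
    return sum(check_loss(y - a, q) for y in ys)


def sample_quantile(ys, q):
    """k-th smallest value with k = ceil(n*q), one minimizer of the check loss."""
    s = sorted(ys)
    k = max(1, math.ceil(q * len(s)))
    return s[k - 1]


ys = [3.0, 9.0, 1.0, 7.0, 5.0]   # illustrative data
q = 0.25

# Search over the observed values for the constant with the smallest loss
best = min(ys, key=lambda a: total_loss(a, ys, q))

print(best, sample_quantile(ys, q))
```

Adding regressors replaces the constant a with a linear index, which is the CQR problem stated above.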


Instead, we consider a binary treatment D that is endogenous. Abadie, 
Angrist, and Imbens (2002) proposed a local average QTE estimator that is an 
extension to quantile regression of the LATE estimator of Abadie (2003) for 
linear regression presented in section 25.5.3. 


For simplicity, consider the case that an increase in the instrument pushes 
individuals toward treatment. Ideally, treatment D = 1 when z = 1; such 
individuals are called compliers. It is possible that instead D = 0 when 
z = 1; such people are called defiers. The following analysis assumes that 
there are no defiers (individuals for whom D = 0 when z = 1) after 
additionally conditioning on x, an assumption also called the monotonicity 
assumption. 


The binary instrument z is assumed to satisfy four conditions, all of 
which apply after conditioning on x, that we briefly summarize as 1) z is 
independent of y and D; 2) z has no direct effect on y; 3) z is a relevant 
instrument; 4) there are no defiers. 


The method of Abadie, Angrist, and Imbens (2002) controls for the 
endogeneity of $D$ by using a weighted version of CQR. The local average 
conditional QTE at the $q$th quantile, $\Delta_q$, is estimated by 
$\hat{\delta}_q$, where $\hat{\delta}_q$ and $\hat{\boldsymbol{\beta}}_q$ 
minimize the weighted sum 

$$S(\boldsymbol{\beta}_q, \delta_q) = \sum_{i=1}^{N} \hat{\kappa}_i \rho_q(y_i - D_i\delta_q - \mathbf{x}_i'\boldsymbol{\beta}_q)$$ 

where the weights $\hat{\kappa}_i$ are consistent estimates of the same kappa 
weights $\kappa_i$ as defined in (25.9). Note that this reduces to the usual 
CQR estimator defined in chapter 15 if $\kappa_i = 1$ for all $i$. 


25.8.2 The ivqte command 


The community-contributed ivqte command, due to Frölich and 
Melly (2010), implements the preceding estimator as well as estimators for 
unconditional QTEs that are presented in the subsequent section. Running this 
command requires prior installation of the community-contributed moremata 
and kdens packages. 


The command is restricted to a binary treatment and binary instrument; a 
working paper version of Frölich and Melly (2013) discusses adaptation to 
nonbinary instrument or to multiple instruments. The syntax for the 
command varies with whether treatment is endogenous or exogenous and 
whether a conditional or unconditional QTE is desired. The default is to not 
compute the variance of estimated coefficients. The variance option 
computes the variance using an analytical formula that allows for 
heteroskedasticity. This variance estimate uses kernel methods, and it may 
be necessary to change some related settings from their default values. 
Alternatively, a bootstrap can be used. 


We revisit the CQR application of section 15.3 that uses data from the 
Medical Expenditure Panel Survey. The dependent variable ltotexp is the 
log of total medical expenditure by the Medicare elderly. The explanatory 
variables are an indicator for supplementary private insurance (suppins), 
one health-status variable (totchr), and three sociodemographic variables 
(age, female, white). We consider estimation at the median and use the 
ivqte command; the same estimates are obtained using qreg. The syntax of 
the ivqte command requires that a treatment variable be explicitly identified 
by being placed in parentheses and that it appear last. We initially treat all 
regressors as exogenous and obtain 


. * Conditional QTE of exogenous suppins using ivqte (same as qreg) 
. qui use mus203mepsmedexp, clear 


. drop if ltotexp ==. 
(109 observations deleted) 


. ivqte ltotexp totchr age female white (suppins), quantile(0.5) variance 

Quantile regression 
Estimator suggested in Koenker and Bassett (1978) 

Quantile:                .5 
Dependent variable:      ltotexp 
Regressor(s):            suppins totchr age female white 
Number of observations:  2955 

     ltotexp | Coefficient  Std. err.       z    P>|z|     [95% conf. interval] 
-------------+----------------------------------------------------------------- 
     suppins |    .2769771   .0572835     4.84   0.000     .1647034    .3892508 
      totchr |    .3942664   .0215262    18.32   0.000     .3520758     .436457 
         age |    .0148666   .0043557     3.41   0.001     .0063296    .0234036 
      female |   -.0880967   .0564343    -1.56   0.119    -.1987058    .0225124 
       white |    .4987457   .2051237     2.43   0.015     .0967107    .9007807 
       _cons |    5.648891   .3825663    14.77   0.000     4.899075    6.398707 


This standard conditional median regression yields a conditional QTE at the 
median of 0.277. 


We next treat suppins as endogenous, with binary instrument marry, 
which is an indicator of marital status; about 56% of the sample are married. 
This example is illustrative; it makes the questionable assumption that marry 
can be excluded from the model for ltotexp. 


The local average conditional QTE can be obtained using the aai option 
of the ivqte command. A local logit kernel regression (see section 17.8.4) is 
used to obtain $\hat{p}(z_i)$, which appears in the weights defined in (25.9). 
The kernel varies with the type of regressor, so the syntax of the ivqte 
command requires distinction between continuous, binary, and ordered discrete 
regressors. We use the default choices of kernel(), bandwidth() for 
continuous regressors, and lambda() for binary regressors. The default is to 
restrict the predicted probabilities $\hat{p}(z_i)$ to the interval 
(0.001, 0.999), which can lead to weights as large as 1,000 being given to 
some observations. The trim() option changes this default value. 
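
The link between trimming and extreme weights can be seen in a small numerical sketch. It assumes that the kappa weights of (25.9) take the standard Abadie (2003) form, 1 - D(1 - z)/(1 - p) - (1 - D)z/p with p the predicted probability that z = 1; equation (25.9) is not reproduced in this section, so that form is an assumption here. The function names are hypothetical, and the code is not a description of what ivqte does internally.

```python
def trim_probability(p, trim=0.001):
    """Restrict a predicted probability to the interval [trim, 1 - trim]."""
    return min(max(p, trim), 1.0 - trim)


def kappa_weight(d, z, p):
    """Assumed kappa form: 1 - d(1-z)/(1-p) - (1-d)z/p, with p = Pr(z = 1 | x)."""
    return 1.0 - d * (1 - z) / (1.0 - p) - (1 - d) * z / p


# A predicted probability very close to one is pulled back to 0.999
p = trim_probability(0.99999)

# Treated although the instrument is switched off: a very large negative weight
w_extreme = kappa_weight(d=1, z=0, p=p)

# Observations whose treatment agrees with the instrument receive weight one
w_agree_on = kappa_weight(d=1, z=1, p=p)
w_agree_off = kappa_weight(d=0, z=0, p=p)

print(p, w_extreme, w_agree_on, w_agree_off)
```

With the bound at 0.999, the extreme weight has magnitude close to 1/0.001 = 1,000, which is why a tighter trim() value can stabilize estimates when predicted probabilities approach 0 or 1.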


We obtain 


. * Conditional QTE of endogenous suppins using ivqte with aai option 


. ivqte ltotexp (suppins=marry), dummy(white female) continuous(totchr age) 
> quantile(0.5) var trim(0.01) generate_p(predprob) aai 


IV quantile regression 
Estimator suggested in Abadie, Angrist and Imbens (2002) 

Quantile(s):             .5 
Dependent variable:      ltotexp 
Treatment variable:      suppins 
Instrumental variable:   marry 
Control variable(s):     totchr age white female 
Number of observations:  2955 
Proportion of compliers: .063 

Propensity score estimated by local logit regression with h = infinity and 
> lambda = 1 
Positive weights estimated by local linear regression with h = infinity and 
> lambda = 1 
Variance estimated using local linear regression with h = infinity and lambda 
> = 1 

     ltotexp | Coefficient  Std. err.       z    P>|z|     [95% conf. interval] 
-------------+----------------------------------------------------------------- 
     suppins |    1.179383   .5203996     2.27   0.023     .1594189    2.199348 
      totchr |    .4590747   .1834216     2.50   0.012     .0995749    .8185745 
         age |   -.0087295   .0572944    -0.15   0.879    -.1210245    .1035655 
       white |    .2838135   1.349665     0.21   0.833    -2.361482    2.929109 
      female |   -.3344497   1.129468    -0.30   0.767    -2.548167    1.879267 
       _cons |    7.616818   5.273967     1.44   0.149    -2.719968     17.9536 


The conditional local QTE estimate is 1.179, exceptionally large and more 
than four times larger than the estimate assuming exogeneity of suppins. 
The associated standard error is nine times larger, increasing to 0.520, in part 
because the correlation between suppins and the instrument marry is low. 
Note also that the proportion of compliers is only 0.063, so this illustrative 
example is not well suited to this method. The predicted probabilities 
$\hat{p}(z_i)$, saved in the variable named predprob, range from 0.141 to 
0.918, so in this example, there was no need to trim. 


From additional regressions, the conditional local QTE estimate is 1.374 
at q = 0.25 and 1.032 at q = 0.75. These local QTE estimates apply to 
compliers. 


25.8.3 Alternative estimators of conditional QTE with endogeneity 


The preceding model is quite restrictive. For example, the effect of treatment 
might be interactive with some regressors, in which case multiple 
instruments would be needed. Several other estimators of the conditional QTE 
have been proposed. These are based on alternative models and assumptions 
and are not yet widely used, in part because of computational challenges. 


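Consider linear CQR with regressors $d$ and $\mathbf{x}$. At the $q$th 
quantile, the parameters minimize 
$S(\boldsymbol{\beta}_q, \delta_q) = \sum_i \rho_q(y_i - d_i\delta_q - \mathbf{x}_i'\boldsymbol{\beta}_q)$. 
When $d$ is instead endogenous and instruments $\mathbf{z}$ are available, 
Chernozhukov and Hansen (2008) provide theory that leads to the following 
objective function, 

$$S(\boldsymbol{\beta}_q, \delta_q, \boldsymbol{\gamma}_q) = \sum_{i=1}^{N} w_i \rho_q(y_i - d_i\delta_q - \mathbf{x}_i'\boldsymbol{\beta}_q - \mathbf{q}_i'\boldsymbol{\gamma}_q)$$ 

where in the simplest case $w_i = 1$ for all $i$ and the additional variables 
$\mathbf{q}_i$ are obtained by least-squares projection of $d_i$ on 
$\mathbf{z}_i$ and $\mathbf{x}_i$; even more simply, one might use 
$\mathbf{z}_i$. 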


The authors propose obtaining an estimate of $\delta_q$ as follows. First, for 
each of a range of values of $\delta_q$, run the ordinary CQR to obtain 
estimates $\hat{\boldsymbol{\beta}}_q(\delta_q)$ and 
$\hat{\boldsymbol{\gamma}}_q(\delta_q)$. Then, find that value of $\delta_q$ 
that makes the coefficient of the added variables $\mathbf{q}_i$ as small as 
possible by minimizing a quadratic form in 
$\hat{\boldsymbol{\gamma}}_q(\delta_q)$, where the weighting matrix is the 
inverse of the asymptotic variance of $\hat{\boldsymbol{\gamma}}_q(\delta_q)$. 
The method permits overidentified models and provides inference that is robust 
to weak identification. 
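
The grid-search idea can be sketched in a deliberately stripped-down special case: one binary instrument, no control variables, and the instrument itself used as the added variable. In that case, CQR of y - d*delta on a constant and z reduces to a difference in group quantiles, so no quantile regression routine is needed. The function names and toy data below are invented, and the sketch minimizes the absolute value of the coefficient rather than a variance-weighted quadratic form.

```python
import math


def sample_quantile(values, q):
    """k-th smallest value with k = ceil(n*q)."""
    s = sorted(values)
    k = max(1, math.ceil(q * len(s)))
    return s[k - 1]


def gamma_hat(delta, y, d, z, q):
    """Coefficient on binary z from CQR of (y - d*delta) on a constant and z."""
    v1 = [yi - di * delta for yi, di, zi in zip(y, d, z) if zi == 1]
    v0 = [yi - di * delta for yi, di, zi in zip(y, d, z) if zi == 0]
    return sample_quantile(v1, q) - sample_quantile(v0, q)


def iv_qr_grid(y, d, z, q, grid):
    """Value of delta on the grid that drives the coefficient on z closest to zero."""
    return min(grid, key=lambda delta: abs(gamma_hat(delta, y, d, z, q)))


# Toy data: the error u has the same distribution for z = 0 and z = 1,
# but with z = 1 only individuals with large u take the treatment.
u = [1.0, 2.0, 3.0, 4.0, 5.0]
z = [0] * 5 + [1] * 5
d = [0] * 5 + [1 if ui >= 3 else 0 for ui in u]
true_delta = 2.0
y = [true_delta * di + ui for di, ui in zip(d, u + u)]

grid = [0.5 * j for j in range(9)]          # 0, 0.5, ..., 4
delta_hat = iv_qr_grid(y, d, z, 0.5, grid)

# Naive comparison of medians by treatment status ignores the selection on u
naive = (sample_quantile([yi for yi, di in zip(y, d) if di == 1], 0.5)
         - sample_quantile([yi for yi, di in zip(y, d) if di == 0], 0.5))

print(delta_hat, naive)
```

In this constructed example the grid search recovers the value used to generate the data, while the naive difference in medians is inflated by selection into treatment.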


This estimator, often called IV QR, relies in part on the assumption of rank 
similarity, detailed below in section 25.8.4. It extends the original work of 
Chernozhukov and Hansen (2006). Recent work by Franguridi, Gafarov, and 
Wüthrich (2020) provides finite sample theory. 


CQR methods are motivated by distributional analysis, but the underlying 
data are often censored, often from below at zero, as is the case with 
individual expenditure data. Then we observe $y_i = \max(y_i^*, c)$, where 
$c$ is the censoring point and interest lies in the quantiles 
$Q_q(y^*|\mathbf{x}) = \mathbf{x}'\boldsymbol{\beta}_q$. 


Chernozhukov et al. (2019) propose an alternative method for CQR with a 
continuous endogenous regressor, one that can additionally allow for 
censoring; see section 15.3.11. The community-contributed cqiv command 
(Chernozhukov et al. 2019) implements these methods. 


Machado and Santos Silva (2019) provide yet another approach that 
restricts attention to the location-scale linear regression model 


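$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + g(\mathbf{w}_i'\boldsymbol{\gamma}) \times u_i$$ 


where $\mathbf{w}_i$ are transformations of $\mathbf{x}_i$ and $g(\cdot) > 0$ 
is a specified function. Then conditional quantiles are linear in x with 


$$Q_q(y_i|\mathbf{x}_i) = \mathbf{x}_i'\boldsymbol{\beta} + g(\mathbf{w}_i'\boldsymbol{\gamma})\, Q_q(u_i|\mathbf{x}_i)$$ 


In this model, $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$, and the $q$th 
conditional quantile of the error given x determine the $q$th conditional 
quantile of y given x. The assumptions that $u_i$ is independent and 
identically distributed with $E(u) = 0$ and the normalization that 
$E(|u|) = 1$ yield two moment conditions that enable estimation of 
$\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ by two OLS regressions. The 
quantile $Q_q(u_i|\mathbf{x}_i)$ can then be estimated from the scaled errors 
$(y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}})/g(\mathbf{w}_i'\hat{\boldsymbol{\gamma}})$. 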
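
The two-regression logic can be traced in a toy calculation. The sketch assumes the simplest version of the location-scale model, with g(.) the identity and w made up of a constant and a single regressor x; the function names and data are invented, and the errors are chosen to have mean zero and mean absolute value one so that the steps can be checked by hand.

```python
import math


def ols_line(x, y):
    """Intercept and slope from OLS of y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def sample_quantile(values, q):
    """k-th smallest value with k = ceil(n*q)."""
    s = sorted(values)
    k = max(1, math.ceil(q * len(s)))
    return s[k - 1]


# Toy data: y = 1 + 2x + (1 + 0.5x)u, with u crossed with x
x_vals = [1.0, 2.0, 3.0]
u_vals = [-1.5, -0.5, 0.5, 1.5]          # mean 0, mean |u| = 1
x = [xv for xv in x_vals for _ in u_vals]
u = [uv for _ in x_vals for uv in u_vals]
y = [1.0 + 2.0 * xi + (1.0 + 0.5 * xi) * ui for xi, ui in zip(x, u)]

# Step 1: OLS of y on x gives the location parameters
b0, b1 = ols_line(x, y)
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

# Step 2: OLS of |residual| on x gives the scale parameters, using E|u| = 1
g0, g1 = ols_line(x, [abs(r) for r in resid])

# Step 3: quantile of the scaled residuals estimates the error quantile
q = 0.75
scaled = [r / (g0 + g1 * xi) for r, xi in zip(resid, x)]
qu = sample_quantile(scaled, q)

# Implied coefficients of the q-th conditional quantile of y given x
intercept_q = b0 + g0 * qu
slope_q = b1 + g1 * qu

print(b0, b1, g0, g1, qu, intercept_q, slope_q)
```

Here the effect of x on the 0.75 quantile (2.25) exceeds the location effect (2) because the scale increases with x, which is the sense in which the model lets quantile effects vary with q.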


Machado and Santos Silva (2019) extend this approach to the case where 
some of the regressors are endogenous and instruments are available. Then 
the model is qualitatively similar to the model of Chernozhukov and 
Hansen (2008), being more flexible by allowing nonlinear quantile effects 
but more restrictive because Chernozhukov and Hansen (2008) essentially 
allow for random coefficients. Estimation is much simpler than the method 
proposed by Chernozhukov and Hansen (2008) and can be performed using 
the community-contributed program ivqreg2, due to Machado and Santos 
Silva (2019). 


25.8.4 Rank invariance and rank similarity 


Interpreting QTE when treatment is not randomly assigned requires additional 
assumptions. The local conditional QTE relies on the assumption of 
monotonicity of treatment choice. Other methods can require rank invariance 
or rank similarity. 


For a binary treatment, let $Y_1$ and $Y_0$ denote the potential outcomes 
under, respectively, treatment or no treatment; for a given individual, we 
observe only one of $Y_1$ or $Y_0$. The rank of an individual is given by the 
cumulative distribution function (c.d.f.), which for a continuous random 
variable takes values that are uniformly distributed on $(0, 1)$. Define the 
potential rank if treated as $U_1 = F_1(Y_1)$ and the potential rank if not 
treated as $U_0 = F_0(Y_0)$. Then $U_1 \sim U(0, 1)$ and $U_0 \sim U(0, 1)$. 


Let potential outcomes depend on observables $\mathbf{x}$ and an unobservable 
scalar $v$, so $Y_1 = g_1(\mathbf{x}, v)$ and $Y_0 = g_0(\mathbf{x}, v)$. Then 
the potential ranks are 
$U_1 = F_1\{g_1(\mathbf{x}, v)\}$ and $U_0 = F_0\{g_0(\mathbf{x}, v)\}$. 


Under rank invariance, also called rank preservation, after one controls 
for observables and a scalar unobservable, a person's potential rank is the 
same with treatment or without treatment. Thus, 
$U_1|(\mathbf{x}, v) = U_0|(\mathbf{x}, v)$ for all $(\mathbf{x}, v)$. This in 
turn implies that, conditioning on the observables $\mathbf{x}$ alone, 
$U_1|\mathbf{x}$ and $U_0|\mathbf{x}$ have identical distribution for all 
$\mathbf{x}$. 


Rank similarity weakens this last result to say that, conditioning on both 
observables $\mathbf{x}$ and the unobservable $v$, $U_1|(\mathbf{x}, v)$ and 
$U_0|(\mathbf{x}, v)$ have identical distribution for all $(\mathbf{x}, v)$. 
This allows for random departures from rank invariance, qualitatively similar 
to a student with given ability being ranked differently from test to test, 
due purely to randomness in performance from test to test. 


Melly and Wüthrich (2017) provide a discussion of the role in QR of 
assumptions of rank invariance and rank similarity and also monotonicity 
assumptions. Dong and Shen (2018) present nonparametric tests for rank 
invariance and for rank similarity. 


25.9 Unconditional quantiles 


An important distinction is that between the ME of changing a regressor on 
features of the conditional distribution of y|x and the ME of changing a 
regressor on features of the unconditional distribution of y. The presentation 
of QR in chapter 15 and section 25.8 so far has considered the former case, 
but the latter case can also be of great interest for policy makers. 


The literature on unconditional quantiles has focused on the TEs case, 
where interest lies in the ME of changing a binary treatment. For example, 
interest may lie in the effect of a training program on the 25th percentile of 
earnings, rather than on the 25th percentile of earnings conditional on 
control variables. This is called the unconditional QTE. 


For linear regression, the conditional and unconditional MEs are the 
same, equaling $\boldsymbol{\beta}$. For quantile regression, there is a difference. As pointed 
out by Frölich and Melly (2010), 


The interpretation of the unconditional effects is slightly different from 
the interpretation of the conditional effects, even if the conditional QTE 
is independent from the value of X. This is because of the definition of 
the quantile. For instance, if we are interested in a low quantile, the 
conditional QTE will summarize the effect for individuals with relatively 
low Y even if their absolute level of Y is high. The unconditional QTE, 
on the other hand, will summarize the effect with a relatively low 
absolute Y. 


Borah and Basu (2013) discuss in detail the distinction between 
conditional and unconditional quantile regression. 


In this section, we present several methods. The first specializes to the 
case of binary treatment D and computes the unconditional QTE as the 
difference in weighted quantile estimates according to whether D = 1 or 
D = 0, where the weights are determined by the probability of treatment 
given regressors x. The second approach uses influence functions that 
consider only small changes in the treatment. The third approach obtains 
semiparametric estimates of conditional c.d.f.'s or conditional quantiles and 
then integrates over regressors x. The section concludes by extending the 
first approach to allow treatment D to be endogenous. 


In the current example, the explanatory power of the regressors is low, 
with an $R^2$ from OLS regression of 0.19. So we might expect unconditional 
QTEs to be not too different from conditional QTEs; both depend greatly on 
variation due to unobservables. 


25.9.1 Unconditional QTE using inverse-propensity score weighting 


Firpo (2007) estimates the unconditional QTE when the effects of a binary 
treatment vary across individuals by using inverse-propensity score 
weighting, a standard method used in the TEs literature; see chapter 24. It is 
assumed that regressors control for any selection effects and that the 
treatment is rank preserving, so that if an individual is in the qth quantile 
without treatment, he or she is also in the qth quantile with treatment. The 
estimation method does not actually require quantile regression. 


We wish to estimate $\Delta_q = Q_q(y|D = 1) - Q_q(y|D = 0)$, controlling 
for regressors $\mathbf{x}$. Let $\hat{p}(\mathbf{x}_i)$ be an estimate of 
$\Pr(D_i = 1|\mathbf{x}_i)$, the probability of treatment. For a treated 
observation, we weight by the inverse of the predicted probability of 
treatment, while for a nontreated individual, we weight by the inverse of the 
probability of nontreatment. 


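Specifically, from section 15.1, the $q$th raw quantile $\alpha_q$ minimizes 
$\sum_{i=1}^{N} \rho_q(y_i - \alpha_q)$, where $\rho_q(\cdot)$ is the check 
function. For observations with $D_i = 1$, we estimate $Q_q(y|D = 1)$ by 
$\hat{\alpha}_{1q}$, which minimizes the weighted sum 
$\sum_{i:D_i=1} w_{1i}\rho_q(y_i - \alpha_{1q})$, where 
$w_{1i} = 1/\hat{p}(\mathbf{x}_i)$. For $Q_q(y|D = 0)$, we use 
$\hat{\alpha}_{0q}$, which minimizes 
$\sum_{i:D_i=0} w_{0i}\rho_q(y_i - \alpha_{0q})$, where 
$w_{0i} = 1/\{1 - \hat{p}(\mathbf{x}_i)\}$. 

Combining, we equivalently estimate $\Delta_q$ by 
$(\hat{\alpha}_{1q} - \hat{\alpha}_{0q})$, where $\hat{\alpha}_{1q}$ and 
$\hat{\alpha}_{0q}$ minimize 

$$Q(\alpha_{1q}, \alpha_{0q}) = \sum_{i=1}^{N} w_i \rho_q\{y_i - D_i\alpha_{1q} - (1 - D_i)\alpha_{0q}\} \qquad (25.11)$$ 


where the inverse-propensity score weights are 


$$w_i = \frac{D_i}{\hat{p}(\mathbf{x}_i)} + \frac{1 - D_i}{1 - \hat{p}(\mathbf{x}_i)}$$ 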
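
The weighting scheme can be illustrated with a hand-checkable example. The sketch computes a weighted sample quantile separately for treated and nontreated observations, with weights 1/p for the treated and 1/(1 - p) for the nontreated, and takes the difference. The function names and the eleven observations are invented, and the treatment probabilities are taken as known rather than estimated.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight reaches the share q of total weight."""
    pairs = sorted(zip(values, weights))
    target = q * sum(weights)
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum >= target:
            return v
    return pairs[-1][0]


def ipw_qte(y, d, p, q):
    """Difference of inverse-propensity weighted quantiles, treated minus nontreated."""
    y1 = [yi for yi, di in zip(y, d) if di == 1]
    w1 = [1.0 / pi for pi, di in zip(p, d) if di == 1]
    y0 = [yi for yi, di in zip(y, d) if di == 0]
    w0 = [1.0 / (1.0 - pi) for pi, di in zip(p, d) if di == 0]
    return weighted_quantile(y1, w1, q) - weighted_quantile(y0, w0, q)


# Toy data: low-outcome group (p = 0.5) followed by high-outcome group (p = 0.8)
y = [1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 13.0, 8.0]
d = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
p = [0.5] * 6 + [0.8] * 5

qte_ipw = ipw_qte(y, d, p, 0.5)

# Unweighted comparison: set every weight to one by passing p = 0.5 for the
# treated side and the nontreated side alike (equal weights within each side)
qte_naive = ipw_qte(y, d, [0.5] * len(y), 0.5)

print(qte_ipw, qte_naive)
```

The unweighted comparison is dominated by the high-outcome group, which is overrepresented among the treated; reweighting restores the balance and gives a much smaller median effect in this constructed example.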


This estimator can be obtained using the ivqte command (Frölich and 
Melly 2010), introduced in section 25.8.2. The command syntax is similar to 
that presented for local conditional QTE estimation, except that the aai option 
is dropped and an instrument is not relevant. We obtain 


. * Unconditional QTE of suppins on ltotexp using community-contributed 
> command ivqte 
. ivqte ltotexp (suppins), dummy(white female) continuous(totchr age) 
> quantile(0.25(0.25)0.75) variance trim(0.1) 

Unconditional Quantile Treatment Effects under exogeneity 
Estimator suggested in Firpo (2007) 

Quantile(s):             .25 .5 .75 
Dependent variable:      ltotexp 
Treatment variable:      suppins 
Control variable(s):     totchr age white female 
Number of observations:  2955 

Propensity score estimated by local logit regression with h = infinity and lambda 
> = 1 
Variance estimated using local logit regression with h = infinity and lambda = 1 

     ltotexp | Coefficient  Std. err.       z    P>|z|     [95% conf. interval] 
-------------+----------------------------------------------------------------- 
  Quantile_1 |    .3989425    .074762     5.34   0.000     .2524117    .5454732 
  Quantile_2 |    .2826729   .0643581     4.39   0.000     .1565333    .4088124 
  Quantile_3 |    .2005053    .073077     2.74   0.006      .057277    .3437335 


The unconditional QTEs of suppins in this example are similar to the 
conditional QTEs obtained by standard CQR using the qreg command; the 
biggest difference is at the upper quartile. 


. * Corresponding conditional QTE of suppins on ltotexp using qreg command 
. forvalues i = 1/3 { 
  2.   local j = `i'/4 
  3.   qui qreg ltotexp suppins totchr age white female, quant(`j') vce(robust) 
  4.   di "q= " `j' " b = " _b[suppins] " se = " _se[suppins] 
  5. } 
q= .25 b = .38583946 se = .05991632 
q= .5 b = .27697708 se = .05346786 
q= .75 b = .14885476 se = .06202515 


25.9.2 Unconditional QTE using recentered influence functions 


Firpo, Fortin, and Lemieux (2009) propose a measure of the unconditional 
quantile effect that can be simply obtained by OLS regression with dependent 
variable the recentered influence function (RIF) for the qth quantile, where 
the influence function measures the relative influence of individual 
observations on the value of a statistic. 


Specifically, the RIF at $y_q$, the $q$th quantile of the dependent variable 
with value $y$, is 


$$\mathrm{RIF}(y; y_q) = y_q + \frac{q - \mathbf{1}(y \le y_q)}{f_Y(y_q)}$$ 


where $f_Y(y_q)$ is the density of $Y$ at $y_q$. 
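
A short sketch may help fix ideas about how the dependent variable of a RIF regression is built. It reads the RIF as y_q + {q - 1(y <= y_q)}/f_Y(y_q) and estimates the density with a Gaussian kernel at a fixed, hand-picked bandwidth. The function names, the four data values, and the bandwidth are illustrative assumptions and do not reproduce the defaults of rifreg.

```python
import math


def sample_quantile(values, q):
    """k-th smallest value with k = ceil(n*q)."""
    s = sorted(values)
    k = max(1, math.ceil(q * len(s)))
    return s[k - 1]


def gaussian_kde_at(point, ys, bandwidth):
    """Gaussian kernel density estimate of f_Y evaluated at one point."""
    n = len(ys)
    total = sum(math.exp(-0.5 * ((point - yi) / bandwidth) ** 2) for yi in ys)
    return total / (n * bandwidth * math.sqrt(2.0 * math.pi))


def rif_quantile(ys, q, bandwidth):
    """Recentered influence function of the q-th quantile for every observation."""
    yq = sample_quantile(ys, q)
    f = gaussian_kde_at(yq, ys, bandwidth)
    return [yq + (q - (1.0 if yi <= yq else 0.0)) / f for yi in ys]


ys = [1.0, 2.0, 3.0, 4.0]                  # illustrative outcome values
rif = rif_quantile(ys, 0.5, bandwidth=1.0)  # bandwidth chosen by hand
mean_rif = sum(rif) / len(rif)

print(rif, mean_rif)
```

The RIF takes only two distinct values, one for observations at or below the quantile and one for those above, and its sample mean equals the quantile itself when the share of observations at or below the quantile is exactly q. The unconditional quantile regression is then an OLS regression of these values on the regressors.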


An attraction of RIF regression is that this measure of the unconditional 
QTE can be used for continuous regressors, not just discrete regressors. The 
disadvantage is that it is akin to a derivative that considers only local 
changes in the regressors. 


The RIF-based estimate can be implemented using the authors’ rifreg 
command. Results depend crucially on the estimate of the density $f_Y(y_q)$. 
The rifreg command uses the kdensity command with default bandwidth 
the “optimal” width used by kdensity and with default kernel the Gaussian. 
Users should save and plot the kernel density estimates to see whether they 
appear reasonable. 


We first obtain estimates for the 25th percentile of ltotexp, using the 
generate() option to save the kernel density estimates. 


. * Unconditional QTE of suppins at 25th percentile using community-contributed 
> command rifreg 
. rifreg ltotexp suppins totchr age female white, quantile(.25) 
> generate(yval kdensval) 

      Source |       SS           df       MS      Number of obs   =      2955 
-------------+----------------------------------   F(5, 2949)      =    115.81 
       Model |  1348.61919         5  269.723837   Prob > F        =    0.0000 
    Residual |  8580.69514      2949  2.90969655   R-squared       =    0.1358 
-------------+----------------------------------   Adj R-squared   =    0.1344 
       Total |  9929.31432      2954  3.36131155   Root MSE        =    1.7058 

             |               Robust 
      rif_25 | Coefficient  std. err.       t    P>|t|     [95% conf. interval] 
-------------+----------------------------------------------------------------- 
     suppins |    .3996899    .065752     6.08   0.000     .2707653    .5286144 
      totchr |    .4758505   .0218392    21.79   0.000     .4330288    .5186722 
         age |    .0169063   .0050117     3.37   0.001     .0070796    .0267331 
      female |   -.0100766   .0639374    -0.16   0.875    -.1354431      .11529 
       white |    .5814719   .2226578     2.61   0.009     .1448915    1.018052 
       _cons |    4.357429   .4249414    10.25   0.000     3.524217    5.190641 


rifreg provides unconditional quantile estimates for all regressors. These 
are similar to the conditional estimates obtained from qreg, aside from 
female, which is highly statistically insignificant, and white. The command 
scatter kdensval yval gives a plot that suggests the kernel density 
estimates are reasonable. 


Focusing on just the QTE for suppins at the quartiles, we obtain 


. * Unconditional QTE of suppins at quartiles using community-contributed 
> command rifreg 
. forvalues i = 1/3 { 
  2.   local j = `i'/4 
  3.   qui rifreg ltotexp suppins totchr age white female, quantile(`j') 
  4.   di "q= " `j' " b = " _b[suppins] 
  5. } 
q= .25 b = .39968985 
q= .5 b = .27289201 
q= .75 b = .16171682 


By comparison, from section 25.9.1 the method of Firpo (2007) yielded 
estimated QTEs of, respectively, 0.399, 0.283, and 0.201. 


The RIF regression approach extends naturally to estimation of 
unconditional QTEs in FE models because one can perform FE regression with 
dependent variable the RIF for the qth quantile. This can be implemented 
using the community-contributed command xtrifreg (Borgen 2016). 


25.9.3 Counterfactual distributions 


Chernozhukov, Fernández-Val, and Melly (2013) estimate conditional 
distributions and regressor distributions and combine these to obtain 
counterfactual distributions and features of them, including quantile 
effects, distribution effects, and Gini coefficients. The 
community-contributed cdeco and counterfactual commands (Chernozhukov, 
Fernández-Val, and Melly 2009) implement these methods. 


Key is obtaining an estimate of the c.d.f. of the outcome conditional on 
regressors. A parametric model for the c.d.f. can be used, but then results 
depend on very strong assumptions. The c.d.f. can be estimated 
nonparametrically (see Li and Racine [2007, chap. 6]), but there is then the 
curse of dimensionality. 


Instead, the method() option of the cdeco command offers several other 
ways to estimate the conditional c.d.f. Specifying qr estimates many 
conditional quantiles that are then inverted to obtain the c.d.f. Specifying a 
logit (or probit) model fits a series of models for 
$\Pr(y \le y_q | D, \mathbf{x})$ at various quantiles $y_q$; the default is 
$q = 0.01, 0.02, \ldots, 0.99$. Specifying cox estimates many Cox proportional 
hazards models from which the c.d.f. can be recovered. Specifying loc fits a 
location model, and specifying locsca estimates a location-scale model. 
Inference is based on a bootstrap. 


We obtain conditional QTEs of the effect of suppins on ltotexp at 
quantiles 0.25, 0.5, and 0.75, controlling for totchr, age, white, and 
female. We use method(locsca) because it is much faster computationally 
than, in particular, method(qr). We obtain 


. * Unconditional QTE of suppins on ltotexp using community-contributed 
> command cdeco 
. set seed 10101 

. cdeco ltotexp totchr age white female, group(suppins) method(locsca) 
> nreg(100) reps(400) quantile(0.25(0.25)0.75) 
(bootstrapping ....................) 

Conditional model                          location scale model 
Number of regressions estimated            100 

The variance has been estimated by bootstraping the results 400 times. 

No. of obs. in the reference group         1207 
No. of obs. in the counterfactual group    1748 

Differences between the observable distributions (based on the conditional model) 

             Quantile   Pointwise        Pointwise              Functional 
 Quantile      effect   Std. Err.   [95% Conf. Interval]   [95% Conf. Interval] 
      .25    -.362816    .063507   -.487287   -.238344    -.499034   -.226598 
       .5    -.247377    .054152   -.353514    -.14124     -.36353   -.131224 
      .75    -.137208    .055354   -.245701   -.028716    -.255939   -.018478 

Effects of characteristics 

             Quantile   Pointwise        Pointwise              Functional 
 Quantile      effect   Std. Err.   [95% Conf. Interval]   [95% Conf. Interval] 
      .25     .002096    .028978     -.0547    .058892    -.066256    .070447 
       .5    -.013416    .027461   -.067239    .040406    -.078189    .051357 
      .75    -.023122    .028571    -.07912    .032876    -.090513    .044269 

Effects of coefficients 

             Quantile   Pointwise        Pointwise              Functional 
 Quantile      effect   Std. Err.   [95% Conf. Interval]   [95% Conf. Interval] 
      .25    -.364911    .061113   -.484691   -.245132    -.504837   -.224986 
       .5    -.233961    .051964   -.335809   -.132113     -.35294   -.114982 
      .75    -.114086    .056582   -.224984   -.003188    -.243637    .015464 

Bootstrap inference on the counterfactual quantile processes 

                                                              P-values 
 Null-hypothesis                                     KS-statistic  CMS-statistic 
 Correct specification of the parametric model 0            .0025          .0075 
 Correct specification of the parametric model 1                0              0 
 Differences between the observable distributions 
   No effect: QE(tau)=0 for all taus                            0              0 
   Constant effect: QE(tau)=QE(0.5) for all taus            .0025              0 
   Stochastic dominance: QE(tau)>0 for all taus                 0              0 
   Stochastic dominance: QE(tau)<0 for all taus             .6825          .6825 
 Effects of characteristics 
   No effect: QTE(tau)=0 for all taus                       .5975          .6675 
   Constant effect: QE(tau)=QE(0.5) for all taus            .2725            .27 
   Stochastic dominance: QE(tau)>0 for all taus               .31            .33 
   Stochastic dominance: QE(tau)<0 for all taus               .56            .56 
 Effects of coefficients 
   No effect: QE(tau)=0 for all taus                            0              0 
   Constant effect: QE(tau)=QE(0.5) for all taus                0              0 
   Stochastic dominance: QE(tau)>0 for all taus                 0              0 
   Stochastic dominance: QE(tau)<0 for all taus             .7075          .7075 


The first set of output gives the conditional QTEs that have reverse sign to 
those found in the preceding examples, because the cdeco command sets the 
reference group to suppins=1, but are of similar magnitude. For example, 
the method of Firpo (2007) yielded QTEs of, respectively, 0.399, 0.283, and 
0.201, and the RIF method yielded values of, respectively, 0.399, 0.273, and 
0.162. The output reports both pointwise confidence intervals and uniform 
confidence intervals. The cdeco results do vary numerically, though not 
qualitatively, with the method() option used; method(logit) has the 
attraction of being relatively fast to compute compared with method(qr). 


The next set of tables decomposes the QTEs into the effects of 
characteristics and the effects of coefficients. The effect of variation in 
characteristics across individuals is not great. 


The final table presents a variety of specification tests. The first tests 
reject the underlying location-scale model at significance level 0.05, so other 
methods for estimating the conditional c.d.f. might be used. 


25.9.4 Unconditional QTE with endogenous discrete binary regressor 


Frölich and Melly (2010, 2013) provide an estimator for the case in which 
the binary treatment variable D is endogenous and a binary instrumental 
variable z is available. Several assumptions are made, including that the 
regression controls for all possible confounders. 


The objective function is the same as (25.11) for unconditional QTE with 
an exogenous treatment, except the weights are now given by 


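$$w_i = \frac{z_i - \widehat{p}(\mathbf{x}_i)}{\widehat{p}(\mathbf{x}_i)\{1 - \widehat{p}(\mathbf{x}_i)\}} (2D_i - 1)$$


where $\widehat{p}(\mathbf{x}_i)$ is a prediction of $\Pr(z_i = 1|\mathbf{x}_i)$. 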


This model can be fit by the ivqte command; the command syntax is as 
for conditional QTE with a binary instrument, but the option aai is dropped. 


. * Unconditional QTE of endogenous suppins using ivqte 
. ivqte ltotexp (suppins=marry), dummy(white female) continuous(age totchr) 
> variance trim(0.01) quantile(0.5) 


Unconditional Quantile Treatment Effects under endogeneity 
Estimator suggested in Froelich and Melly (2008) 

Quantile(s):              .5 
Dependent variable:       ltotexp 
Treatment variable:       suppins 
Instrumental variable:    marry 
Control variable(s):      age totchr white female 
Number of observations:   2955 
Proportion of compliers:  .077 

Propensity score estimated by local logit regression with h = infinity and lambda = 1 
Variance estimated using local logit regression with h = infinity and lambda = 1 

     ltotexp | Coefficient   Std. err.      z    P>|z|     [95% conf. interval] 
  Quantile_1 |    .5061746    .8248461    0.61   0.539    -1.110494    2.122843 


At the median, the TE due to supplementary insurance is nearly 51%, but the 
confidence interval is very wide and includes 0. A better instrument is 
needed, as is also clear from the proportion of compliers being only 0.077. 


Repeating this estimation for the other quartiles, we obtain estimated 
unconditional local QTEs of, respectively, 0.087, 0.506, and 1.292 at the 0.25, 
0.5, and 0.75 quantiles. By comparison, the corresponding estimated 
conditional local QTEs with endogeneity were 1.374, 1.179, and 1.032. 
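
The repetition over quartiles described above can be sketched as a loop. This is a minimal illustration that reuses the ivqte call shown earlier and changes only the quantile() option; the loop and the local macro name q are additions for illustration, not part of the original listing.

```stata
* Unconditional QTE of endogenous suppins at the three quartiles
* (same specification as above; only quantile() varies)
foreach q in 0.25 0.5 0.75 {
    display _newline "Quantile = `q'"
    ivqte ltotexp (suppins=marry), dummy(white female)    ///
        continuous(age totchr) variance trim(0.01) quantile(`q')
}
```

Each pass prints one table of the form shown above, so the three point estimates and their standard errors can be read off and compared directly.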


25.10 Additional resources 


The treatment evaluation methods of this chapter are presented in detail in 
Todd (2022); see also Cameron and Trivedi (2005, chap. 25), 
Wooldridge (2010, chap. 21), and Hansen (2022, chaps. 18, 21, and 24). 
The website for Cunningham (2021) includes associated Stata programs and 
examples. 


The Stata ERM commands such as eregress and eprobit estimate TEs in 
linear and leading nonlinear models with treatment that is exogenous or 
endogenous. The Stata et commands such as etregress and eteffects 
cover ET binary treatment modeled by a probit model. For ET, these 
commands require an instrument, require specification of functional forms, 
including the way in which any TE heterogeneity occurs, and can require 
assumptions as strong as joint normality of model errors. For linear models 
with heterogeneous responses, the community-contributed ivtreatreg 
command of Cerulli (2014) deals with specific types of heterogeneity. 


Abadie and Cattaneo (2019) and Huber (2019) provide excellent 
comprehensive surveys that emphasize quasi-experimental methods with 
endogenous treatment. 


This chapter provides only simple examples of the various quasi- 
experimental methods with endogenous treatment. Imbens and 
Rubin (2015) cover LATE, and Huber and Wüthrich (2019) 


The various causal methods of this chapter rely on strong assumptions. 
These can be partly testable using falsification tests, placebo tests, and 
graphical tools that vary with the causal method used. For brevity, these 
have not been emphasized here but should be included in any application. 


The topic of quasi-experimental methods is a very active area of 
research. The methods presented in this chapter are currently being refined 
and extended. 


25.11 Exercises 


1. For the multilevel treatment example of section 25.3.3, perform 
separate analysis for men and women (note that variable h male needs 
to be dropped as a regressor). Does the ATE differ by gender? 
Comment. 

2. In section 25.4.2, the etregress command is used to estimate the TE of 
insurance using three different estimators. In each case, the model 
being fit uses ssiratio as the single instrument and hence is just 
identified. Now, fit an overidentified model with an additional 
instrument by omitting the variable linc from the outcome equation 
for ldrugexp and designating it as a second excluded instrument. Refit 
the overidentified model using all three methods, and compare the 
estimated TEs with those from the just-identified model. 

3. In exercise 2 above, the sample pools male and female subjects. The 
model specification allows for heterogeneity in response up to an 
intercept shift. The estimated TE is the same for both sexes. Consider 
the hypothesis that the TEs differ between male and female subjects. 
Using the specification of section 25.4.2, estimate the TEs for male and 
female groups, and compare them. 

4. The TE estimate from the linear specification used in section 25.4.2 is 
directly available from the regression estimate. Consider the following 
nonlinear-in-variables specification that includes an additional 
interaction variable generated by the product of hi_empunion and 
female. What additional complications of estimation and interpretation 
arise in this case? Which variant of the etregress command would be 
suitable for estimating the TE? Generate the interaction effect, add it to 
the specification, apply your estimation method, and obtain estimates 
of the TE using the margins command. 

5. Emergency room visits are expensive, and many could be avoided 
through better access to regular doctors. Apply the methods of 
section 25.5.4 to outcome variable ervisits. Compare the OLS 
estimate of the TE with the LATE estimate. Has increased access to 
Medicaid had the desired effect? 

6. Repeat the synthetic control example of section 25.6.2 with outcome 
lncigsale, the natural logarithm of cigsale. Note that you should 
also change the control variables in the synth command to 
lncigsale(1988), lncigsale(1980), and lncigsale(1975). 
Compared with the analysis with outcome cigsale, is there a change 
in the states with nonzero weight? Are the postoutcome TEs of 
comparable magnitude when analysis is in logs rather than levels of 
cigarette sales? 

7. Generate data on y, x, and D using the code given at the start of 
section 25.7.2. Apply the rdplot and rdrobust commands, using 
command defaults, and compare your results with those obtained in 
section 25.7.2. 

8. Generate data on y, x, and D using the code given at the start of 
section 25.7.2, with the change that y is generated with error that is 
rnormal(0,10) rather than rnormal(0,10). Obtain the SRD estimate of 
the TE using the rdrobust command, and note that estimate precision 
has been greatly increased. Next, create an FRD design by redefining D 
as D=x+runiform(-10,10)>0. Use commands rdplot, rdrobust, and 
then rdrobust with the option fuzzy(D). Comment on your results. 


Chapter 26 
Spatial regression 


26.1 Introduction 


Spatial regression models are models for observations that are related, with 
the strength of this relation decreasing as the distance between observations 
increases. 


The term “distance” is used quite broadly. The Stata documentation 
focuses on geospatial data for which the distance measure may be physical 
distance or on whether observational units share a common boundary. But 
the Stata spatial regression commands are applicable for any distance 
measure, such as economic distance or social distance or peer group 
membership or network ties such as friendships. 


In the simplest spatial regression models, only error terms are spatially 
correlated. This complication is comparable with one of autocorrelated 
errors or clustered errors. Then the usual estimators can be used, if error 
terms are uncorrelated with regressors, but valid statistical inference 
requires that the standard errors of parameter estimates be corrected to 
control for the spatial correlation in the errors. More efficient estimation, 
such as feasible generalized least squares (FGLS), is possible if we 
additionally specify a particular model for the error correlation. 


A much greater complication arises if the dependent variable for one 
observation depends directly on the values of the dependent variable for 
nearby observations. This is comparable with autoregressive time-series 
models where the dependent variable in the current period depends directly 
on values of the dependent variable in previous periods. The spatial 
relationship is represented by one or more spatial weighting matrices W. 


Spatial regression presents unique complications. The key is forming 
any spatial weighting matrices and specifying the pathway of spatial 
correlation that could be via the dependent variable of nearby individuals, 
regressors of nearby individuals, or errors of nearby individuals. The Stata 
spatial commands and documentation focus on forming weighting matrices 
W from geospatial data that provide longitude and latitude 
coordinates. But the user can also provide his or her own W matrix where 
necessary, such as with peer-effects data and network data. 


26.2 Overview of spatial regression models 


The spatial autoregressive (SAR) in the (conditional) mean model specifies 


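$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \lambda \sum_{j=1}^{N} w_{ij} y_j + u_i$$


where $w_{ii} = 0$, so $y_i$ does not appear on the right-hand side. In matrix 
notation, we have 


$$\mathbf{y} = \lambda \mathbf{W}\mathbf{y} + \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$$


where $\mathbf{W}$ is an $N \times N$ spatial weighting matrix that has diagonal entries 
zero. The term “SAR” arises by analogy to the time-series autoregressive 
model $y_t = \rho y_{t-1} + \mathbf{x}_t'\boldsymbol{\beta} + u_t$, and the additional term “$\mathbf{W}\mathbf{y}$” is called a 
spatial lag (in the dependent variable). 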


The crucial ingredient is specification of the weights $w_{ij}$. In principle, 
these could be a given function of distance that depends in part on unknown 
parameters that are estimated. Instead, the sp commands use as weights 
constants that need to be specified by the researcher. For different types of 
spatial data, there are different commonly used methods for specifying the 
spatial weights. For example, in the standard peer-effects model, it is 
assumed that only peers matter and that the peer effect is the average of 
one’s peers. If person 1 has persons 2, 3, and 4 as his or her only peers, then 
$y_1 = \mathbf{x}_1'\boldsymbol{\beta} + \lambda\{(y_2 + y_3 + y_4)/3\} + u_1$, so $w_{12} = w_{13} = w_{14} = 1/3$ and 
all other $w_{1j} = 0$. 
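
The peer-effects weights above can be sketched in Mata. This is a minimal illustration for a hypothetical four-person network; the assumption that persons 2, 3, and 4 each have person 1 as their only peer is made up for the example, as are the names A and Wpeer.

```stata
mata:
// Hypothetical adjacency matrix: A[i,j] = 1 if j is a peer of i
A = (0, 1, 1, 1 \ 1, 0, 0, 0 \ 1, 0, 0, 0 \ 1, 0, 0, 0)
// Divide each row by its row sum so the spatial lag is the peer average
Wpeer = A :/ rowsum(A)
// Weights for person 1: zero on the diagonal, 1/3 on each of the three peers
Wpeer[1, .]
end
```

Row normalization is what turns the product of Wpeer and y into an average of peers' outcomes rather than a sum.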


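A different form of spatial dependence is solely through the error term. 
A SAR in the error model of order one specifies 
$u_i = \rho \sum_{j=1, j \neq i}^{N} w_{ij} u_j + \varepsilon_i$, so 


$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$$
$$\mathbf{u} = \rho \mathbf{W}\mathbf{u} + \boldsymbol{\varepsilon}$$


where the underlying errors $\varepsilon_i$ are assumed to be independent. 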


The Stata SAR models allow spatial dependence through the conditional 
mean, model errors, and regressors. The SARAR(1,1) model, with one spatial 
lag in y and one spatial lag in u, as well as some spatial dependence 
through the regressors, specifies 


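$$\mathbf{y} = \lambda \mathbf{W}_y \mathbf{y} + \mathbf{X}\boldsymbol{\beta} + \gamma \mathbf{W}_x \mathbf{x}_p + \mathbf{u}$$
$$\mathbf{u} = \rho \mathbf{W}_u \mathbf{u} + \boldsymbol{\varepsilon}$$


where the single regressor $\mathbf{x}_p$ is usually a regressor contained in $\mathbf{X}$. The 
errors $\varepsilon_i$ are assumed to be independent and, in the case of maximum 
likelihood (ML) estimation, independent and identically distributed (i.i.d.) 
normal. Not all three spatial lags need to be included in this model, and 
where more than one of these spatial lags appears, a common spatial 
weighting matrix $\mathbf{W}$ may be used. 


Even more general SARAR(p, q) models may add additional spatial lags. 
For example, we may have two spatial lags in the mean (p = 2), so 


$$\mathbf{y} = \lambda_1 \mathbf{W}_1 \mathbf{y} + \lambda_2 \mathbf{W}_2 \mathbf{y} + \mathbf{X}\boldsymbol{\beta} + \gamma \mathbf{W}_x \mathbf{x}_p + \mathbf{u}$$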


The Stata SAR commands focus on these SARAR models. The spregress 
command therefore covers models that allow for spatial dependence 
through the dependent variable, through the errors, and through the 
regressors. If the explanatory variables are endogenous and instruments are 
available, then the spivregress command can be used. And for data with 
multiple observations per unit, such as panel data, one can use the 
spxtregress command. 


26.3 Geospatial data 


The [SP] Stata Spatial Autoregressive Models Reference Manual uses data on 
the homicide rate in 1990 in counties in the southern United States. We use 
the same data, except analysis is restricted to the 159 counties in the state of 
Georgia. 


26.3.1 Spatial dataset 


The key dataset on homicides in Georgia is 


. * Read in Georgia homicide data and summarize 


. qui use mus226georgia, clear 


. describe _ID _CX _CY cname hrate ln_population poverty 


Variable        Storage   Display    Value 
    name           type    format    label    Variable label 
_ID             int       %12.0g              Spatial-unit ID 
_CX             double    %10.0g              x coordinate of area centroid 
_CY             double    %10.0g              y coordinate of area centroid 
cname           str20     %20s                County name 
hrate           double    %12.10f             Homicide rate per 100,000 
ln_population   double    %12.10f             Log of population size 
poverty         double    %12.10f             Percentage of families below poverty 
                                                line 


. summarize _ID _CX _CY cname hrate ln_population poverty 

    Variable |        Obs        Mean    Std. dev.         Min         Max 
         _ID |        159    2424.365    179.4522         2100        2722 
         _CX |        159   -83.57566    1.039462    -85.50329   -81.15262 
         _CY |        159     32.8082     1.18595     30.70841    34.91453 
       cname |          0 
       hrate |        159    12.39901    6.723074            0    37.55566 
ln_populat~n |        159    9.872061    1.076768     7.557473    13.38311 
     poverty |        159    15.49396    6.097446     1.916347    30.41627 


There are 159 observations, one for each county. The dependent variable of 
interest is hrate, the homicide rate per 100,000 people in 1990. Explanatory 
variables will be the log of population and the poverty rate. The variable _ID 
is the unique code for each county. The variables _CX and _CY give the two- 
dimensional geographic coordinates of the centroid of each county. 


The variables _ID, _CX, and _CY were created by earlier use of the spset 
command, which is detailed in [SP] spset. We have 


. * Attach to correct shapefile using spset, modify, and save 
. spset, modify shpfile(mus226georgia_shp) 

(creating _ID spatial-unit id) 

(creating _CX coordinate) 

(creating _CY coordinate) 


Sp dataset: mus226georgia.dta 
Linked shapefile: mus226georgia_shp.dta 
Data: Cross sectional 
Spatial-unit ID: _ID 
Coordinates: _CX, _CY (planar) 


. save mus226georgia, replace 
file mus226georgia.dta saved 


The coordinates are planar, rather than degrees of latitude and longitude. 
This distinction is explained below in the subsection on measuring distance 
between observations. The dataset is linked to a shapefile, which we 
consider next. 


26.3.2 Geospatial shapefile 


Geospatial analysis can be done using just the preceding dataset, which 
includes the centroids of each county that can be used to obtain distances 
between each county. 


For some geospatial analysis, however, it is beneficial to additionally 
have shapefiles that detail the boundaries of each county. In particular, this 
enables construction of heat maps that indicate the values of a variable in 
each region, and it enables the spatial weighting matrix W to be a contiguity 
matrix for which $w_{ij}$ is positive if observations i and j are in adjoining 
regions. 


Shapefiles in standard format can be obtained from sources on the web. 
These then need to be converted to a Stata format shapefile, using the 
spshape2dta command that links the shapefile and the original data file; for 
details, see [SP] Intro 4. 


The associated Stata format shapefile for the current example is 


. * Read in shapefile with coordinates for Georgia counties and summarize 
. qui use mus226georgia_shp, clear 


. summarize 

    Variable |        Obs        Mean    Std. dev.         Min         Max 
         _ID |      4,737    2430.068    179.2355         2100        2722 
          _X |      4,576   -83.52931    1.063287    -85.60899   -80.89492 
          _Y |      4,576    32.75887    1.188491     30.36106    35.00028 
  rec_header |          0 
 shape_order |      4,737    17.00591    11.12619            1          58 


Here 4,576 sets of coordinates are used to define the boundaries of the 159 
counties. 


For example, for the county with county identifier 2108, we have 


. * List the county boundary coordinates for county 2108 
. list _ID _X _Y shape_order if _ID == 2108, clean 

        _ID           _X          _Y   shape_~r 
190.   2108            .           .          1 
191.   2108   -83.776512   34.788391          2 
192.   2108   -83.813553   34.891724          3 
193.   2108   -83.849831   34.888264          4 
194.   2108   -83.864494   34.900963          5 
195.   2108   -83.907837   34.914978          6 
196.   2108   -83.921082   34.943623          7 
197.   2108   -83.939964   34.961624          8 
198.   2108   -83.937996   34.989391          9 
199.   2108   -83.549416   34.989536         10 
200.   2108   -83.551826   34.944477         11 
201.   2108   -83.590721   34.936024         12 
202.   2108   -83.603844    34.90472         13 
203.   2108   -83.663338   34.870159         14 
204.   2108   -83.654549   34.814987         15 
205.   2108   -83.679329   34.794216         16 
206.   2108   -83.709503   34.780991         17 
207.   2108   -83.733513   34.791134         18 
208.   2108   -83.776512   34.788391         19 


The boundaries of county 2108 are defined using 18 (X, Y) coordinates. 


26.3.3 Heat maps 


The grmap command produces a heat map or choropleth map that presents a 
geographic map with different shadings used to display different values of 
the variable of interest. 


For the homicide rate, we obtain 


. * Provide a heat map for homicide rates in Georgia counties 
. qui use mus226georgia, clear 

. grmap, activate 

grmap already activated 


. grmap hrate, clmethod(custom) clbreaks(0 10 20 40) legend(pos(1)) fcolor(Greys) 


[Map of Georgia counties shaded by homicide rate; legend classes [0, 10], (10, 20], and (20, 40]] 


Figure 26.1. Heat map of homicide rate in various counties 


The resulting map is given in figure 26.1. The darkest shaded area has 
the highest homicide rates, in excess of 20 homicides per 100,000 people. 
For example, the darker area at three o’clock on the map covers counties in 
the Augusta region. There does appear to be some spatial correlation because 
there are several clusters of dark-shaded counties and several clusters of 
lightly shaded counties. 


26.3.4 Geospatial distance 


Geospatial data are often located using latitude, measured in degrees north or 
south of the equator, and longitude, measured in degrees west or east of the 
north-south meridian passing through the British Royal Observatory in 
Greenwich. 


In computing Euclidean distance between individuals, one cannot 
immediately use degrees of latitude and longitude. For example, a movement 
directly east of one degree of longitude at the equator is a movement of 
approximately 69 miles, while at 80 degrees north, it is a movement of only 
12 miles. 
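
The two mileage figures can be checked with a rough calculation. This sketch assumes a spherical earth, so that one degree of longitude spans about 69 miles times the cosine of the latitude; that approximation is added here and is not stated in the text.

```stata
* Approximate miles per degree of longitude at a given latitude
display "At the equator:      " 69*cos(0)
display "At 80 degrees north: " 69*cos(80*_pi/180)
```

The second value is close to 12, in line with the figure quoted above.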


If sp datasets use latitude and longitude, then _CX is longitude and _CY is 
latitude (note the reverse ordering). The current dataset appears to use 
latitude and longitude because, for example, Atlanta, Georgia, has 
coordinates (33.74900N, 84.38800W). However, this is not the case, because 
the earlier spset command stated that planar coordinates are used instead. 
Apparently, the planar coordinates have been scaled to be approximately the 
same as the latitude and longitude coordinates for the Georgia region. 


The coordsys() option of the spset command can set the coordinate 
system to be planar or latitude and longitude. The spdistance command 
computes the Euclidean distance between two observations, with appropriate 
adjustment for the earth’s curvature if latitude and longitude is used instead 
of planar coordinates. 
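
As a sketch of how the coordinate system affects measured distance, the commands below temporarily declare the coordinates to be latitude and longitude, recompute a distance, and then restore the planar setting. Treating _CX and _CY in this dataset as degrees is an assumption made purely for illustration, given that the planar values appear to have been scaled to resemble degrees.

```stata
* Illustration only: treat _CX and _CY as longitude and latitude
spset, modify coordsys(latlong, miles)
spdistance 2100 2103
* Restore the planar coordinate system used in the rest of the chapter
spset, modify coordsys(planar)
```

Under the latitude and longitude setting, the reported distance is in miles rather than planar units.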


For the current example with planar coordinates, we compute the 
distance between counties 2100 and 2103. We have 


. * Compute Euclidean distance between observations using planar coordinates 
. spdistance 2100 2103 
(data currently use planar coordinates) 

        _ID   (x, y) (planar) 
       2100   (-83.40281, 34.87811) 
       2103   (-84.96217, 34.80377) 
   distance   1.5611309 planar units 

. display "Distance = " sqrt((-83.40281-(-84.96217))^2+(34.87811-34.80377)^2) 
Distance = 1.561131 


26.4 The spatial weighting matrix 


Compared with basic regression, the only extra information needed to use 
Stata’s sp regression commands is the spatial weighting matrix W. The 
matrix W can be created elsewhere and stored on an external file or created 
using community-contributed Mata code or created using sp commands. In 
all cases, the weighting matrix is ultimately input to Stata using one of the 
many variants of the spmatrix command. 


26.4.1 Creating a spatial weighting matrix 


Here we create W in the special case of geospatial data by directly using the 
spmatrix create command. This requires a Stata dataset that has been 
spset. The spmatrix create command creates a spatial weighting matrix 
that may be either an inverse-distance matrix or a contiguity matrix that is 
based on sharing borders. 


For an inverse-distance matrix, one uses the spmatrix create 
idistance command. Then the weight $w_{ij}$ is inversely proportional to the 
distance between i and j. The vtruncate(#) option truncates to zero for 
distances greater than #. This command requires coordinates for each 
observation but does not require a shapefile. 


For a contiguity matrix, the default for the spmatrix create 
contiguity command is to set $w_{ij}$ to 1 (before normalization) if i and j 
share a border or a vertex and to set $w_{ij}$ to 0 otherwise. Various options 
instead set $w_{ij}$ to 0 if the only thing shared is a vertex (rather than a border) 
or, more expansively, also set $w_{ij}$ to 1 (before normalization) if i and j are 
neighbors of neighbors. This command requires a shapefile in addition to 
coordinates for each observation. 
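
The two kinds of matrix can be sketched as follows, assuming the spset Georgia dataset and its linked shapefile are in memory. The matrix names Widist and Wrook are hypothetical, and the rook option is used here as the variant that ignores neighbors sharing only a vertex.

```stata
* Inverse-distance matrix built from the centroids _CX and _CY
spmatrix create idistance Widist, replace
spmatrix summarize Widist

* Contiguity matrix counting shared borders only (not shared vertices)
spmatrix create contiguity Wrook, rook replace
spmatrix summarize Wrook

* List the spatial weighting matrices currently in memory
spmatrix dir
```

Comparing the two summaries shows how dense an inverse-distance matrix is relative to a contiguity matrix, which has nonzero weights only for neighbors.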


In either case, the matrix W is normalized, using the normalize() 
option. The default option is normalize(spectral), which rescales every 
entry in the unnormalized spatial matrix by multiplication by a constant so 
that the largest eigenvalue of W equals 1. The normalize(minmax) option 
rescales every entry in the unnormalized spatial matrix by division by a 
constant equal to the smaller of the largest row sum and the largest column 
sum. The normalize(row) option multiplies all entries in each row by a 
constant to ensure that the entries in each row sum to 1, so $\sum_j w_{ij} = 1$. 
The normalize(none) option leaves the matrix unnormalized. 


Because $\lambda \mathbf{W} = (\lambda/a) \times (a\mathbf{W})$, a normalization that multiplies all 
entries in W by a common constant is innocuous; it changes the associated 
scalar parameter only from $\lambda$ to $\lambda/a$. The remaining estimates after 
commands such as spregress are unchanged. In the case of row 
normalization, however, all estimates will change because different 
rescalings have been applied to different rows of W. 


For the current example, a contiguity weighting matrix is obtained, using 
the default spectral normalization. We have 


. * Create & summarize weighting matrix W - contiguity with spectral normalization 
. spmatrix create contiguity W 


. spmatrix summarize W 


Weighting matrix W 

           Type | contiguity 
  Normalization | spectral 
      Dimension | 159 x 159 
Elements 
        minimum |         0 
    minimum > 0 |  .1651123 
           mean |  .0055906 
            max |  .1651123 
Neighbors 
        minimum |         1 
           mean |  5.383648 
        maximum |        10 


The 159 x 159 matrix W has entries that are either 0 or 0.1651123. 
Counties have on average 5.38 neighbors, where a neighboring county is one 
that shares a border or vertex. Counties have between 1 and 10 neighbors, so 
the row sums of W range from 0.1651123 to 1.651123. Further details on 
the matrix W can be obtained by using the spmatrix matafromsp command 
to pass W to Mata and then using relevant Mata matrix commands. 
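
A minimal sketch of that last step follows, assuming the matrix W created above is in memory. The Mata names Wm and idm are hypothetical.

```stata
* Copy the Sp matrix W into Mata: Wm holds the weights, idm the _ID values
spmatrix matafromsp Wm idm = W

* Dimensions of the weighting matrix
mata: rows(Wm), cols(Wm)
* Smallest and largest row sums
mata: min(rowsum(Wm)), max(rowsum(Wm))
* Check whether the matrix is symmetric (1 if yes, 0 if no)
mata: issymmetric(Wm)
```

The row sums should span the range quoted above, because each county has between 1 and 10 neighbors with a common weight.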


At this stage, the user should think seriously about whether this is a 
reasonable weighting matrix. Suppose county i has three neighbors, counties 
m, n, and o. Then, ignoring regressors and errors, the SAR in mean model 
specifies $y_i = \lambda(0.165 \times y_m + 0.165 \times y_n + 0.165 \times y_o)$. By contrast, if 
county j had only the one neighbor, county m, we have $y_j = \lambda(0.165 \times y_m)$. 
The weights impose the restriction that homicides in neighboring counties 
have less impact when there are fewer neighboring counties. A better model 
may specify that we use the average of the homicides over adjoining 
counties. Then $y_i = \delta \times (y_m + y_n + y_o)/3$ and $y_j = \delta \times y_m$. This 
corresponds to applying a row normalization, the normalize(row) option, 
rather than the default spectral normalization to the contiguity matrix. 
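
The row-normalized alternative can be sketched as follows, assuming the spset Georgia dataset and its shapefile are in memory. The names Wrow and Wrowhrate are hypothetical.

```stata
* Contiguity matrix with row normalization: each row sums to 1
spmatrix create contiguity Wrow, normalize(row) replace
spmatrix summarize Wrow

* Spatial lag under row normalization: the mean homicide rate of the
* adjoining counties
spgenerate Wrowhrate = Wrow*hrate
summarize hrate Wrowhrate
```

With this choice, the spatial lag of a county with one neighbor and that of a county with ten neighbors are on the same scale, which is the restriction the averaging model intends.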


The importance of the specification of the spatial weights cannot be 
overemphasized. The weights determine any spatial spillovers. And 
consistent estimation requires correct model specification, including 
specification of the spatial weighting matrix. 


26.4.2 Creating spatial lag variables 


Given a spatial weighting matrix W, spatial lags of variables such as Wy 
and Wx can be constructed using the spgenerate command. 


For example, the spatial lag of the homicide rate is 


. * Create spatial lag Wy 
. spgenerate Whrate = W*hrate 


. summarize hrate Whrate 


    Variable |        Obs        Mean    Std. dev.         Min         Max 
       hrate |        159    12.39901    6.723074            0    37.55566 
      Whrate |        159    11.25052    4.591557     .7547128    27.27849 

. correlate hrate Whrate 
(obs=159) 

             |    hrate   Whrate 
       hrate |   1.0000 
      Whrate |   0.2229   1.0000 


The spatially lagged variable is weakly correlated with the unlagged 
variable. 


26.5 OLS regression and test for spatial correlation 


Ordinary least-squares (OLS) regression of the homicide rate on log 
population and the poverty rate yields 


. * OLS estimation using regress 
. regress hrate ln_population poverty, vce(robust) 


Linear regression                               Number of obs     =        159 
                                                F(2, 156)         =      21.33 
                                                Prob > F          =     0.0000 
                                                R-squared         =     0.2326 
                                                Root MSE          =     5.9273 

              |               Robust 
        hrate | Coefficient  std. err.      t    P>|t|     [95% conf. interval] 
ln_population |   1.614672   .6396782     2.52   0.013     .3511237    2.878221 
      poverty |   .6349861   .1013695     6.26   0.000     .4347522      .83522 
        _cons |  -13.37958   7.453859    -1.79   0.075    -28.10309    1.343937 


A 1% increase in the population is associated with 0.016 more homicides per 
100,000 people, and an increase in the poverty rate by 1 percentage point is 
associated with a substantial 0.63 increase in the homicide rate (the mean 
homicide rate is 12.40). These effects are statistically significant at level 
0.05. 


A standard test for spatial correlation is the Moran I test. The 
distribution of the test statistic is obtained under the null hypothesis that 
$y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$, where $u_i$ are i.i.d. $(0, \sigma^2)$. The test requires that a spatial 
weighting matrix W be specified under the alternative hypothesis. The test 
is asymptotically equivalent to a score test of an SAR(1) in mean model with 
spatial weighting matrix W or of an SAR(1) in error model with spatial 
weighting matrix W, against a model with no spatial dependence. 


The test is implemented using the estat moran command. We have 


. * Moran test for spatial correlation following OLS 
. qui regress hrate ln_population poverty 


. estat moran, errorlag(W) 


Moran test for spatial dependence 
H0: Error terms are i.i.d. 
Errorlags: W 


chi2(1) = 0.59 
Prob > chi2 = 0.4417 


The test statistic has p = 0.44 > 0.05, so we do not reject the null hypothesis 
of no spatial correlation at level 0.05. There is little evidence of spatial 
correlation in these data, at least when we use the current spatial weighting 
matrix W. So we expect the subsequent analysis using spatial regression 
commands with this specification of W to yield results similar to OLS. 
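
Because the conclusion is conditional on the weighting matrix used, one way to probe it is to repeat the test with a different matrix. The sketch below does so with a row-normalized contiguity matrix; the name Wrow is hypothetical, and the matrix W created earlier is assumed to be in memory.

```stata
* Moran test with an alternative, row-normalized weighting matrix
spmatrix create contiguity Wrow, normalize(row) replace
quietly regress hrate ln_population poverty
estat moran, errorlag(Wrow)

* Joint test against spatial dependence through either matrix
estat moran, errorlag(W) errorlag(Wrow)
```

If the tests disagree across weighting matrices, that is a signal to revisit the specification of W before relying on the OLS results.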


26.6 Spatial dependence in the error 


When there is spatial dependence only in the error, a number of methods 
have been proposed to obtain correct standard errors and to obtain more 
efficient estimates. 


26.6.1 Spatial heteroskedastic- and autocorrelation-consistent standard 
errors for OLS 


OLS of $y_i$ on $\mathbf{x}_i$ remains consistent when the only spatial dependence is in the 
error. Spatial heteroskedastic- and autocorrelation-consistent (HAC) standard 
errors provide spatially robust standard errors analogous to, for example, HAC 
standard errors for time-series data. These standard errors are nonparametric 
in that they do not require an explicit model for the correlation in model 
errors. 


In general, the variance matrix of the OLS estimator is 


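$$V(\widehat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-1} \left\{ \sum_{i=1}^{N} \sum_{j=1}^{N} E(\mathbf{x}_i \mathbf{x}_j' u_i u_j) \right\} (\mathbf{X}'\mathbf{X})^{-1}$$


where the only terms in the middle matrix that contribute are the terms for 
which $E(\mathbf{x}_i \mathbf{x}_j' u_i u_j) \neq 0$. In the spatial context, $E(\mathbf{x}_i \mathbf{x}_j' u_i u_j) \rightarrow 0$ as the 
distance between individuals i and j grows. 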


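In the case of a one-dimensional distance measure $d_{ij}$, with 
$E(\mathbf{x}_i \mathbf{x}_j' u_i u_j) = 0$ for $d_{ij} > \delta$, a simple spatial HAC variance estimate is 


$$\widehat{V}(\widehat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-1} \left\{ \sum_{i=1}^{N} \sum_{j=1}^{N} 1(d_{ij} < \delta)\, \mathbf{x}_i \mathbf{x}_j' \widehat{u}_i \widehat{u}_j \right\} (\mathbf{X}'\mathbf{X})^{-1}$$


where it is assumed that $(1/N^2) \sum_i \sum_j 1(d_{ij} < \delta) \rightarrow 0$ as $N \rightarrow \infty$. More 
generally, kernel weights may be used in place of $1(d_{ij} < \delta)$. This estimate 
generalizes straightforwardly to instrumental-variables (IV) and generalized 
method of moments estimation. 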


Conley (1999) proposed this estimator for generalized method of 
moments estimation. He more generally allowed for a multidimensional 
distance measure, such as (X, Y) coordinates for geospatial data. Then data 
are viewed as being arranged on a lattice. Kelejian and Prucha (2007) 
present a similar HAC estimate for OLS and IV estimation, using alternative 
assumptions regarding the stochastic process for spatial correlation. 


We obtain HAC standard errors using the estimate of Conley (1999). In 
the one-dimensional coordinate case with coordinate $C_i$, the distance 
$d_{ij} = |C_i - C_j|$, and Conley uses the kernel weight $1(d_{ij} < \delta) \times (1 - d_{ij}/\delta)$ 
that has declining weights rather than the simpler $1(d_{ij} < \delta)$ that has 
uniform weights. In the two-dimensional case, with distances 
$d1_{ij} = |C1_i - C1_j|$ and $d2_{ij} = |C2_i - C2_j|$, we use the kernel weight 
$1(d1_{ij} < \delta_1) 1(d2_{ij} < \delta_2) \times (1 - d1_{ij}/\delta_1)(1 - d2_{ij}/\delta_2)$. 
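
To make the two-dimensional kernel concrete, the following Mata sketch computes a spatial HAC variance matrix for OLS directly from the formula above. It is an illustration under stated assumptions, not the code behind any community-contributed command: the function name conley_vce and the variable one are hypothetical, no finite-sample adjustment is applied, and its results need not match other implementations exactly.

```stata
mata:
// Hypothetical helper: OLS variance with the product kernel
// 1(d1<delta1)1(d2<delta2)(1-d1/delta1)(1-d2/delta2)
real matrix conley_vce(real colvector y, real matrix X,
                       real colvector c1, real colvector c2,
                       real scalar delta1, real scalar delta2)
{
    real scalar    N, i, j, d1, d2
    real colvector u
    real matrix    XXinv, M

    N     = rows(X)
    XXinv = invsym(cross(X, X))
    u     = y - X*(XXinv*cross(X, y))      // OLS residuals
    M     = J(cols(X), cols(X), 0)         // middle matrix
    for (i=1; i<=N; i++) {
        for (j=1; j<=N; j++) {
            d1 = abs(c1[i] - c1[j])
            d2 = abs(c2[i] - c2[j])
            if (d1 < delta1 & d2 < delta2) {
                M = M + (1 - d1/delta1)*(1 - d2/delta2)*u[i]*u[j]*(X[i,.]'*X[j,.])
            }
        }
    }
    return(XXinv*M*XXinv)
}
end

* Usage with the Georgia data; cutoffs of one standard deviation of _CX and _CY
generate byte one = 1
mata:
y = st_data(., "hrate")
X = st_data(., ("ln_population", "poverty", "one"))
V = conley_vce(y, X, st_data(., "_CX"), st_data(., "_CY"), 1.04, 1.19)
sqrt(diagonal(V))'                         // spatial HAC standard errors
end
```

The double loop makes the role of the kernel explicit: pairs of counties farther apart than either cutoff contribute nothing, and nearer pairs are downweighted linearly in each coordinate.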


The community-contributed x_ols command (Dube 1999) implements 
this method. It requires passing, in turn, the two variables with the X and Y 
coordinates, the distance cutoffs for the two coordinates, the dependent 
variable, the regressors (including an intercept if relevant), and options 
giving the number of regressors (including an intercept if relevant) and the 
number of coordinates. 


. * OLS with Conley spatial HAC standard errors using user addon x_ols 
. generate const = 1 


. generate cutoff1 = 1.04 // One standard deviation of _CX 
. generate cutoff2 = 1.19 // One standard deviation of _CY 
. x_ols _CX _CY cutoff1 cutoff2 hrate ln_population poverty const, xreg(3) coord(2) 


Results for Cross Sectional OLS corrected for Spatial Dependence 


number of observations= 159 
Dependent Variable= hrate 


variable        ols estimates   White s.e.   s.e. corrected for spatial dependence 
ln_population       1.6146721     .5273927   .57420456 
poverty             .63498613    .09313404    .1010521 
const              -13.379577    6.1467955    6.689321 


The cutoffs were chosen to be the standard deviations of, respectively, _CX 
and _CY. The standard errors denoted White s.e. are actually default 
nonrobust standard errors based on $s^2(\mathbf{X}'\mathbf{X})^{-1}$. Compared with the 
heteroskedastic-robust standard errors given in earlier output, the spatial 
HAC standard errors are slightly smaller. 


An alternative HAC estimate for this geospatial data example would use a 
single distance measure $d_{ij}$. This could, for example, be the Euclidean 
distance $d_{ij} = \sqrt{(C1_i - C1_j)^2 + (C2_i - C2_j)^2}$. The x_ols command does 
not directly cover this case because it takes as inputs N coordinates such as 
$(C1_i, C2_i)$, $i = 1, \ldots, N$, rather than $N^2$ distances $d_{ij}$, $i = 1, \ldots, N$, 
$j = 1, \ldots, N$. 


26.6.2 FGLS estimation 


More efficient FGLS estimation is possible if a parametric model for 
$\mathrm{Cov}(u_i, u_j|\mathbf{X})$ is specified. 


The SAR in error model fit by the spregress command (see section 26.7) 
is one such model. Here we briefly mention some other models. 


The negative exponential distance decay model specifies 
$E(\mathbf{u}\mathbf{u}') = \sigma^2 (\mathbf{I} + \delta \mathbf{W})$, where $w_{ij} = \exp(-\gamma d_{ij})$. 


The common shocks model with $m$ common shocks specifies
$u = Hv + \varepsilon$, where $v$ is $m \times 1$, $H$ is an $N \times m$
weighting matrix, and the underlying errors $v_j$ and $\varepsilon_i$ are
i.i.d. Then $E(uu') = \sigma_v^2HH' + \sigma_\varepsilon^2I$, and estimation
is straightforward because a closed-form solution exists for
$(\sigma_v^2HH' + \sigma_\varepsilon^2I)^{-1}$.
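
One way to see why the inverse has a closed form is the Woodbury matrix
identity. The expression below is a sketch, not taken from the text; it
assumes that $\sigma_v^2 > 0$ and $\sigma_\varepsilon^2 > 0$ denote the
variances of the common shocks and of the idiosyncratic errors, respectively:

$$(\sigma_v^2HH' + \sigma_\varepsilon^2I_N)^{-1} = \frac{1}{\sigma_\varepsilon^2}\left\{I_N - H\left(H'H + \frac{\sigma_\varepsilon^2}{\sigma_v^2}I_m\right)^{-1}H'\right\}$$

Only an $m \times m$ matrix is inverted on the right-hand side, which is
inexpensive when the number of common shocks $m$ is much smaller than $N$.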


A factor model assumes observations aggregate to a super region $g$
(group) with common shocks for observations in region $g$. Then
$y_{ig} = x_{ig}'\beta + u_{ig}$, $u_{ig} = \delta_if_g + \varepsilon_{ig}$,
$\delta_i$ is a factor loading, and $f_g$ is a factor.
This can be viewed as a special case of the common shocks model.


26.7 Spatial autocorrelation regression models 


If spatial correlation is only in the error term, then OLS is consistent, and the 
only challenges are to obtain standard errors that control for spatial 
correlation or to obtain FGLS estimates that are more efficient than OLS. 


If instead spatial correlation appears as a spatial lag in the dependent 
variable, then OLS is inconsistent, so alternative estimation methods are 
needed. 


For example, suppose the first two observations are dependent on each
other, with 1) $y_1 = \lambda y_2 + u_1$ and 2) $y_2 = \lambda y_1 + u_2$.
Then from 2) $y_2$ depends on $y_1$, which from 1) depends on $u_1$. It
follows that $y_2$ is correlated with $u_1$, so in 1) the explanatory
variable $y_2$ is correlated with the error $u_1$, leading to OLS being
inconsistent. This problem arises even if the errors $u_1$ and $u_2$ are
independent.
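
The two-observation argument can be checked by simulation. The following is
a minimal sketch, not part of the original analysis: each row is one
independent two-observation system, the value 0.5 for the spatial lag
coefficient is hypothetical, and the errors are drawn as independent
standard normals.

```stata
* Simulate many independent two-observation systems (one per row)
clear
set seed 10101
set obs 10000
* Hypothetical value of the spatial lag coefficient
scalar lambda = 0.5
generate u1 = rnormal()
generate u2 = rnormal()
* Reduced form from solving y1 = lambda*y2 + u1 and y2 = lambda*y1 + u2
generate y1 = (u1 + lambda*u2)/(1 - lambda^2)
generate y2 = (lambda*u1 + u2)/(1 - lambda^2)
* The regressor y2 is correlated with the error u1 of the first equation
correlate y2 u1
* OLS of y1 on y2 does not recover lambda
regress y1 y2, noconstant
display "probability limit of OLS slope = " 2*lambda/(1 + lambda^2)
```

Under these assumptions, the OLS slope converges to
$2\lambda/(1 + \lambda^2)$ rather than to $\lambda$, so the estimate should
be close to the displayed limit and not to 0.5.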


Consistent estimates can be obtained by IV estimation, using values of
regressors for neighbors as instruments, or by ML estimation under the
stronger assumption of i.i.d. errors. The spregress command implements
these approaches for various SAR models.


26.7.1 The spregress command 


The spregress command has syntax 


spregress depvar [indepvars] [if] [in], gs2sls [gs2sls_options]
spregress depvar [indepvars] [if] [in], ml [ml_options]


The most important options determine which of two possible estimation
procedures is used. The default gs2sls option provides generalized spatial
two-stage least-squares (GS2SLS) estimates under the assumption that errors
are independent, albeit potentially heteroskedastic if the additional
heteroskedastic option is used. The alternative ml option provides ML
estimates under the assumption that errors are i.i.d. normal. Note that,
unlike the usual OLS and two-stage least-squares (2SLS) commands, the spatial
estimators become inconsistent if the preceding assumptions on the errors do
not hold, and the assumptions for the ml option are especially strong.


Other options specify the nature of the spatial correlation. The
dvarlag(W) option includes $Wy$, a spatial lag in the dependent variable, as
a regressor. The errorlag(W) option includes a spatial lag $Wu$ in the error,
so $u = \rho Wu + \varepsilon$. The ivarlag(W:varlist) option includes
$WX_p$, spatial lags in some or all of the explanatory variables, as
regressors.


The gs2sls option permits more than one spatial weighting matrix to be
specified for each of dvarlag(), errorlag(), and ivarlag(). For example, the
SAR(2) in mean model includes $W_1y$ and $W_2y$ as regressors. The ml option
permits more than one spatial matrix to be specified for ivarlag() only.


The spregress command without any of the options dvarlag(),
errorlag(), and ivarlag() yields OLS. We use option heteroskedastic to
obtain heteroskedastic-robust standard errors.


. * OLS estimation using spregress 

. spregress hrate ln_population poverty, gs2sls heteroskedastic 
(159 observations) 
(159 observations (places) used) 


Spatial autoregressive model Number of obs = 159 
GS2SLS estimates Wald chi2(2) = 43.48 
Prob > chi2 = 0.0000 

Pseudo R2 = 0.2326 

        hrate   Coefficient   Std. err.       z    P>|z|     [95% conf. interval]
ln_population      1.614672    .6336148    2.55    0.011      .37281    2.856534
      poverty      .6349861    .1004086    6.32    0.000    .4381888    .8317834
        _cons     -13.37958    7.383205   -1.81    0.070   -27.85039    1.091237


The coefficient estimates are the same as those obtained using the regress
command. The standard errors are $\sqrt{(N - K)/N}$ times those obtained
using the regress, vce(robust) command, due to slightly different
degrees-of-freedom adjustment.
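
As a worked check of the size of this adjustment, assume that $K = 3$ counts
the two slope coefficients plus the intercept and take $N = 159$ from the
output:

$$\sqrt{(N - K)/N} = \sqrt{156/159} \approx 0.9905$$

so under this assumption the spregress standard errors are roughly 1% smaller
than those from regress with vce(robust).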


26.7.2 SAR(1) in mean model 


The SAR(1) in mean model specifies 


$$y = \lambda Wy + X\beta + u$$


where it is assumed that $E(u|X) = 0$ and $u_i$ are independent over $i$.


This model can be rewritten as $(I - \lambda W)y = X\beta + u$, and solving
for $y$ yields the reduced form

$$y = (I - \lambda W)^{-1}X\beta + (I - \lambda W)^{-1}u$$

Premultiplying the reduced-form expression for $y$ by $W$ yields

$$Wy = W(I - \lambda W)^{-1}X\beta + W(I - \lambda W)^{-1}u$$

Clearly, $Wy$ is correlated with $u$, and hence OLS of $y$ on $Wy$ and $X$ in
the original model will lead to inconsistent estimation even if the errors
$u_i$ are i.i.d.


GS2SLS estimation 


Consistent estimates can be obtained by IV using instruments that are readily
obtained; the instruments need not be variables external to the model.


Good instruments are ones that are highly correlated with $Wy$ but are
uncorrelated with $u$. If errors are i.i.d., the optimal instrument is
$E(Wy|X)$. Because $E(u|X) = 0$, the preceding results imply

$$E(Wy|X) = W(I - \lambda W)^{-1}X\beta = WX\beta + \lambda W^2X\beta + \lambda^2W^3X\beta + \cdots$$

provided $0 < \lambda < 1$. In principle, we could use
$W(I - \lambda W)^{-1}X\beta$ as the instrument, but this requires inversion
of an $N \times N$ matrix and replacing $\lambda$ and $\beta$ by consistent
estimates.


Instead, we use as instruments for $Wy$ a subset of $WX, W^2X, \dots$. The
default for spregress is IV regression of $y$ on $X$ and $Wy$ with
instruments the linearly independent columns of $[X, WX, W^2X]$. Higher
powers such as $W^3X$ can be added using the impower(#) option.


Estimation is by 2SLS because there is more than one instrument, and
inference is based on standard errors that are robust to heteroskedasticity.
This estimator is implemented using the gs2sls option of the spregress
command. The term "GS2SLS" is used to cover the generalization detailed below
for models that have a spatial lag in the error.
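
The instrumenting idea can be illustrated with a small Mata simulation. This
is a minimal sketch under stated assumptions, not the spregress
implementation: the weighting matrix is a hypothetical ring in which each
unit has two neighbors with weight 0.5, the parameter values are invented for
illustration, and only the nonconstant regressor is lagged because $W$ times
a constant reproduces the constant when $W$ is row-normalized.

```stata
mata:
rseed(10101)
N = 400
// Hypothetical ring of N units: two adjacent neighbors, each with weight 0.5
W = J(N, N, 0)
for (i = 1; i <= N; i++) {
    W[i, mod(i, N) + 1] = 0.5
    W[i, mod(i - 2 + N, N) + 1] = 0.5
}
lam = 0.4                       // hypothetical spatial lag coefficient
b = (1 \ 2)                     // hypothetical intercept and slope
x = rnormal(N, 1, 0, 1)
X = (J(N, 1, 1), x)
u = rnormal(N, 1, 0, 1)
// Generate y from the reduced form of the SAR(1) in mean model
y = luinv(I(N) - lam*W)*(X*b + u)
// Right-hand-side variables: spatial lag of y, then X
Z = (W*y, X)
// Instruments: X plus first and second spatial lags of x
H = (X, W*x, W*W*x)
// 2SLS: project Z on H, then regress y on the projection
Zhat = H*invsym(cross(H, H))*cross(H, Z)
d_2sls = invsym(cross(Zhat, Zhat))*cross(Zhat, y)
// OLS for comparison; its first element is generally inconsistent
d_ols = invsym(cross(Z, Z))*cross(Z, y)
// Columns: 2SLS and OLS; rows: lag coefficient, intercept, slope
(d_2sls, d_ols)
end
```

Comparing the first row of the two columns gives a sense of how far OLS can
be from the 2SLS estimate of the spatial lag coefficient in a given sample.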


For the homicide rate example, we obtain 


. * SAR(1) in mean: Errors independent 

. spregress hrate ln_population poverty, gs2sls dvarlag(W) heteroskedastic
(159 observations) 
(159 observations (places) used) 
(weighting matrix defines 159 places) 


Spatial autoregressive model Number of obs = 159 
GS2SLS estimates Wald chi2(3) = 44.27 
Prob > chi2 = 0.0000 
Pseudo R2 = 0.2329 
        hrate | Coefficient   Std. err.       z    P>|z|     [95% conf. interval]
hrate         |
ln_population |    1.581835    .6481573    2.44    0.015    .3114704      2.8522
      poverty |    .6226799    .1087889    5.72    0.000    .4094577    .8359022
        _cons |   -13.29208    7.410359   -1.79    0.073   -27.81612    1.231955
W             |
        hrate |    .0379842    .1123303    0.34    0.735   -.1821791    .2581475
Wald test of spatial terms:          chi2(1) = 0.11       Prob > chi2 = 0.7353


The spatial lag coefficient estimate $\hat\lambda = 0.038$ is small and
statistically insignificant, and the estimated coefficients of the regressors
are quite close to the original OLS estimates.


To illustrate that GS2SLS in the case of an SAR(1) in mean model is simply
2SLS with appropriate choice of instruments, we consider the case where the
instrument set goes out to only one lag. Then the instruments for $Wy$ and
$X$ are simply $X$ and $WX$. We create $WX$ regressor by regressor using the
spgenerate command ($Wy$ was created earlier) and estimate using the
ivregress command. Results are compared with the spregress command
with option impower(1). We obtain


. * SAR(1) in mean: Errors independent estimated using ivregress
. generate one = 1

. spgenerate Wone = W*one
. spgenerate Wln_population = W*ln_population
. spgenerate Wpoverty = W*poverty

. qui ivregress 2sls hrate ln_population poverty
>      (Whrate = Wone Wln_population Wpoverty), vce(robust)

. estimates store IVREG1

. qui spregress hrate ln_population poverty, gs2sls dvarlag(W)
>      heteroskedastic impower(1)

. estimates store SPREG1
. estimates table IVREG1 SPREG1, b(%9.4f) se eq(1)


    Variable |   IVREG1      SPREG1
#1           |
      Whrate |    0.0314
             |    0.1116
ln_populat~n |    1.5875      1.5875
             |    0.6440      0.6440
     poverty |    0.6248      0.6248
             |    0.1082      0.1082
       _cons |  -13.3072    -13.3072
             |    7.3992      7.3992
W            |
       hrate |                0.0314
             |                0.1116
                        Legend: b/se

The results from ivregress and spregress are identical. 


ML estimation 


An alternative estimator for the SAR(1) in mean model is the maximum 
likelihood estimator (MLE). 


Under the assumption that the errors are i.i.d. $N(0, \sigma^2)$, the log
likelihood is

$$\ln L(\beta, \lambda, \sigma^2) = \ln|I - \lambda W| - \frac{N}{2}\ln 2\pi - \frac{N}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(y - \lambda Wy - X\beta)'(y - \lambda Wy - X\beta)$$

Estimation of $\lambda$ is computationally challenging for large $N$ because
$|I - \lambda W|$ is the determinant of an $N \times N$ matrix. Given
$\hat\lambda_{ML}$, the ML estimate of $\beta$ is easily obtained by OLS
regression of $y - \hat\lambda_{ML}Wy$ on $X$.


When errors are i.i.d. $N(0, \sigma^2)$, the usual ML results apply, provided
$W$ satisfies certain restrictions. When errors are i.i.d. but not
necessarily normal, then the MLE may still be consistent under some
assumptions regarding the sparsity of the spatial weighting matrix; see
Lee (2004). When errors are not i.i.d., even simply heteroskedastic, then the
MLE will generally be inconsistent.
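
The role of the determinant term and of the OLS step for given spatial lag
coefficient can be seen in a grid search over a concentrated log likelihood.
The Mata code below is a minimal sketch under stated assumptions (a
hypothetical ring weighting matrix, invented parameter values, i.i.d. normal
errors); it is not the algorithm used by spregress, which follows its grid
search with numerical optimization.

```stata
mata:
rseed(10101)
N = 200
// Hypothetical ring of N units: two adjacent neighbors, each with weight 0.5
W = J(N, N, 0)
for (i = 1; i <= N; i++) {
    W[i, mod(i, N) + 1] = 0.5
    W[i, mod(i - 2 + N, N) + 1] = 0.5
}
X = (J(N, 1, 1), rnormal(N, 1, 0, 1))
// Data generated with hypothetical values: lag coefficient 0.4, beta = (1, 2)
y = luinv(I(N) - 0.4*W)*(X*(1 \ 2) + rnormal(N, 1, 0, 1))
Wy = W*y
grid = range(-0.9, 0.9, 0.05)
lnL = J(rows(grid), 1, .)
best = 1
for (g = 1; g <= rows(grid); g++) {
    lam = grid[g]
    // For given lam, beta solves OLS of (y - lam*Wy) on X
    ytil = y - lam*Wy
    b = invsym(cross(X, X))*cross(X, ytil)
    ehat = ytil - X*b
    s2 = cross(ehat, ehat)/N
    // Concentrated log likelihood after substituting b and s2
    lnL[g] = ln(det(I(N) - lam*W)) - (N/2)*ln(2*pi()) - (N/2)*ln(s2) - N/2
    if (lnL[g] > lnL[best]) best = g
}
// Grid value that maximizes the concentrated log likelihood
grid[best]
end
```

The determinant is recomputed at every grid point, which is the step that
becomes expensive as $N$ grows.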


The MLE can be computed using the ml option. We have


. * SAR(1) in mean: ml with errors iid normal
. spregress hrate ln_population poverty, ml dvarlag(W)
  (159 observations)
  (159 observations (places) used)
  (weighting matrix defines 159 places)
Performing grid search ... finished
Optimizing concentrated log likelihood:
Iteration 0:   log likelihood = -506.88358
Iteration 1:   log likelihood = -506.81762
Iteration 2:   log likelihood = -506.81762
Optimizing unconcentrated log likelihood:
Iteration 0:   log likelihood = -506.81762
Iteration 1:   log likelihood = -506.81762  (backed up)

Spatial autoregressive model                    Number of obs = 159
Maximum likelihood estimates                    Wald chi2(3)  = 48.82
                                                Prob > chi2   = 0.0000
Log likelihood = -506.81762                     Pseudo R2     = 0.2328

        hrate | Coefficient   Std. err.       z    P>|z|     [95% conf. interval]
hrate         |
ln_population |    1.558282     .528013    2.95    0.003    .5233959    2.593169
      poverty |     .613853    .0971825    6.32    0.000    .4233788    .8043272
        _cons |   -13.22932    6.081798   -2.18    0.030   -25.14943   -1.309218
W             |
        hrate |    .0652294    .0958409    0.68    0.496   -.1226154    .2530741
 var(e.hrate) |    34.34834    3.852857                     27.56934    42.79422
Wald test of spatial terms:          chi2(1) = 0.46       Prob > chi2 = 0.4961


The coefficients of explanatory variables are very similar to those from
GS2SLS estimation, while the spatial lag coefficient estimate has increased
from 0.038 to 0.065 but is still small and statistically insignificant. The ML
standard errors are approximately 20% smaller.


26.7.3 Prediction and marginal effects 


In the AR(1) time series model $y_t = \rho y_{t-1} + \beta x_t + u_t$,
distinction is made between the initial impact of a one-unit change in $x_t$,
which equals $\beta$, and changes in subsequent periods of
$\rho\beta, \rho^2\beta, \dots$ that lead to total impact of
$\beta/(1 - \rho)$ if $|\rho| < 1$.
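
The total impact is the sum of a geometric series. As a short worked
derivation, with hypothetical values $\rho = 0.5$ and $\beta = 2$ chosen only
for illustration:

$$\beta + \rho\beta + \rho^2\beta + \cdots = \beta\sum_{s=0}^{\infty}\rho^s = \frac{\beta}{1 - \rho}, \qquad |\rho| < 1$$

With these values the initial impact is 2 and the total impact is
$2/(1 - 0.5) = 4$.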


A qualitatively similar distinction is made in spatial models, where a
change in the value of a regressor for any given observation spills over to
potentially affect the value of the dependent variable for all observations,
not just the current observation.


Prediction 


The reduced form in the SAR in mean model is

$$E(y|X) = (I - \lambda W)^{-1}X\beta$$

This is called the reduced-form mean for $y$.


Defining the $N \times N$ matrix $S = (I - \lambda W)^{-1}$, it follows that
$E(y_i|X) = \sum_{j=1}^N S_{ij}x_j'\beta$. The reduced-form mean for the
$i$th observation therefore depends directly on its own regressor value $x_i$
through $S_{ii}x_i'\beta$ and indirectly through regressors for all other
observations through $\sum_{j \neq i} S_{ij}x_j'\beta$. Stacking all
observations, the direct mean for $y$ is $\mathrm{diag}(S)X\beta$, and the
indirect mean for $y$ is $\{S - \mathrm{diag}(S)\}X\beta$.


The predict postestimation command provides sample analogues of
these means. Define $\hat S = (I - \hat\lambda W)^{-1}$. The default rform
option gives the reduced-form prediction $\hat y = \hat SX\hat\beta$. The
direct option gives the direct prediction
$\hat y = \mathrm{diag}(\hat S)X\hat\beta$. The indirect option gives the
indirect prediction $\hat y = \hat SX\hat\beta - \mathrm{diag}(\hat S)X\hat\beta$.
Other options of the predict command yield structural-form predictions.
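
The decomposition of the reduced-form prediction can be traced by hand in
Mata. The following is a minimal sketch with hypothetical inputs: four units
on a line with a row-normalized contiguity matrix, and invented values for
the spatial lag coefficient and regression coefficients. It mirrors the
formulas above rather than the internal code of predict.

```stata
mata:
// Hypothetical example: 4 units on a line, row-normalized contiguity weights
W = (0, 1, 0, 0 \ 0.5, 0, 0.5, 0 \ 0, 0.5, 0, 0.5 \ 0, 0, 1, 0)
lam = 0.4                        // hypothetical spatial lag coefficient
b = (1 \ 2)                      // hypothetical intercept and slope
X = (J(4, 1, 1), (1 \ 2 \ 3 \ 4))
S = luinv(I(4) - lam*W)
// Reduced-form prediction and its direct and indirect components
rform = S*X*b
direct = diag(diagonal(S))*X*b
indirect = (S - diag(diagonal(S)))*X*b
// Columns: reduced form, direct, indirect
(rform, direct, indirect)
end
```

By construction, the direct and indirect columns sum to the reduced-form
column for every observation.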


The pseudo-$R^2$ measure reported in the output from sp estimation
commands is the squared correlation between $y$ and $\hat y$, where $\hat y$
is the reduced-form mean prediction.


The margins postestimation command can be applied to these various
predictions. For marginal effects based on reduced-form predictions, it is
easier to use the postestimation command estat impact.


Direct, indirect, and total impacts 


The reduced-form mean for the $i$th observation is
$E(y_i|X) = \sum_{j=1}^N S_{ij}x_j'\beta$. It follows that the total effect
on the reduced-form mean for the $i$th observation of a change in the
regressors for the $j$th observation is
$\partial E(y_i|X)/\partial x_j = S_{ij}\beta$. For the $k$th regressor, we
have

$$\partial E(y_i|X)/\partial x_{jk} = S_{ij}\beta_k$$


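The total impact of a change in the $k$th regressor on the reduced-form
mean is computed by averaging $\partial E(y_i|X)/\partial x_{jk}$ over
dependent variables $y_i$ that differ over observations as well as over
regressors $x_{jk}$ that differ over observations:

$$\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^N \frac{\partial E(y_i|X)}{\partial x_{jk}} = \frac{1}{N}\sum_{i=1}^N\sum_{j=1}^N S_{ij}\beta_k$$


The total impact is then decomposed into direct and indirect effects,
similar to the earlier decomposition of the prediction. Then the direct effect
is

$$\frac{1}{N}\sum_{i=1}^N \frac{\partial E(y_i|X)}{\partial x_{ik}} = \frac{1}{N}\sum_{i=1}^N S_{ii}\beta_k$$

This estimates the average own effect of a change in the $k$th regressor on
the conditional mean of $y$ given $x$. The indirect effect is

$$\frac{1}{N}\sum_{i=1}^N\sum_{j \neq i} \frac{\partial E(y_i|X)}{\partial x_{jk}} = \frac{1}{N}\sum_{i=1}^N\sum_{j \neq i} S_{ij}\beta_k$$

which estimates the average cross effect or spillover effect of a change in
the $k$th regressor on the conditional mean of $y$ given $x$.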


The estat impact postestimation command calculates these three 
quantities by evaluating at the estimated parameters. 
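
The three average impacts can be computed directly from the matrix
$(I - \lambda W)^{-1}$. The Mata code below is a minimal sketch with
hypothetical inputs (four units on a line, row-normalized weights, invented
coefficient values); it follows the averaging formulas and is not the
estat impact implementation, which also reports delta-method standard errors.

```stata
mata:
// Hypothetical example: 4 units on a line, row-normalized contiguity weights
W = (0, 1, 0, 0 \ 0.5, 0, 0.5, 0 \ 0, 0.5, 0, 0.5 \ 0, 0, 1, 0)
lam = 0.4                        // hypothetical spatial lag coefficient
bk = 2                           // hypothetical coefficient of regressor k
N = rows(W)
S = luinv(I(N) - lam*W)
// Average over all (i, j) pairs, over own effects, and over cross effects
total = bk*sum(S)/N
direct = bk*trace(S)/N
indirect = total - direct
(direct, indirect, total)
// With row-normalized W, each row of S sums to 1/(1 - lam)
bk/(1 - lam)
end
```

In this sketch the total impact equals the coefficient divided by one minus
the spatial lag coefficient because $W$ is row-normalized; that equality need
not hold under other normalizations of $W$.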


We have 


. * Impact multipliers following SAR(1) in mean 
. qui spregress hrate ln_population poverty, gs2sls dvarlag(W) heteroskedastic 


. estat impact 
progress : 50% 100% 


Average impacts                                  Number of obs = 159

              |            Delta-Method
              |      dy/dx   std. err.       z    P>|z|     [95% conf. interval]
direct        |
ln_population |   1.582175    .6478725    2.44    0.015    .3123677    2.851981
      poverty |   .6228135    .1085008    5.74    0.000    .4101558    .8354711
indirect      |
ln_population |   .0550951    .1648445    0.33    0.738   -.2679942    .3781844
      poverty |   .0216879    .0647816    0.33    0.738   -.1052818    .1486576
total         |
ln_population |    1.63727    .6563342    2.49    0.013    .3508783    2.923661
      poverty |   .6445013    .1058766    6.09    0.000     .436987    .8520157


The indirect effects in this example are quite small, and the direct and total
effects are similar to each other and to the original parameter estimates. This
is due to the small spatial lag coefficient because
$\hat S = (I - \hat\lambda W)^{-1} \approx I$ when $\hat\lambda \approx 0$, so
$\hat S_{ii} \approx 1$ and $\hat S_{ij} \approx 0$ for $j \neq i$.


26.7.4 SAR(1) in error model 


Another commonly used model is the SAR(1) in error model that specifies 


$$y = X\beta + u$$
$$u = \rho Wu + \varepsilon$$


where $\varepsilon_i$ is independently distributed though may be
heteroskedastic.


Note that in this model, OLS of $y$ on $X$ yields consistent estimates,
though the usual standard errors are inconsistent and more efficient FGLS
estimation is possible.


FG2SLS estimation 


The spatial lag in error model can be rewritten as
$(I - \rho W)u = \varepsilon$, and hence

$$(I - \rho W)y = (I - \rho W)X\beta + \varepsilon$$

It follows that if $\varepsilon_i$ is i.i.d., then the FGLS estimator
$\hat\beta_{FGLS}$ can be obtained by OLS estimation of $(I - \hat\rho W)y$ on
$(I - \hat\rho W)X$, where $\hat\rho$ is a consistent estimate of $\rho$.


The spregress command obtains an initial consistent estimate of $\rho$ by
solving for $\rho$ in the equations $\hat\varepsilon'W\hat\varepsilon = 0$ and
$\hat\varepsilon'\{W'W - \mathrm{diag}(W'W)\}\hat\varepsilon = 0$, where
$\hat\varepsilon = (I - \rho W)\hat u$ and $\hat u$ are OLS residuals. Given
$\hat\beta_{FGLS}$ computed using this initial estimate of $\rho$, the
spregress command then obtains a more efficient second-stage estimate of
$\rho$, one that varies according to whether the assumptions are relaxed to
allow the underlying errors $\varepsilon_i$ to be heteroskedastic.


We have 


. * SAR(1) in error: errors independent
. spregress hrate ln_population poverty, gs2sls errorlag(W) heteroskedastic
  (159 observations)
  (159 observations (places) used)
  (weighting matrix defines 159 places)

Estimating rho using 2SLS residuals:
initial:       GMM criterion =  1.5942068
alternative:   GMM criterion =  16.262889
rescale:       GMM criterion =  .04927426
Iteration 0:   GMM criterion =  .04927426
Iteration 1:   GMM criterion =  .01480645
Iteration 2:   GMM criterion =  .01480552
Iteration 3:   GMM criterion =  .01480552

Estimating rho using GS2SLS residuals:
Iteration 0:   GMM criterion =  .00033792
Iteration 1:   GMM criterion =  .00032573
Iteration 2:   GMM criterion =  .00032573

Spatial autoregressive model                    Number of obs = 159
GS2SLS estimates                                Wald chi2(2)  = 40.67
                                                Prob > chi2   = 0.0000
                                                Pseudo R2     = 0.2324

        hrate | Coefficient   Std. err.       z    P>|z|     [95% conf. interval]
hrate         |
ln_population |     1.54974    .6363541    2.44    0.015    .3025091    2.796971
      poverty |     .638827    .1038867    6.15    0.000    .4352129    .8424411
        _cons |   -12.81266    7.419306   -1.73    0.084   -27.35423    1.728912
W             |
      e.hrate |    .1146457    .1404569    0.82    0.414   -.1606448    .3899361
Wald test of spatial terms:          chi2(1) = 0.67       Prob > chi2 = 0.4144


The spatial lag coefficient is small and statistically insignificant. Thus, the
regression coefficients and standard errors are similar to those for the OLS
estimator.


26.7.5 SARAR(1,1) model 


The more general SARAR(1,1) model allows for spatial dependence through
the dependent variable, errors, and regressors. Then,

$$y = \lambda Wy + X\beta + \gamma VX_p + u = Z\delta + u$$
$$u = \rho Mu + \varepsilon$$

where $\varepsilon_i$ is independent. Here $X_p$ denotes $p$ regressors that
need not necessarily be in $X$. The spatial weighting matrices $W$, $V$, and
$M$ need not be distinct. For example, we may have $W = V = M$.


With this terminology, the SAR(1) in mean model is an SARAR(1,0) model, 
and the SAR(1) in error model is an SARAR(0,1) model. 


Consistent estimates can be obtained by IV or 2SLS estimation. Combine
the exogenous regressors and the spatial lag into $X^* = [X, VX_p]$. Then the
instruments for $X^*$ and $Wy$ are the linearly independent columns of
$H = [X^*, WX^*, \dots, W^qX^*]$. The spregress default is $q = 2$. Then
$\hat\delta_{2SLS} = (\hat Z'Z)^{-1}\hat Z'y$, where
$\hat Z = H(H'H)^{-1}H'Z$.


This 2SLS estimator is consistent but is inefficient unless errors are
i.i.d. and there is no spatial error component. The spatial lag in error
implies $(I - \rho M)u = \varepsilon$, and hence
$(I - \rho M)y = (I - \rho M)Z\delta + \varepsilon$. Given a consistent
estimate $\hat\rho$ of $\rho$, the GS2SLS estimator is
$\hat\delta_{GS2SLS} = (\hat Z^{*\prime}Z^*)^{-1}\hat Z^{*\prime}y^*$, where
$y^* = (I - \hat\rho M)y$, $Z^* = (I - \hat\rho M)Z$,
$\hat Z^* = H^*(H^{*\prime}H^*)^{-1}H^{*\prime}Z^*$, and $H^*$ is composed of
the linearly independent columns of $[H, MH]$.


As an example, we have spatial lags in both the mean and the error and 
spatial lags in the two regressors. In all cases, the same spatial weighting 
matrix is used. We have 


. * SARAR(1,1) model with additionally SAR in X: gs2sls
. spregress hrate ln_population poverty, gs2sls dvarlag(W) errorlag(W)
>      ivarlag(W:ln_population poverty) heteroskedastic
  (159 observations)
  (159 observations (places) used)
  (weighting matrix defines 159 places)

Estimating rho using 2SLS residuals:
initial:       GMM criterion =  14.664337
alternative:   GMM criterion =  .05894927
rescale:       GMM criterion =  .05894927
Iteration 0:   GMM criterion =  .05894927
Iteration 1:   GMM criterion =  .04696624
Iteration 2:   GMM criterion =   .0469661
Iteration 3:   GMM criterion =   .0469661

Estimating rho using GS2SLS residuals:
Iteration 0:   GMM criterion =  .00162061
Iteration 1:   GMM criterion =  .00128954
Iteration 2:   GMM criterion =  .00128891
Iteration 3:   GMM criterion =  .00128891

Spatial autoregressive model                    Number of obs = 159
GS2SLS estimates                                Wald chi2(5)  = 85.58
                                                Prob > chi2   = 0.0000
                                                Pseudo R2     = 0.2452

        hrate | Coefficient   Std. err.       z    P>|z|     [95% conf. interval]
hrate         |
ln_population |     1.24439     .599829    2.07    0.038    .0687469    2.420033
      poverty |    .7042393    .1343533    5.24    0.000    .4409116     .967567
        _cons |   -11.68442    7.016308   -1.67    0.096   -25.43613    2.067295
W             |
ln_population |    .0059698     .363766    0.02    0.987   -.7069985    .7189381
      poverty |   -.4211137    .2491727   -1.69    0.091   -.9094834    .0672559
        hrate |    .5942967    .4283109    1.39    0.165   -.2451773    1.433771
      e.hrate |   -.3940704    .3810944   -1.03    0.301   -1.141002    .3528609
Wald test of spatial terms:          chi2(4) = 4.32       Prob > chi2 = 0.3644


The four spatial lag parameters are statistically insignificant at level 0.05, 
both individually and jointly. The slope coefficients are within roughly 20% 
of OLS estimates and are estimated with similar precision. 


26.7.6 SARAR(p,q) model 


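For time-series data, a flexible model is the autoregressive moving average
model of orders $p$ and $q$ (an ARMA($p, q$) model) with

$$y_t = \lambda_1y_{t-1} + \cdots + \lambda_py_{t-p} + x_t'\beta + u_t$$
$$u_t = \rho_1\varepsilon_{t-1} + \cdots + \rho_q\varepsilon_{t-q} + \varepsilon_t$$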


Similarly, the SARAR(1,1) model can be generalized to an SARAR($p, q$)
model with

$$y = \lambda_1W_1y + \cdots + \lambda_pW_py + X\beta + \gamma_1V_1X_p + \cdots + \gamma_rV_rX_p + u$$
$$u = \rho_1M_1u + \cdots + \rho_qM_qu + \varepsilon$$


where $W_1, \dots, W_p$, $V_1, \dots, V_r$, $M_1, \dots, M_q$ are spatial
weighting matrices that need not be distinct. Such more flexible models may
better capture any spatial correlations, thereby ensuring that the underlying
errors $\varepsilon_i$ are independent, an essential assumption required for
estimator consistency. For example, in a geospatial setting, $W_1$ may be a
contiguity matrix, while $W_2$ may be a distance matrix.


The gs2sls option of the spregress command enables estimation of this
more general model. The ml option of the spregress command allows only
the regressors $X$ to have more than one spatial lag.


26.8 Spatial instrumental variables 


The GS2SLS estimator for SAR models extends straightforwardly to models 
where some of the regressors are endogenous. We suppose that 


$$y = Y^*\pi + \lambda Wy + X\beta + \gamma VX_p + u$$
$$u = \rho Mu + \varepsilon$$


where we have added the $N \times p$ matrix $Y^*$ of $p$ endogenous
regressors. Note that no spatial lag is applied to these endogenous
regressors.


We assume that there are $q \geq p$ instruments $X_{inst}$ available for
$Y^*$. The preceding results carry through. We again define
$y = Z\delta + u$, where $Z$ additionally includes $Y^*$ and $\delta$
additionally includes $\pi$. The matrix $H$ is formed from the linearly
independent columns of a matrix that additionally includes
$X_{inst}, WX_{inst}, \dots, W^qX_{inst}$. The estimator can be implemented
using the spivregress command.


As a purely numerical illustrative example, we consider the SAR(1) in 
mean model and assume that poverty is endogenous and can be 
instrumented by gini (which requires the identifying assumption that gini, 
the Gini coefficient of family inequality, should not be included in the 
original model). The instrument is very strong: the sample correlation 
between poverty and gini is 0.88. We expect little loss of efficiency in this 
example, and we expect OLS and IV estimates to be similar. We obtain


. * Endogenous regressors in SAR in mean model
. spivregress hrate ln_population (poverty=gini), gs2sls dvarlag(W)
>      heteroskedastic
  (159 observations)
  (159 observations (places) used)
  (weighting matrix defines 159 places)

Spatial autoregressive model                    Number of obs = 159
GS2SLS estimates                                Wald chi2(3)  = 62.56
                                                Prob > chi2   = 0.0000
                                                Pseudo R2     = 0.2323

        hrate | Coefficient   Std. err.       z    P>|z|     [95% conf. interval]
hrate         |
      poverty |    .7273454    .1056205    6.89    0.000     .520333    .9343579
ln_population |    1.904222    .6319827    3.01    0.003    .6655591    3.142886
        _cons |   -17.53529    7.172499   -2.44    0.014   -31.59313   -3.477449
W             |
        hrate |   -.0118891    .1152612   -0.10    0.918   -.2377969    .2140186
Wald test of spatial terms:          chi2(1) = 0.01       Prob > chi2 = 0.9178

Instrumented:    poverty (W*hrate)
Raw instruments: ln_population gini hrate:_cons


As expected, the IV estimates and standard errors are similar to those from
OLS regression.


26.9 Spatial panel-data models 


A range of models and estimators has been proposed for spatial data
observed over several time periods. The spxtregress command enables
estimation of one of these models.


Specifically, the spxtregress command extends the cross-sectional
SARAR model to panel data in the case where 1) the panel is strongly
balanced; 2) the same spatial weighting matrix is used in each time period;
3) the model is static (so lagged dependent variables do not appear as a
regressor); and 4) underlying errors are i.i.d. over time and across
observations.


A fixed-effects estimator allows an individual-specific intercept that is 
potentially correlated with model errors. Two variants of a random-effects 
estimator allow the model error to include an individual-specific component 
that is normally distributed. 


A panel variant of the SARAR(1,1) model specifies for the $t$th period

$$y_t = \lambda Wy_t + X_t\beta + \alpha + u_t, \quad t = 1, \dots, T$$
$$u_t = \rho Mu_t + \varepsilon_t$$

where, using more compact notation, $X_t$ is an $N \times K$ matrix that
includes any original regressors plus any relevant spatial lags. Here $y_t$,
$u_t$, and $\varepsilon_t$ are $N \times 1$ vectors, and $W$ and $M$ are
$N \times N$ spatial weighting matrices that are time invariant. The errors
$\varepsilon_{it}$ are assumed to be i.i.d. over both $i$ and $t$.


The key complication is the addition of an $N \times 1$ vector $\alpha$ of
individual-specific time-invariant additive effects.


In a fixed-effects model, the time-invariant individual effects $\alpha_i$
may be correlated with regressors. Define the usual mean-differencing
transformation matrix $Q = I_T - (1/T)e_Te_T'$. Premultiplying the model by
$Q$ eliminates the fixed effects and leads to a transformed SARAR model, but
it induces correlation in errors over time. Lee and Yu (2010) propose an
alternative differencing transformation without this problem. The
$T \times T$ orthonormal eigenvector matrix of $Q$ can be expressed as
$[F, e_T/\sqrt{T}]$, where $F$ is the $T \times (T - 1)$ submatrix
corresponding to the eigenvalues equal to 1. Then postmultiplying $y_t$,
$X_t$, $u_t$, and $\varepsilon_t$ by $F$ yields the transformed model

$$y_t^* = \lambda Wy_t^* + X_t^*\beta + u_t^*, \quad t = 1, \dots, T - 1$$
$$u_t^* = \rho Mu_t^* + \varepsilon_t^*$$

with underlying errors $\varepsilon_{it}^*$ that are uncorrelated for all $i$
and $t$ (and independent under normality).
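
A small worked case may help fix ideas. The following sketch assumes $T = 2$
and takes $e_T$ to be a $T \times 1$ vector of ones; the specific matrices
are an illustration and are not taken from the text:

$$Q = \begin{pmatrix} 1/2 & -1/2 \\ -1/2 & 1/2 \end{pmatrix}, \qquad F = \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}, \qquad QF = F, \quad F'F = 1, \quad F'e_T = 0$$

$$[y_1, y_2]F = \frac{y_1 - y_2}{\sqrt{2}}, \qquad [\alpha, \alpha]F = 0, \qquad \mathrm{Var}\left(\frac{\varepsilon_{i1} - \varepsilon_{i2}}{\sqrt{2}}\right) = \sigma^2$$

In this case the transformation is a rescaled first difference: the
individual effects drop out, a single transformed period remains, and the
transformed error keeps the variance $\sigma^2$ of the original i.i.d.
errors.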


The spxtregress command with fe option fits this model. The
estimator is a quasi-ML estimator that maximizes the log likelihood under
the assumption of normally distributed errors $\varepsilon_{it}$, but the
asymptotic properties (as $N \to \infty$ with $T$ possibly fixed) of
consistency and asymptotic normality require only the weaker assumption that
the errors $\varepsilon_{it}$ are i.i.d.


The spxtregress command with re option estimates models with
random effects. The default re option supposes $\alpha_i$ and
$\varepsilon_{it}$ are i.i.d. normal and obtains the MLE. The re and sarpanel
options fit an alternative model where $\alpha$ is dropped from the structural
equation for $y_t$ and instead appears in the SAR model for $u_t$, so
$u_t = \rho Mu_t + \alpha + \varepsilon_t$. Again, $\alpha_i$ and
$\varepsilon_{it}$ are assumed to be i.i.d. normal, and the MLE is obtained.


For further details and an application, see the Stata documentation for
the command spxtregress.


If the only spatial complication in a panel dataset is spatial correlation
in the error, rather than in the conditional mean, then the panel OLS
estimator is consistent. If $N \to \infty$ and $T \to \infty$ with
time-series autocorrelation disappearing after $m$ lags, then the
Newey-West-type standard errors of Driscoll and Kraay (1998) can be obtained
using the community-contributed xtscc command (Hoechle 2007). This does not
require specification of a model for the spatial correlation in the error.


26.10 Additional resources 


A lengthy description is given in [SP] Stata Spatial Autoregressive Models 
Reference Manual. Standard references for spatial econometrics are 


and Prucha (2007) present spatial HAC standard errors following OLS and Iv 
regression. 


A good introduction to SAR models is provided in Drukker, Prucha, and 
Raciborski (2013). This illustrates now redundant community-contributed 
Stata commands that are quite similar to the subsequent Stata sp 
commands. For relevant theory, see especially Drukker, Egger, and 
Prucha (2019) for quite general SARAR models fit by least-squares methods 
and Lee (2004) for ML estimation. Gibbons and Overman (2012) provide a 
critique of the methods presented in this chapter. 


There are many community-contributed Stata spatial commands; some 
but not all have been superseded by the Stata sp commands. 


A closely related subject is network analysis. Leading textbooks include 


Graham and de Paula (2020) provide recent surveys. Hedström and 
Grund (Forthcoming) detail methods using their nwcommands package. 


26.11 Exercises 


1. Open the files mus226columbusdb.dta and
mus226columbuscoord.dta, and provide a brief summary of their
contents. Next, spset the data using the following commands:
use mus226columbusdb.dta, clear and spset ID and spset, modify
shpfile(mus226columbuscoord.dta). Finally, give grmap CRIME. Do
residential burglaries and vehicle thefts appear to be spatially
correlated?

2. Continue with the same dataset as the previous question. Create a 
spatial weighting matrix (named w) that is a contiguity weighting 
matrix with default spectral normalization. Perform OLS regression of 
CRIME on household income and housing value. Comment on the fitted 
relationship. Perform the Moran test. Are the model errors spatially 
uncorrelated? Test at level 0.05. Estimate an SAR(1) in mean model. Are 
there spillover effects? Use command estat impact. Estimate an 
SAR(1) in error model, and compare estimates with those from OLS and 
SAR(1) in mean models. 

3. The following creates a heat map for median age across U.S. states.
The datasets available at the Stata Press website are
State_2010Census_DP1.shp (a shape file),
State_2010Census_DP1.dbf (a datafile), and
DP_TableDescriptions.xls (which describes the variables in the .dbf
file). Convert the shapefile to Stata format with command
shp2dta using State_2010Census_DP1, database(usdb)
coordinates(uscoord) genid(ID) replace, which will place the
coordinates in uscoord.dta and the dataset in usdb.dta with ID. Open
file uscoord.dta, and summarize and list the first 10 observations.
What do you learn? Open file usdb.dta, give command list ID NAME,
and summarize variable DP0020001, which is median age in each state.
What do you learn? Next, spset the data using commands spset ID
and spset, modify shpfile(uscoord.dta). Finally, produce a heat
map with the outlying states dropped using command grmap
DP0020001 if ID!=8 & ID!=18 & ID!=35. What do you conclude
about the age distribution in the United States?


4. A peer-effects model includes the average value of variables (y or x, or
both) as regressors. We use mus206vlss.dta with dependent variable
pharvis, the number of pharmacy visits by an individual. Individuals
are in communes, the peer group, with on average 148 people per
commune. Throughout, use cluster-robust standard errors that cluster
on commune. Create the variable pharvispeer using commands
bysort commune: egen numincommune = count(_n) followed by
by commune: egen sumpharviscommune = sum(pharvis) and finally
generate pharvispeer = (sumpharviscommune-pharvis)/(numincommune-1).
Fit an OLS regression of pharvis on
lnhhexp and illness using cluster-robust standard errors. Provide
interpretations of the coefficients of lnhhexp and illness. Now, add
pharvispeer as a regressor. Does the average number of pharmacy
visits by others in the commune appear to be related to one's own
pharmacy visits? One possibility is that illnesses are socially
contagious. Create variable illnesspeer for the average number of
illnesses of others in the commune. Rerun the initial regression with
this variable added as a regressor. Does illness appear to be socially
contagious? Suppose we believe illnesspeer does not directly affect
pharvis. Then this is available as an instrument for pharvispeer. Fit a
regression of pharvis on lnhhexp, illness, and pharvispeer with
illnesspeer an instrument for pharvispeer.

5. A better model for the data of the previous question is a Poisson model 
because the dependent variable pharvis is a count. Repeat the same 
regressions, except use poisson rather than regress and use gmm rather 
than ivregress, based on the nonlinear moment condition 


$E[z_i\{y_i - \exp(x_i'\beta)\}] = 0$.


Chapter 27 
Semiparametric regression 


27.1 Introduction 


In this chapter, we present regression methods that require less model 
specification than the methods of the preceding chapters. 


We begin with nonparametric regression of y on x, with conditional 
mean E(y|x) = m(x), where the functional form for m(-) is not specified. 
The most commonly used nonparametric method is kernel-weighted local 
polynomial regression, most often local constant or local linear. An 
introductory treatment was given in sections 2.6.6 and 14.6. For scalar 
regressor x, the lpoly command implements these methods with a fixed 
bandwidth with default value determined by a plugin formula, and the 
lowess command provides a variation to local linear and local constant 
regression that uses a variable bandwidth and downweights observations 
with large residuals. 


In this chapter, we consider extension to multiple regressors. The 
npregress kernel command implements kernel-weighted local linear and 
local constant regression. For local linear regression, it also provides 
estimates of $m'(x) = \partial E(y|x)/\partial x$, enabling estimation of marginal effects. 
The bandwidths for the kernel are determined by leave-one-out 
cross-validation. The npregress series command implements regression on 
higher-order polynomials or on splines. 


A major limitation of nonparametric regression is the curse of 
dimensionality. As the number of regressors grows, the fraction of the 
sample available to evaluate m(x) diminishes. For example, if with one 
regressor x we are willing to evaluate m(x) by averaging y over values of x 
that fall into 1 of 10 bins, then with 2 regressors, $x_1$ and $x_2$, we evaluate 
$m(x_1, x_2)$ by averaging y over values that fall into 1 of $10^2 = 100$ bins. 
Theoretically, the rate of convergence of $\hat{m}(x)$ to $m(x)$ declines, and in 
practice the estimates $\hat{m}(x)$ are too noisy unless there are many 
observations. 
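
The bin-counting argument can be made concrete with a few lines of arithmetic. The sketch below assumes a hypothetical sample of N = 1,000 observations spread evenly over cells, with 10 bins per regressor; both numbers are illustrative choices and not values from the text.

```stata
* Sketch: average observations per cell with 10 bins per regressor
* (N = 1,000 is an assumed, illustrative sample size)
local N = 1000
forvalues K = 1/4 {
    local cells = 10^`K'
    display "K = `K':  cells = " %6.0f `cells' "   obs per cell = " %7.1f `N'/`cells'
}
```

With these assumed values, the average cell already holds only 10 observations with two regressors and fewer than one observation with four.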


The npregress kernel and npregress series commands provide 
additional commands that help overcome this limitation. For example, after 
obtaining $\hat{m}(x_1, x_2)$, one can compute $\hat{m}(x_1|x_2 = x_2^*) = \hat{m}(x_1, x_2^*)$ or 
$\hat{m}(x_1) = (1/N)\sum_{i=1}^{N} \hat{m}(x_1, x_{2i})$. These quantities are more precisely 
estimated and can be easily plotted against $x_1$. Stata provides commands to 
do this. Related postestimation commands additionally provide ways to 
interpret how the dependent variable changes as regressor values change. 


An alternative is to use semiparametric methods that introduce a 
parametric component that greatly reduces the dimension of the 
nonparametric component, often to dimension one. Three leading examples 
are the partial linear model, which specifies $E(y|x, z) = x'\beta + g(z)$, where 
g(·) is not specified; the single-index model, which specifies 
$E(y|x) = g(x'\beta)$, where g(·) is not specified; and the generalized additive 
model (GAM), where $E(y|x) = g_1(x_1) + \cdots + g_K(x_K)$. In the first two 
models, the goal is to obtain a root-N consistent and asymptotically normal 
estimator for the parameters $\beta$. Any nonparametric components are also 
consistently estimated but with rate of convergence less than root-N. 


The chapter begins with details on the methods for kernel-weighted 
local linear and local constant regression and for series regression. The 
subsequent two sections present examples for regression with, respectively, 
a single regressor and multiple regressors. The remaining sections present 
various semiparametric models and estimators. 


The machine learning literature has introduced additional flexible 
methods for regression of y on x, such as lasso, neural networks, regression 
trees, and random forests. These methods are presented in chapter 28. 


27.2 Kernel regression 


We present kernel-weighted local linear and local constant regression and 
details on kernel functions and on bandwidth choice. The discussion is 
more expansive than the introductory treatment in sections 2.6.6 and 14.6. 


27.2.1 Kernel-weighted local constant regression 


Consider the regression model y = m(x) + u, where there are K regressors 
and E(u|x) = 0, so E(y|x) = m(x). The goal is to estimate the conditional 
mean m(x) at each sample value of x, without placing any restrictions on 
the functional form for m(·). 


Suppose there are many observations on y for a given value $x_0$ of the 
regressors. Then an obvious nonparametric estimator of $m(x_0)$ is the 
average value of y for those observations with $x_i = x_0$. This can be 
expressed as $\hat{m}(x_0) = \sum_{i=1}^{N} \{1(x_i = x_0)/N_0\} \times y_i$, where the indicator 
function 1(A) = 1 if event A occurs and equals 0 otherwise and $N_0$ is the 
number of observations with $x_i = x_0$. 


In practice, there will generally be few to no observations with $x_i = x_0$, 
especially if there are several regressors or continuous regressors, or both. 
A more general method additionally averages y over neighboring 
observations that have $x_i$ close to $x_0$, with greatest weight given to those 
observations with $x_i$ closest to $x_0$. 


Kernel regression, also known as local constant regression, uses the 
locally weighted average 

$$\hat{m}(x_0) = \sum_{i=1}^{N} w(x_i, x_0, h)\, y_i$$

where the weights $w(x_i, x_0, h)$, defined in (27.1) below, increase as $x_i$ 
becomes closer to $x_0$ and decrease as the bandwidth parameter or 
smoothing parameter h increases. 


For multivariate regressor x with K regressors $x_j$, $j = 1,\dots,K$, Stata 
uses the product weighting function or product kernel function 

$$w(x_i, x_0, h) = \prod_{j=1}^{K} w_j(x_{ij}, x_{0j}, h_j) \tag{27.1}$$

where for the jth regressor the weights are kernel weights 

$$w_j(x_{ij}, x_{0j}, h_j) = K\!\left(\frac{x_{ij} - x_{0j}}{h_j}\right) \Big/ \sum_{m=1}^{N} K\!\left(\frac{x_{mj} - x_{0j}}{h_j}\right)$$


Kernel functions for continuous regressors were introduced in section 2.6.4 
and are defined in [R] kdensity. For discrete regressors, the Li-Racine 
kernels defined in (27.2) and (27.3) below are used. 


27.2.2 Kernel-weighted local linear regression 


The local constant regression estimator simply averages y over x values in 
the local neighborhood of $x_0$. For $x_0$ close to the boundary of the range of x, 
one can average only on one side of $x_0$, leading to bias if the relationship 
between y and x is not constant at the boundary. Local linear regression 
reduces this problem by allowing for linearly increasing or decreasing 
relationships near the boundary. 


Recall that ordinary least-squares (OLS) regression on only an intercept 
gives fitted value of the sample mean, and weighted least-squares 
regression on only an intercept gives fitted value of the weighted mean. 
Thus, the local constant estimator of $m(x_0)$ can be obtained as $\hat{m}(x_0) = \hat{\alpha}_0$, 
where $\hat{\alpha}_0$ minimizes 

$$\sum_{i=1}^{N} w(x_i, x_0, h) \times (y_i - \alpha_0)^2$$


The local linear estimator at $x = x_0$ instead minimizes with respect to 
$\alpha_0$ and $\beta_0$ the weighted sum of squares 

$$\sum_{i=1}^{N} w(x_i, x_0, h) \times \{y_i - \alpha_0 - (x_i - x_0)'\beta_0\}^2$$


Then the conditional mean estimate is $\hat{m}(x_0) = \hat{\alpha}_0$. 


Furthermore, this yields the partial-effects estimate $\hat{m}'(x_0) = \hat{\beta}_0$, where 
$m'(x_0) = \partial m(x)/\partial x|_{x_0}$. These estimators are consistent for $m(x_0)$ and 
$m'(x_0)$ as the sample size $N \to \infty$ and the bandwidth $h \to 0$ at an 
appropriate rate. 
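
The weighted least-squares interpretation can be checked by hand at a single evaluation point. The following is a minimal sketch on simulated data, not the chapter's dataset: the data-generating process, the evaluation point x0 = 0.5, the bandwidth h = 0.1, and all variable and scalar names are assumptions made for illustration. It uses epan2 kernel weights as analytic weights.

```stata
* Sketch: local constant and local linear fits at one point x0 by weighted least squares
* (simulated data; x0 = 0.5 and h = 0.1 are arbitrary illustrative choices)
clear
set seed 10101
set obs 500
generate x = runiform()
generate y = sin(2*_pi*x) + rnormal(0, 0.2)

scalar x0 = 0.5
scalar h  = 0.1
generate z  = (x - scalar(x0))/scalar(h)
generate kw = 0.75*(1 - z^2)*(abs(z) < 1)    // epan2 kernel weights
generate xc = x - scalar(x0)                 // regressor centered at x0

* Local constant: kernel-weighted mean of y
summarize y [aweight = kw]
scalar m_lc = r(mean)

* Same estimate from weighted regression on an intercept only
regress y [aweight = kw]
scalar m_wls = _b[_cons]

* Local linear: intercept estimates m(x0), slope estimates m'(x0)
regress y xc [aweight = kw]
scalar m_ll  = _b[_cons]
scalar dm_ll = _b[xc]
display "local constant = " m_lc "   local linear = " m_ll "   slope = " dm_ll
```

In this simulated design the true values are m(0.5) = 0 and m'(0.5) = -2π, so the local linear intercept should be near zero and the slope clearly negative. Repeating the two regressions over a grid of x0 values traces out the fitted curve.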


Note that for some values of $x_0$, the local linear estimate $\hat{m}(x_0)$ may 
not exist, and $m(x_0)$ is therefore not identified. Consider regression on a 
single continuous regressor. To calculate $\hat{\alpha}_0$ and $\hat{\beta}_0$ by linear regression at a 
particular point $x_0$, we need at least two observations $x_i$ for which the 
kernel weights $K\{(x_i - x_0)/h\}$ are nonzero. This may not occur in places 
where the data on x are widely spread and the bandwidth h is small. Such 
lack of identification becomes more likely as more continuous regressors 
are added. 


27.2.3 Kernel functions 


For continuous regressors, Stata offers a range of kernel functions that are 
defined in [R] kdensity. The most commonly used are the rectangular, 
Epanechnikov, and Gaussian kernels. 


The rectangular or uniform kernel, the kernel(rectangle) option, 
specifies that $K(z) = (1/2) \times 1(|z| < 1)$, where 1(A) = 1 if event A occurs 
and 1(A) = 0 otherwise. Then one simply averages over those observations 
with $|(x_i - x_0)/h| < 1$ or, equivalently, with $|x_i - x_0| < h$. 


There are two Stata variants of the Epanechnikov or parabolic kernel. 
The option kernel(epanechnikov), the default for continuous regressors, 
uses the original parameterization 
$K(z) = \{3/(4\sqrt{5})\}(1 - z^2/5) \times 1(|z| < \sqrt{5})$. The kernel(epan2) option 
uses the more commonly used parameterization 
$K(z) = (3/4)(1 - z^2) \times 1(|z| < 1)$. Using the kernel(epanechnikov) 
option with bandwidth $h = b$ is equivalent to using the kernel(epan2) 
option with bandwidth $h = \sqrt{5}\,b$. The Epanechnikov kernel is the optimal 
kernel (minimizing mean-squared error [MSE]) for density estimation but 
not necessarily for kernel regression. 
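
The equivalence between the two Epanechnikov parameterizations can be checked numerically with kdensity, which uses the same kernel definitions. This is a sketch under stated assumptions: it uses Stata's shipped auto.dta rather than the chapter's data, the bandwidth b = 2 is arbitrary, and the generated variable names are hypothetical.

```stata
* Sketch: epanechnikov with bandwidth b versus epan2 with bandwidth sqrt(5)*b
* (auto.dta and b = 2 are illustrative choices, not from the text)
sysuse auto, clear
local b  = 2
local b2 = sqrt(5)*`b'
kdensity mpg, kernel(epanechnikov) bwidth(`b')  at(mpg) generate(x_a d_a) nograph
kdensity mpg, kernel(epan2)        bwidth(`b2') at(mpg) generate(x_b d_b) nograph

* The two density estimates should agree up to rounding error
generate double diff = abs(d_a - d_b)
summarize diff
```

If the two parameterizations are equivalent as stated, the maximum of diff should be zero up to floating-point precision.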


The Gaussian or normal kernel specifies $K(z) = (1/\sqrt{2\pi})\,e^{-z^2/2}$. 
Because K(z) is then positive for all z, it is less likely to lead to lack of 
identification of $m(x_0)$ for some values of $x_0$. 


For discrete binary regressors, the default kernel for the npregress 
command is to use the Li-Racine kernel. For scalar regressor x, this defines 

$$K(x_i - x_0, h) = \begin{cases} 1 & \text{if } x_i = x_0 \\ h & \text{otherwise} \end{cases}, \qquad 0 \le h \le 1 \tag{27.2}$$


When h = 0, the Li-Racine kernel reduces to the indicator function 
$1(x_i = x_0)$. In the simplest case of nonparametric regression of $y_i$ on 
binary regressor $d_i$, this leads to $\hat{m}(0)$ equal to the average of y when 
$d_i = 0$ and $\hat{m}(1)$ equal to the average of y when $d_i = 1$; the same result is 
obtained using the dkernel(cellmean) option. 


When h = 1, the Li-Racine kernel is a uniform kernel, always equal to 
1. Then nonparametric regression of y on binary regressor d leads to 
$\hat{m}(0) = \hat{m}(1) = \bar{y}$. 


For ordered discrete regressors, the default kernel is the Li-Racine 
kernel. For regressor j taking possible values 0,...,m, this defines 

$$K(x_i - x_0, h) = h^{|x_i - x_0|}, \qquad 0 \le h \le 1 \tag{27.3}$$


This reduces to the kernel (27.2) in the binary case. Similar to the binary 
case, when h = 0, the kernel yields cell means, equivalent to the 
dkernel(cellmean) option, and when h = 1, the kernel yields the overall 
mean. When the ordered discrete regressor takes many values, an 
alternative is to treat it as a continuous regressor. 


27.2.4 Bandwidth choice 


Crucial ingredients are the smoothing parameters $h_j$, $j = 1,\dots,K$, that 
determine the bandwidth. Smaller values of $h_j$ lead to smaller bias in the 
estimate $\hat{m}(x_0)$ because greater weight is then placed on observations with 
$x_i$ close to $x_0$. But smaller values of $h_j$ also lead to greater variance in the 
estimate $\hat{m}(x_0)$ because then fewer observations are used in computing 
$\hat{m}(x_0)$. 


This tradeoff is balanced by minimizing MSE, the sum of squared bias 
and variance. Empirically, this is implemented by choosing the bandwidths 
for each regressor to jointly minimize the leave-one-out cross-validation 
(LOOCV) measure 

$$\mathrm{CV}(h) = \frac{1}{N}\sum_{i=1}^{N} \{y_i - \hat{m}_{-i}(x_i)\}^2$$

where evaluation of $\hat{m}(x_0)$ is at each of the N sample values $x_1,\dots,x_N$, 
and $\hat{m}_{-i}(x_i)$ is the estimate of $m(x_i)$ obtained using all observations but 
the ith observation. 
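
To see how a leave-one-out criterion responds to the bandwidth, it can be computed directly for a small problem. The Mata sketch below is an illustration only and is not how npregress is implemented: it uses simulated data, a local constant fit with a Gaussian kernel, a coarse bandwidth grid, and a hypothetical function name loocv(). It reports the average squared leave-one-out prediction error.

```stata
* Sketch: leave-one-out cross-validation for a local constant fit (simulated data)
clear
set seed 10101
set obs 200
generate x = runiform()
generate y = sin(2*_pi*x) + rnormal(0, 0.2)

mata:
// Average squared leave-one-out prediction error at bandwidth h (Gaussian kernel)
real scalar loocv(real colvector y, real colvector x, real scalar h)
{
    real scalar    n
    real matrix    W
    real colvector mhat

    n    = rows(x)
    W    = normalden((x*J(1, n, 1) - J(n, 1, 1)*x') / h)  // W[i,j] = K{(x_i - x_j)/h}
    W    = W - diag(diagonal(W))                          // drop own observation
    mhat = (W*y) :/ rowsum(W)                             // leave-one-out fitted values
    return(sum((y - mhat):^2)/n)
}

yv = st_data(., "y")
xv = st_data(., "x")
for (h = 0.01; h <= 0.65; h = 2*h) {
    printf("h = %4.2f   CV = %9.4f\n", h, loocv(yv, xv, h))
}
end
```

Very small bandwidths give noisy fits and very large ones oversmooth, so the criterion is typically smallest at an intermediate bandwidth. A numerical optimizer would replace the coarse grid in practice.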


In addition to calculating the conditional mean estimate m(-), the 
npregress kernel command estimates partial effects, the derivatives of 
m(x) with respect to x. These estimates are obtained using different 
bandwidths, also calculated using cross-validation. The lpoly command 
instead uses a plugin value for the bandwidth. 


27.2.5 The npregress kernel command 


The npregress kernel command has syntax 

npregress kernel depvar indepvars [if] [in] [, options] 


The default is to use the local linear kernel estimator. The local constant 
kernel estimator is obtained using the estimator(constant) option; then 
the noderivatives option is additionally needed if regressors are 
continuous. The options include kernel(), which specifies the kernel 
function to be used for all the continuous regressors. Available kernels are 
defined in [R] kdensity; the default is kernel(epanechnikov). The 
dkernel() option specifies the kernel function to be used for all the discrete 
regressors, which need to be identified using the i. prefix. The default is 
dkernel(liracine), while dkernel(cellmean) uses cell means. The 
vce(bootstrap) option provides bootstrap standard errors and associated 
p-values and confidence intervals for the averages of $\hat{m}(x_0)$ and $\hat{m}'(x_0)$. 
Other options include ones that allow direct specification of bandwidths for 
conditional mean estimates and for partial-effect estimates. The evaluation 
points $x_0$ are the distinct sample values of x. 


Note that the flexibility of nonparametric regression comes at the cost of 
increased variability and bias in finite samples, due to local weighting. The 
asymptotic theory assumes that the bandwidth $h \to 0$ as the sample size 
$N \to \infty$. With K regressors, the optimal bandwidth that minimizes MSE is 
$O(N^{-1/(K+4)})$, in which case $\hat{m}(x_0)$ converges to $m(x_0)$ at rate 
$O(N^{-2/(K+4)})$, rather than the usual $O(N^{-1/2})$ for OLS. And the bias in 
$\hat{m}(x_0)$ is also $O(N^{-2/(K+4)})$. 
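
To get a feel for how quickly the convergence rate deteriorates with the number of regressors, the rate factors can be evaluated at a given sample size. The sketch below assumes N = 10,000 purely for illustration and ignores the constants hidden in the O(·) notation.

```stata
* Sketch: nonparametric rate N^(-2/(K+4)) versus the parametric rate N^(-1/2)
* (N = 10,000 is an assumed, illustrative sample size)
local N = 10000
forvalues K = 1/5 {
    display "K = `K':  N^(-2/(K+4)) = " %6.4f `N'^(-2/(`K'+4))
}
display "OLS benchmark N^(-1/2) = " %6.4f `N'^(-1/2)
```

At this sample size, the factor with K = 4 regressors is 0.1, ten times the parametric benchmark of 0.01.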


The estimates $\hat{m}(x_0)$ and $\hat{m}'(x_0)$ are asymptotically normally 
distributed. The default for the npregress command is to present no 
estimates of precision. The vce(bootstrap) option provides bootstrap 
standard errors and associated p-values and confidence intervals for the 
averages of $\hat{m}(x_0)$ and $\hat{m}'(x_0)$. In presenting confidence intervals, Stata 
follows the standard approach of ignoring any possible bias in $\hat{m}(x_0)$ and 
$\hat{m}'(x_0)$, though in practice such bias will be nonnegligible unless the local 
average is over many observations. 


27.2.6 Kernel-weighted local linear m-estimation 


Local polynomial estimators can be generalized from least-squares 
estimators to m-estimators such as maximum likelihood estimators that 
maximize $\sum_{i=1}^{N} q(y_i|x_i, \theta)$. 


A local m-estimator at $x = x_0$ maximizes with respect to $\theta_0$ the 
weighted sum 

$$\sum_{i=1}^{N} w(x_i, x_0, h) \times q\{y_i|(x_i - x_0), \theta_0\}$$

where w(·) are kernel weights. 


An example given in section 17.8.4 uses a local logit model to obtain 
nonparametric estimates of Pr(y; = 1|x;). 


27.3 Series regression 


A series estimator fits a model with regressors that are based on a series 
expansion in x. Leading examples are polynomials in x, introduced in 
section 14.4, and regression splines in x, introduced in section 14.5. 


A nonparametric series estimator allows the order of the polynomial or 
the number of knots, or both, to increase with sample size, in which case 
these quantities become the analogue of the bandwidth in kernel regression 
and are determined by data-driven methods such as cross-validation or 
an information criterion. 


27.3.1 Series regression model 


A series expansion model or basis function model or sieve regression model 
specifies 


$$E(y_i|x_i) = \sum_{j=1}^{M} \beta_j g_j(x_i)$$


where the M functions $g_j(\cdot)$ are series expansions of the underlying $x_i$. 


In the scalar regressor case, the $g_j(\cdot)$ may be a polynomial or regression 
spline in $x_i$. For multiple regressors, these terms are fully interacted in the 
case of regression splines, while for a polynomial of degree k, the sum of 
the powers of cross-product terms is restricted to be at most k. 


A series regression model can be directly estimated by OLS regression, 
and appropriate standard errors are easily calculated. 
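
As a minimal sketch of a series regression fit directly by OLS, the following code fits a cubic polynomial in a single regressor using factor-variable notation. The simulated data, the fixed degree of three, and the variable names are assumptions for illustration; a nonparametric series estimator would instead choose the degree by a data-driven criterion.

```stata
* Sketch: cubic polynomial series regression by OLS (simulated data)
clear
set seed 10101
set obs 500
generate x = runiform(-2, 2)
generate y = exp(x/2) + rnormal(0, 0.3)

* Cubic polynomial in x, fit by OLS with heteroskedastic-robust standard errors
regress y c.x##c.x##c.x, vce(robust)
predict double yhat_cubic

* Average derivative of the fitted polynomial
margins, dydx(x)
```

Because the powers are specified with factor-variable notation, margins differentiates the whole polynomial with respect to x and does not treat each power as a separate regressor.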


27.3.2 The npregress series command 


The npregress series command has syntax 


npregress series depvar indepvars [if] [in] [weight] [, options] 


For polynomial regression, the polynomial option fits a polynomial of 
data-determined degree, while the polynomial(#) option fits a polynomial 
of specified degree. 


For spline regression, the spline option fits a third-order natural spline, 
and the spline(#) option fits a natural spline of specified order. The 
command also estimates B-splines, using the bspline and bspline(#) 
options. For spline regression, the number of knots can be data determined 
or can be specified. In the latter case, the knots(#) option specifies the 
number of knots, or the knotsmat() option provides values for the knots. 


The criterion() option specifies the method used to determine the 
polynomial degree or the number of knots, if these have not been specified. 
The available methods are LOOCV, generalized cross-validation, Akaike's 
information criterion, Bayesian information criterion, and Mallows's 
criterion. 


The vce() option enables inference based on heteroskedastic-robust 
standard errors (the default), homoskedastic standard errors, or bootstrap 
standard errors (including cluster bootstrap standard errors). 


The asis() option enables regressors to be included as independent 
regressors, rather than as polynomials or splines. The nointeract() option 
enables a polynomial or spline in a variable to appear on its own rather than 
interactively. 


27.4 Nonparametric single regressor example 


We use mus202psid92m.dta, which was introduced in section 2.4. To speed 
analysis, we analyze a 5% subsample, and to make graphs easier to read, we 
drop a few observations with annual earnings over $145,000. We present the 
case of a single regressor in some detail before extension to multiple 
regressors. 


27.4.1 Basic kernel-weighted local linear regression 


Summary statistics for the level of annual earnings and various regressors 
(annual hours, age in years, and four education categories) in the subsample 
are 


. * Read in data and choose a 5% sample 
. qui use mus202psid92m 

. drop if earnings==0 | earnings==. | earnings > 145000 
(521 observations deleted) 

. set seed 10101 

. keep if runiform() < 0.05 
(3,588 observations deleted) 

. sum earnings hours age edcat ibn.edcat 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
    earnings |        181    29883.24    21727.37        700     132000 
       hours |        181     2094.37    671.4728        180       3840 
         age |        181    37.92265    5.628754         30         50 
       edcat |        181    2.618785    1.018431          1          4 
             | 
       edcat | 
          1  |        181    .1491713    .3572454          0          1 
          2  |        181    .3314917    .4720552          0          1 
          3  |        181    .2707182    .4455634          0          1 
          4  |        181    .2486188    .4334112          0          1 

OLS regression of earnings on hours yields a slope coefficient of 15.14 
and $R^2 = 0.219$. 


. * Single regressor: OLS 
. regress earnings hours, vce(robust) 


Linear regression                               Number of obs     =        181 
                                                F(1, 179)         =      32.60 
                                                Prob > F          =     0.0000 
                                                R-squared         =     0.2188 
                                                Root MSE          =      19257 

             |               Robust 
    earnings | Coefficient  std. err.      t    P>|t|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
       hours |    15.1373   2.651025     5.71   0.000     9.906017    20.36858 
       _cons |  -1819.867   4932.261    -0.37   0.713    -11552.72     7912.99 


We then obtain the local linear estimator using the epan2 kernel. 


. * Single regressor: Local linear with epan2 kernel and no standard errors 
. npregress kernel earnings hours, kernel(epan2) 

Computing mean function 

Minimizing cross-validation function: 

Iteration 0:   Cross-validation criterion =  50.618809 
Iteration 1:   Cross-validation criterion =  50.618809 

warning: 1 observation was not used to compute the mean function because it 
         violated the model identification assumptions. This observation is 
         marked as 1 in the system variable _unident_sample. You may use the 
         unidentsample() option to use a different variable name. 

Computing optimal derivative bandwidth 

Iteration 0:   Cross-validation criterion =  51380.609 
Iteration 1:   Cross-validation criterion =  2899.7862 
Iteration 2:   Cross-validation criterion =  2899.7862 
Iteration 3:   Cross-validation criterion =  2899.7862 
Iteration 4:   Cross-validation criterion =  2899.7862 
Iteration 5:   Cross-validation criterion =  2899.7862 
Iteration 6:   Cross-validation criterion =  2544.6791 
Iteration 7:   Cross-validation criterion =  2544.6791 
Iteration 8:   Cross-validation criterion =   2328.178 
Iteration 9:   Cross-validation criterion =   2328.178 
Iteration 10:  Cross-validation criterion =   2328.178 
Iteration 11:  Cross-validation criterion =  2284.5741 
Iteration 12:  Cross-validation criterion =  2284.5741 
Iteration 13:  Cross-validation criterion =   2265.005 
Iteration 14:  Cross-validation criterion =   2265.005 
Iteration 15:  Cross-validation criterion =  2253.2349 
Iteration 16:  Cross-validation criterion =  2253.2349 
Iteration 17:  Cross-validation criterion =  2246.9712 
Iteration 18:  Cross-validation criterion =  2246.9712 
Iteration 19:  Cross-validation criterion =  2246.9712 
Iteration 20:  Cross-validation criterion =  2245.4269 
Iteration 21:  Cross-validation criterion =  2245.4269 

Bandwidth 
             |      Mean     Effect 
-------------+--------------------- 
       hours |  237.4066   1149.882 

Local-linear regression                    Number of obs      =        180 
Kernel   : epan2                           E(Kernel obs)      =        180 
Bandwidth: cross-validation                R-squared          =     0.2676 

    earnings |   Estimate 
-------------+----------- 
Mean         | 
    earnings |   29869.77 
Effect       | 
       hours |   14.27923 

Note: Effect estimates are averages of derivatives. 
Note: You may compute standard errors using vce(bootstrap) or reps(). 


The initial output reports the iterative decline in the LOOCV criterion as the 
bandwidth is changed to ultimately yield the optimal bandwidth for $\hat{m}(x)$ 
and for $\hat{m}'(x)$. One observation needed to be dropped; this observation is 
identified below. The table then reports the optimal bandwidths. The first 
bandwidth, the mean bandwidth, is the one that minimizes LOOCV for 
prediction of the conditional mean m(x). The second bandwidth is used for 
estimating the partial effects m'(x) and is wider because the rate of 
convergence for $\hat{m}'(x)$ is slower than that for $\hat{m}(x)$. In this example, it is 
exceptionally wide because $2h \approx 2300$ is more than 3 times the standard 
deviation of hours. 


The key results are then given. The nonparametric estimates lead to 
$R^2 = 0.268$, a much better fit than OLS, which had $R^2 = 0.219$ (albeit fit on 
all 181 observations). The average predicted value of the fitted mean, 
$(1/N)\sum_{i=1}^{N} \hat{m}(x_i)$, is 29869.77, so on average, predicted earnings are 
$29,870. The average of the partial effect, $(1/N)\sum_{i=1}^{N} \hat{m}'(x_i)$, is 14.28, so 
on average over the sample, an additional hour of work is associated with a 
$14.28 increase in earnings. This is close to the OLS estimate (fit on all 181 
observations) of 15.14. 


To obtain measures of the precision of the estimates, we need to use the 
option vce(bootstrap). We obtain 


. * Single regressor: Local linear with epan2 kernel and bootstrap standard errors 
. npregress kernel earnings hours, kernel(epan2) 
>     vce(bootstrap, seed(10101) reps(400) nodots) 

Bandwidth 
             |      Mean     Effect 
-------------+--------------------- 
       hours |  237.4066   1149.882 

Local-linear regression                    Number of obs      =        180 
Kernel   : epan2                           E(Kernel obs)      =        180 
Bandwidth: cross-validation                R-squared          =     0.2676 

             |   Observed   Bootstrap                         Percentile 
    earnings |   estimate   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
Mean         | 
    earnings |   29869.77   1544.925    19.33   0.000     26881.36    32934.34 
Effect       | 
       hours |   14.27923   3.021378     4.73   0.000     8.151543    19.69207 

Note: Effect estimates are averages of derivatives. 


Note that the output no longer includes a log of the iterations to determine 
the optimal bandwidth. Now, additionally, we find that the average partial 
effect of 14.28 has standard error 3.02 and is highly statistically significant 
with z = 4.73. The flexibility of local linear regression comes at the expense 
of precision. Thus, the standard error for the average partial effect of hours is 
3.02 compared with a standard error of 2.65 for the OLS slope coefficient 
(using all 181 observations). 


Determining the optimal bandwidths can require considerable 
computation. These optimal bandwidths are held constant throughout the 
bootstrap replications, greatly speeding up computation that nonetheless can 
take considerable time. The default bootstrap is a pairs bootstrap. The option 
vce(bootstrap, cluster(varlist)) implements a cluster pairs bootstrap 
that gives cluster-robust standard errors with clustering on the variable given 
in varlist. 


27.4.2 Observations not identified 


One observation was dropped because of inability to fit the model, whereas 
OLS can be fit using all observations. In general, observations are more likely 
to be dropped the smaller the sample size and the greater the number of 
regressors. 


One can do analysis with this observation dropped, but in general there is 
a reluctance to drop observations. And for some purposes, predictions for all 
observations may be needed. 


To obtain predictions for the full sample, one can proceed in several 
ways. First, a wider bandwidth can be used, one wide enough that no 
observations are dropped. Second, an alternative kernel may be used. In 
particular, the Gaussian kernel sets K (z) > 0 for all z, whereas the other 
kernels set K (z) = 0 for larger values of z. Third, one can use the local 
constant estimator rather than the default local linear estimator. 


In the current example, the bandwidth for $\hat{m}(x)$ is 237.4, so a local linear 
estimate is not possible for any observation on hours for which there is no 
other observation within 237.4 hours. We first identify that observation. 


. * The nonidentified observation 
. qui npregress kernel earnings hours, kernel(epan2) 

. list earnings hours if _unident_sample == 1, clean 

       earnings     hours 
 72.      90500   3840.00 

. list hours if hours > 3590, clean 

         hours 
 72.   3840.00 
148.   3598.00 

The problem observation is that with hours = 3840. The closest observation 
has hours = 3598, and 3840 — 3598 = 242 > 237.4. 


OLS estimates with this observation dropped can be compared directly 
with the preceding nonparametric estimates. 


. * OLS with the nonidentified observation dropped 
. regress earnings hours if _unident_sample == 0, vce(robust) 

Linear regression                               Number of obs     =        180 
                                                F(1, 178)         =      29.18 
                                                Prob > F          =     0.0000 
                                                R-squared         =     0.1984 
                                                Root MSE          =      19132 

             |               Robust 
    earnings | Coefficient  std. err.      t    P>|t|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
       hours |   14.36874   2.659945     5.40   0.000     9.119659    19.61783 
       _cons |  -407.6364   4934.983    -0.08   0.934    -10146.24    9330.964 


The nonparametric estimates fit better with $R^2 = 0.268$ compared with 
$R^2 = 0.198$ for OLS. The average of the partial effect, $(1/N)\sum_{i=1}^{N} \hat{m}'(x_i)$, 
is 14.28, compared with the OLS slope estimate of 14.37. The standard error 
for the average partial effect of hours is 3.02 compared with a standard error 
of 2.66 for the OLS slope coefficient. 


The following three variations of the npregress command lead to 
identification for all observations. 


. * Three ways to obtain identification for all observations 
. qui npregress kernel earnings hours, kernel(epan2) meanbwidth(243, copy) 

. matrix list e(b)  // Wider bandwidth for same kernel 

e(b)[1,2] 
         Mean:     Effect: 
     earnings       hours 
y1  30211.289    15.25212 

. qui npregress kernel earnings hours, kernel(gaussian) 

. matrix list e(b)  // Different unbounded kernel 

e(b)[1,2] 
         Mean:     Effect: 
     earnings       hours 
y1  30168.583   15.895576 

. qui npregress kernel earnings hours, kernel(epan2) estimator(constant) 
>     noderivatives  // Local constant rather than local linear 

. matrix list e(b) 

symmetric e(b)[1,1] 
         Mean: 
     earnings 
y1   29681.41 


The dropped observation had relatively high earnings, so using all 181 
observations leads to a higher average fitted mean for the two local linear 
estimations. The local constant estimated average fitted mean of 29681 is 
lower because it does not allow for an increasing relationship in hours at 
high values of hours. And for local constant regression with a continuous 
regressor, partial effects are no longer available. 


27.4.3 Trimmed means 


Note that even if $\hat{m}(x)$ is identified for all observations, it may be very 
noisily estimated in regions where the data on x are sparse, leading to 
imprecision of the average mean $(1/N)\sum_{i=1}^{N} \hat{m}(x_i)$. 


A common procedure is to compute a trimmed mean that drops 
observations for which $\hat{f}(x)$, the estimated density of x, is small. The 
following code computes the trimmed mean, dropping the 5% of 
observations with the smallest $\hat{f}(x)$, where $\hat{f}(x)$ for hours is calculated 
using the kdensity command. Here the kernel and bandwidth are specified 
to be the same as those used by the npregress command. 


. * Trimmed mean dropping 5% of observations with lowest f_hat(x) 
. qui npregress kernel earnings hours, kernel(epan2) meanbwidth(243, copy) 

. qui predict m_earnings 

. kdensity hours, kernel(epan2) bwidth(243) at(hours) generate(hours_x hours_d) 

. qui centile hours_d, centile(5) 

. list hours_d hours earnings m_earnings if hours_d < r(c_1), clean 

         hours_d     hours   earnings   m_earn~s 
 38.   .00005703    746.00       8000   6870.002 
 62.   .00007799    180.00        729   1375.756 
 72.   .00001719   3840.00      90500      90500 
 99.    .0000797    193.00        989   2233.159 
105.   .00008132    252.00      15000   5635.232 
110.   .00003947    520.00       4696   4430.957 
117.   .00007686    315.00       4832   5764.087 
118.   .00004466    600.00       3500   4838.011 
168.   .00008169    220.00        700   3875.243 

. sum m_earnings if hours_d >= r(c_1) 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
  m_earnings |        172    31062.33    9361.985    7785.32   69356.19 

. sum m_earnings 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
  m_earnings |        181    30211.29    11573.35   1375.756      90500 


All but one of the observations dropped in computing the trimmed mean 
have low hours and associated low earnings. Thus, the trimmed mean fitted 
earnings (31,062) are higher than the untrimmed mean (30,211). 


The original npregress estimates that dropped the observation with 
hours=3840 can also be seen as a form of trimming; this observation has the 
lowest value (0.00001719) for the fitted density. 


27.4.4 Conditional means and partial effects for different regressor 
values 


Rather than sample averages of estimated conditional means and partial 
effects, interest often lies in estimated conditional means and partial effects 
for specific values of the regressor. 


The following code uses the margins postestimation command to 
calculate the fitted mean $\hat{m}(x_0)$ for hours equal to 1,000, 2,000, and 3,000, 
along with 95% confidence intervals. 


. * Compute predicted mean at hours = 1000, 2000, 3000 
. qui npregress kernel earnings hours, kernel(epan2) 
>     vce(bootstrap, seed(10101) reps(400)) 

. margins, at(hours=(1000(1000)3000)) vce(bootstrap, seed(10101) reps(400) nodots) 

Adjusted predictions                            Number of obs =  180 
                                                Replications  =  400 

Expression: Mean function, predict() 
1._at: hours = 1000 
2._at: hours = 2000 
3._at: hours = 3000 

             |   Observed   Bootstrap                         Percentile 
             |     margin   std. err.      z    P>|z|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
         _at | 
          1  |   9332.563   2394.826     3.90   0.000     3784.188     12254.2 
          2  |   28084.98   1820.559    15.43   0.000     24793.01    32087.81 
          3  |   41566.51    12190.7     3.41   0.001     30744.03    54574.63 

For example, the predicted conditional mean earnings at 2,000 hours is 
$28,085, with 95% confidence interval [24793, 32088]. The estimates are 
much more precise at 2,000 hours, where the data on annual hours are 
relatively dense, than at 1,000 hours and 3,000 hours, where the data on 
hours are more sparse. 


We next present marginal effects or partial effects computed as $\Delta\hat{m}(x_0)$, 
the change in the estimated conditional mean as $x_0$ changes. The 
contrast(atcontrast(ar)) option of the margins command calculates the 
change in the prediction when the regressor value changes from one value to 
the next. The following code computes this, along with 95% confidence 
bands. 


. * Compute predicted effect of change in hours of 1,000 hours 
. margins, at(hours=(1000(1000)3000)) contrast(atcontrast(ar)) 
>     vce(bootstrap, seed(10101) reps(400) nodots) 

Contrasts of predictive margins                 Number of obs =  180 
                                                Replications  =  400 

Expression: Mean function, predict() 
1._at: hours = 1000 
2._at: hours = 2000 
3._at: hours = 3000 

             |         df        chi2     P>chi2 
-------------+---------------------------------- 
         _at | 
   (2 vs 1)  |          1       37.31     0.0000 
   (3 vs 2)  |          1        1.21     0.2712 
      Joint  |          2       39.32     0.0000 

             |   Observed   Bootstrap          Percentile 
             |   contrast   std. err.     [95% conf. interval] 
-------------+------------------------------------------------ 
         _at | 
   (2 vs 1)  |   18752.42   3069.863      14353.92     25744.2 
   (3 vs 2)  |   13481.53   12252.35      909.9065    26405.05 


For example, the preceding margins command computed values of 9,333 at 
1,000 hours and 28,085 at 2,000 hours, so a 1,000-hour increase from 1,000 
hours to 2,000 hours increases earnings by 28085 − 9333 = $18,752. The 
95% confidence interval is [14354, 25744]. 


27.4.5 Plots for different regressor values 


The marginsplot command can be used to plot predicted conditional means 
and marginal effects for various values of the regressor. 


The following code produces a plot of the relationship between estimated 
mean earnings and hours as hours increase in increments of 100 from 400 to 
3,200, along with 95% pointwise confidence intervals and a scatterplot of 
the data. 


. * Plot predictions at many levels of hours along with 95% 
> confidence intervals 
. qui npregress kernel earnings hours, kernel(epan2) 

. qui margins, at(hours=(400(100)3200)) 
>     vce(bootstrap, seed(10101) reps(400) nodots) 

. marginsplot, legend(off) 
>     addplot(scatter earnings hours if earnings < 80000, msize(vsmall)) 

Variables that uniquely identify margins: hours 


The plot is given in the first panel of figure 27.1. The confidence intervals 
are narrower in regions where the data are denser, with more observations 
over a local range of hours. The relationship appears to be linear, aside from 
at high values of hours where the confidence intervals are very broad. 
Figure 27.1. Local linear regression of earnings on hours with
bootstrap confidence intervals


We next plot the corresponding marginal effects, computed as $\Delta \widehat{m}(x_0)$,
using the contrast(atcontrast(ar)) option of the margins command,
along with 95% confidence bands. We have


. * Plot effect of change in hours of 100 hours along with 95%
> confidence intervals
. qui margins, at(hours=(400(100)3200)) contrast(atcontrast(ar))
> vce(bootstrap, seed(10101) reps(400) nodots)

. marginsplot, yline(0)


Variables that uniquely identify margins: hours 


The plot is given in the second panel of figure 27.1. 


27.4.6 Comparison with the lpoly command


If interest lies only in a scatterplot of y on a single variable x with fitted
nonparametric regression line, then the lpoly command is adequate and is
much faster than the npregress kernel command for the following reasons.
First, the lpoly command default is to estimate the conditional mean at 50
equally spaced points (for N > 50), whereas the npregress command
evaluates at each of the N observations. Second, the lpoly command uses a
plugin estimate of the bandwidth (see [R] lpoly for the method), whereas the
npregress command uses computationally intensive cross-validation.


If additionally confidence bands are desired, the lpoly command is
again much quicker because it does not use the bootstrap to obtain standard
errors.


For a specific choice of kernel, bandwidth, and polynomial degree, the
two commands give essentially the same plot of the fitted conditional mean,
aside from the aforementioned evaluation at different values of the
regressor. This is demonstrated in end-of-chapter exercise 2.
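
The points above can be illustrated with a minimal sketch. It uses simulated data rather than the earnings data, and the variable names (zgrid, mhat, mhat_se, lower, upper) and the data-generating process are assumptions made only for illustration. The sketch asks lpoly for a local linear fit on a 50-point grid, stores the plugin bandwidth, and builds a pointwise 95% band from analytical standard errors, so no bootstrap is involved.

```stata
* Illustrative simulated data (assumed DGP, not the chapter's earnings data)
clear
set obs 200
set seed 10101
generate z = rnormal()
generate y = 1 + z + z^2 + 2*rnormal()

* Local linear fit evaluated at 50 equally spaced grid points;
* se() stores analytical standard errors of the fit
lpoly y z, degree(1) kernel(epan2) n(50) generate(zgrid mhat) se(mhat_se) nograph

* Plugin (rule-of-thumb) bandwidth selected by lpoly
scalar bw_rot = r(bwidth)
display "lpoly plugin bandwidth = " bw_rot

* Pointwise 95% confidence band from the analytical standard errors
generate lower = mhat - 1.96*mhat_se
generate upper = mhat + 1.96*mhat_se
```

The stored bandwidth could then be passed to npregress kernel to put the two commands on the same footing, which is the comparison pursued in end-of-chapter exercise 2.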


27.4.7 Basic series regression 


As an example of series regression, we present a third-order regression 
spline with the number of knots determined using cross-validation. 


. * Third-order natural spline with number of knots determined by
> cross-validation
. npregress series earnings hours, spline criterion(cv)


Computing approximating function

Minimizing cross-validation criterion

Iteration 0:  Cross-validation criterion = 3.89e+08
Iteration 1:  Cross-validation criterion = 3.81e+08

Computing average derivatives

Cubic-spline estimation                    Number of obs   =  181
Criterion: cross-validation                Number of knots =    3

                          Robust
earnings       Effect   std. err.      z    P>|z|    [95% conf. interval]
hours         14.4425   4.912036    2.94    0.003    4.815081    24.06991
Note: Effect estimates are averages of derivatives.


The average marginal effect is similar to that from npregress kernel.
Additional computations find $R^2 = 0.259$ compared with $R^2 = 0.268$ from
npregress kernel and $R^2 = 0.198$ from OLS.


The margins and marginsplot postestimation commands that are 
available following the npregress kernel command can also be used 
following the npregress series command. For example, 


. * Compute predicted mean at hours = 1000, 2000, 3000
. margins, at(hours=(1000(1000)3000))

Adjusted predictions                            Number of obs = 181
Model VCE: Robust

Expression: Mean function, predict()
1._at: hours = 1000
2._at: hours = 2000
3._at: hours = 3000

                      Delta-method
           Margin      std. err.       z    P>|z|    [95% conf. interval]
_at
  1         10439       1709.75     6.11    0.000    7087.948    13790.04
  2      27626.23      1886.063    14.65    0.000    23929.62    31322.85
  3      35240.05      4833.842     7.29    0.000     25765.9    44714.21


27.5 Nonparametric multiple regressor example 


We consider regression of earnings on hours, age, and a set of indicator
variables for educational level, specifically, less than 12 years of education,
12 years, 13–15 years, and 16 years and more.


Nonparametric local linear regression with the kernel(epan2) option
yields


. * Multiple regressor: Local linear with epan2 kernel and bootstrap standard
> errors
. npregress kernel earnings hours age i.edcat, kernel(epan2)
> vce(bootstrap, seed(10101) reps(400) nodots)

Bandwidth
                   Mean      Effect
   hours       319.5236    454.4498
     age        2.67847    3.809516
   edcat             .5          .5

Local-linear regression                    Number of obs      =    164
Continuous kernel : epan2                  E(Kernel obs)      =    164
Discrete kernel   : liracine               R-squared          = 0.6791
Bandwidth         : cross-validation

               Observed   Bootstrap                       Percentile
earnings       estimate   std. err.      z    P>|z|    [95% conf. interval]
Mean
  earnings     31542.15    1955.474  16.13    0.000    27801.93    34943.35
Effect
  hours        5.489886    22.30647   0.25    0.806   -64.18196    30.25034
  age         -475.8053    1142.455  -0.42    0.677   -2825.904    904.5559
  edcat
  (2 vs 1)     8534.147    2522.855   3.38    0.001    2675.327    12258.37
  (3 vs 1)     15411.41    5055.197   3.05    0.002    2832.935    23302.27
  (4 vs 1)     22813.51     7684.48   2.97    0.003    3604.087     34439.3

Note: Effect estimates are averages of derivatives for continuous covariates and
averages of contrasts for factor covariates.


. count if _unident_sample == 1 
17 


The first table gives the separate bandwidths for the two continuous 
regressors and the discrete ordered regressor, the bandwidths being 
established by cross-validation. Again, the bandwidths are larger for the 
partial effects than for the conditional mean. The model fits quite well with
$R^2 = 0.679$. The npregress command with bootstrap option does not state
that some observations were not identified, but because fitted means were 
computed for only 164 of the 181 observations, it must be that the mean was 
not identified for 17 observations, compared with 1 observation in the case 
of a single regressor. 


The average partial effect of hours worked is much smaller once age and 
educational categories are added as regressors and is now statistically 
insignificant. 


The most statistically significant average partial effects are those for the 
educational category variables. To perform a joint test of their significance, 
we verify the appropriate names for the stored variables using command 
matrix list e(b) and then perform the joint test. 


. * Test joint statistical significance of education categories
. matrix list e(b)

e(b)[1,6]
          Mean:      Effect:      Effect:      Effect:      Effect:      Effect:
                                                r2vs1.       r3vs1.       r4vs1.
      earnings        hours          age        edcat        edcat        edcat
y1   31542.151    5.4898856   -475.80533     8534.147    15411.408    22813.509

. test r2vs1.edcat r3vs1.edcat r4vs1.edcat

 ( 1)  [Effect]r2vs1.edcat = 0
 ( 2)  [Effect]r3vs1.edcat = 0
 ( 3)  [Effect]r4vs1.edcat = 0

           chi2(  3) =   12.33
         Prob > chi2 =  0.0063


The educational category regressors are jointly statistically significant at 5%. 


OLS estimation over the same 161 observations as identified using the 
preceding npregress kernel command yields 


. * Multiple regressor: OLS on same sample as npregress kernel
. regress earnings hours age i.edcat if _unident_sample == 0, vce(robust)

Linear regression                               Number of obs =    161
                                                F(5, 155)     =  11.24
                                                Prob > F      = 0.0000
                                                R-squared     = 0.2989
                                                Root MSE      =  18184

                          Robust
earnings   Coefficient   std. err.      t    P>|t|    [95% conf. interval]
hours         16.60835   3.457316    4.80    0.000    9.778813    23.43789
age           146.0515   259.0136    0.56    0.574   -365.6007    657.7037
edcat
  2           10053.61   3153.741    3.19    0.002    3823.752    16283.47
  3           11801.48   3204.822    3.68    0.000    5470.719    18132.25
  4           19637.09   4155.196    4.73    0.000    11428.97    27845.21
_cons        -21118.26   11896.43   -1.78    0.078    -44618.3    2381.788


OLS fits worse than the local linear regression, with $R^2 = 0.300$ compared
with 0.679. The OLS coefficients indicate larger average effects than local
linear regression. The OLS standard errors are smaller for continuous
regressors, as expected, and for the education categories aside from the
second category.


Nonparametric series estimation using third-order regression splines
yields


. * Multiple regressor: Series regression using third-order splines
. npregress series earnings hours age, spline criterion(cv) asis(i.edcat)

Computing approximating function

Minimizing cross-validation criterion

Iteration 0:  Cross-validation criterion = 7.42e+08

Computing average derivatives

Cubic-spline estimation                    Number of obs   =  181
Criterion: cross-validation                Number of knots =    1

                          Robust
earnings       Effect   std. err.      z    P>|z|    [95% conf. interval]
hours        13.16959   2.419944    5.44    0.000    8.426587    17.91259
age          30.57134   710.2491    0.04    0.966   -1361.491    1422.634
edcat
  (2 vs 1)   13047.44   3051.515    4.28    0.000    7066.586     19028.3
  (3 vs 1)   14182.32   3291.746    4.31    0.000    7730.612    20634.02
  (4 vs 1)   21772.14   3826.244    5.69    0.000    14272.84    29271.45

Note: Effect estimates are averages of derivatives.


Only one knot is chosen. Additional computation yields $R^2 = 0.427$,
substantially higher than OLS but lower than from kernel regression.


27.6 Partial linear model 


The partial linear regression model is the regular linear regression model,
except that one (or more) of the regressors appears in the model nonlinearly
as $g(z)$, where the function $g(\cdot)$ is unspecified. We focus on the case of
scalar $z$, with

$$y = \alpha + \mathbf{x}'\boldsymbol{\beta} + g(z) + u$$

The goal is to obtain a root-N consistent and asymptotically normal
estimator of $\boldsymbol{\beta}$, given the usual assumption that $E(u|\mathbf{x}, z) = 0$. This provides
an estimate of the marginal effect on $E(y|\mathbf{x}, z)$ of changes in $\mathbf{x}$.


Taking conditional expectations with respect to $z$, we have

$$E(y|z) = \alpha + E(\mathbf{x}|z)'\boldsymbol{\beta} + g(z)$$

given the usual assumption that $E(u|z) = 0$. Subtracting the two equations
yields

$$y - E(y|z) = \{\mathbf{x} - E(\mathbf{x}|z)\}'\boldsymbol{\beta} + u$$


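which has eliminated $g(z)$. Robinson (1988) proposed this differencing
transformation and estimation of $\boldsymbol{\beta}$ from the following OLS regression,

$$y_i - \widehat{E}(y_i|z_i) = \{\mathbf{x}_i - \widehat{E}(\mathbf{x}_i|z_i)\}'\boldsymbol{\beta} + v_i \qquad (27.4)$$

where the OLS regression does not include a constant and $\widehat{E}(y_i|z_i)$ and
$\widehat{E}(x_{1i}|z_i), \ldots, \widehat{E}(x_{Ki}|z_i)$ are the fitted values from univariate nonparametric
regression such as local constant or local linear. The intercept $\alpha$ is not
identified separately from $g(z)$. Nonetheless, it is convenient to center $g(z)$
around an intercept. Then,

$$\widehat{g}(z_i) = y_i - \widehat{\alpha} - \mathbf{x}_i'\widehat{\boldsymbol{\beta}}$$

where $\widehat{\alpha}$ is the estimated intercept from OLS regression of $y$ on an intercept, $\mathbf{x}$,
and $z$, and $\widehat{\boldsymbol{\beta}}$ is the estimate from (27.4).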
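
The two steps of the differencing estimator can be sketched by hand with official Stata commands. This is a minimal sketch under stated assumptions, not a replacement for a packaged implementation: the data are simulated from a partial linear model with $g(z) = z + z^2$, the first-stage smoother is local linear lpoly with a Gaussian kernel and its default plugin bandwidth, and the names fit_* and u_* are hypothetical. The reported robust standard errors ignore first-stage estimation error.

```stata
* Illustrative data from an assumed partial linear DGP:
* y = 1 + x1 + x2 + g(z) + u with g(z) = z + z^2
clear
set obs 200
set seed 10101
generate x1 = rnormal()
generate x2 = rnormal() + 0.5*x1
generate z  = rnormal() + 0.5*x1
generate y  = 1 + x1 + x2 + z + z^2 + 2*rnormal()

* Step 1: nonparametric fits of E(y|z), E(x1|z), E(x2|z) at each observed z,
* followed by the residuals y - E(y|z), x1 - E(x1|z), x2 - E(x2|z)
foreach v in y x1 x2 {
    quietly lpoly `v' z, degree(1) kernel(gaussian) at(z) generate(fit_`v') nograph
    generate u_`v' = `v' - fit_`v'
}

* Step 2: OLS of the differenced y on the differenced x's, without a constant
regress u_y u_x1 u_x2, noconstant vce(robust)
```

Because both slope parameters equal one in this simulated design, the two coefficients should be near one, up to sampling error.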


The community-contributed semipar command (Verardi and
Debarsy 2012) implements this procedure, using the lpoly command to
obtain the nonparametric fitted values. The default is to use local linear
kernel regression with a Gaussian kernel and bandwidth that is calculated by
the plugin formula used by the lpoly command. The trim(#) option trims
by discarding those observations where the data on z are sparse, with kernel
density estimate below a user-provided threshold.


As an example, for the rest of this chapter, we generate the same data as
used in sections 14.4–14.6. The model is $y = 1 + x_1 + x_2 + g(z) + u$,
where $g(z) = z + z^2$. Specifically,


. * Generated data: y = 1 + 1*x1 + 1*x2 + g(z) + u where g(z) = z + z^2
. clear

. set obs 200
Number of observations (_N) was 0, now 200.

. set seed 10101

. generate x1 = rnormal()

. generate x2 = rnormal() + 0.5*x1

. generate z = rnormal() + 0.5*x1

. generate zsq = z^2

. generate y = 1 + x1 + x2 + z + zsq + 2*rnormal()


The semipar command, with options used to obtain heteroskedastic-robust
standard errors and more detailed labeling of the associated figure
that is created, yields


. * Robinson estimator given unknown g(z) using community-contributed
> command semipar
. semipar y x1 x2, nonpar(z) robust ci title("Partial linear: f(z) against z")
> ytitle("y-b*x and f(z)") xtitle("z")

                                            Number of obs =    200
                                            R-squared     = 0.3918
                                            Adj R-squared = 0.3857
                                            Root MSE      = 1.9925

y    Coefficient   Std. err.      t    P>|t|    [95% conf. interval]
x1      .9029397    .1516576   5.95    0.000    .6038683    1.202011
x2      .9667308    .1258653   7.68    0.000    .7185221    1.214939


The coefficient estimates are close to the data-generating process (DGP) 
values of one. The standard errors are asymptotic standard errors that do not 
control for any estimation error in the first-stage kernel estimation. 


For comparison, we consider OLS regression on the DGP model, where we
use knowledge that $g(z) = z + z^2$.


. * OLS estimation given knowledge of the DGP with g(z) = z + z^2
. regress y x1 x2 z zsq, vce(robust)

Linear regression                               Number of obs =    200
                                                F(4, 195)     = 138.56
                                                Prob > F      = 0.0000
                                                R-squared     = 0.6860
                                                Root MSE      = 2.0402

                       Robust
y      Coefficient   std. err.      t    P>|t|    [95% conf. interval]
x1         .879966    .1539411   5.72    0.000    .5763626    1.183569
x2        .9949839    .1292254   7.70    0.000     .740125    1.249843
z         1.078095    .1345329   8.01    0.000    .8127681    1.343421
zsq       1.065932    .0819577  13.01    0.000    .9042949     1.22757
_cons       .64508    .1844171   3.50    0.001    .2813718    1.008788


The semiparametric model coefficients and their standard errors are very
close to those obtained by OLS regression on the DGP model. At the same
time, the fit is much better for OLS, with $R^2 = 0.686$ compared with
$R^2 = 0.392$, due to the noise in nonparametrically estimating $g(z)$.


The first panel of figure 27.2 presents a plot of $g(z)$ on $z$ that is
automatically created by the semipar command. The relationship appears to
be quadratic, as expected given that the DGP was linear in $x_1$ and $x_2$ and $g(z)$,
where $g(z)$ was quadratic.
Figure 27.2. Semiparametric regression: Partial linear and single-index
models


The test(#) option of the semipar command computes the test due to
Härdle and Mammen (1993) of whether $g(z)$ is a polynomial of specified
degree. This test was designed for comparing a kernel regression estimate
with a polynomial function and may not be applicable here where $g(z)$ is
more complicated than simply the fitted values from kernel regression of y
on z. From output not given, the test(1) option led to p = 0.009, the
test(2) option led to p = 0.787, and the test(3) option led to p = 0.860.
The null hypothesis is that $g(z)$ is a polynomial of the specified degree, so at
level 0.05, we reject a linear model and do not reject a quadratic or cubic
model. We conclude that a quadratic model is appropriate. The test is
implemented using a bootstrap. Before each test, we gave command set
seed 10101 and used option nsims(1000) to increase the number of
bootstraps beyond the default 100.


Note that the results are very dependent on the bandwidths used. In
particular, for the Robinson differencing method to yield root-N consistent
estimates of $\boldsymbol{\beta}$ when $z$ is scalar, the first-step kernel regressions should
undersmooth, with bandwidths smaller than the usual optimal bandwidths. In
end-of-chapter exercise 5, we reproduce the results of the semipar command
from first principles using the lpoly command and find that in this example,
the kernel bandwidths were reasonably narrow.


The Robinson differencing method extends to the case where $z$ is a
vector rather than a scalar. The method focuses on estimation of $\boldsymbol{\beta}$ with $g(z)$
viewed as a nuisance function. Other methods have been developed to fit
partial linear models, including using a backfitting algorithm and using
penalized splines. An example is given in section 27.8 on GAMs.


A series estimator version of the partial linear model can be fit using the 
asis() option of the npregress series command; see section 27.3.2. 


More importantly, methods have been recently developed to use machine 
learning methods, rather than standard nonparametric methods, when z is 
high dimensional; see section 28.8. 


27.7 Single-index model 


We consider the single-index model

$$y = m(\mathbf{x}'\boldsymbol{\beta}) + u$$

where $E(u|\mathbf{x}) = 0$, so the conditional mean of $y$ is $m(\mathbf{x}'\boldsymbol{\beta})$. For example,
$E(y|\mathbf{x}) = \exp(\mathbf{x}'\boldsymbol{\beta})$ is a single-index model, as are logit and probit models.
A single-index model has the property that if $\beta_j$ is $a$ times $\beta_k$, then the
marginal effect on $E(y|\mathbf{x})$ of increasing $x_j$ by one unit is $a$ times the
marginal effect of increasing $x_k$ by one unit; see section 13.7.3.
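
The proportionality property follows from the chain rule. A short derivation, writing $m'(\cdot)$ for the derivative of $m(\cdot)$ (notation assumed here for illustration and treating the regressors as continuous):

```latex
\frac{\partial E(y|\mathbf{x})}{\partial x_j}
  = m'(\mathbf{x}'\boldsymbol{\beta})\,\beta_j ,
\qquad
\frac{\partial E(y|\mathbf{x})/\partial x_j}
     {\partial E(y|\mathbf{x})/\partial x_k}
  = \frac{\beta_j}{\beta_k} = a
  \quad \text{when } \beta_j = a\,\beta_k .
```

The common factor $m'(\mathbf{x}'\boldsymbol{\beta})$ cancels in the ratio, so relative marginal effects do not depend on the unknown function or on the evaluation point.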


If the function $m(\cdot)$ were known, we could estimate $\boldsymbol{\beta}$ by nonlinear least
squares, which minimizes

$$\sum_{i=1}^{N} \{y_i - m(\mathbf{x}_i'\boldsymbol{\beta})\}^2$$


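Now consider the semiparametric case where $m(\cdot)$ is unknown. Then
Ichimura (1993) proposed the semiparametric least-squares estimator, which
obtains $\widehat{m}$ and $\widehat{\boldsymbol{\beta}}$, which minimize

$$\sum_{i=1}^{N} w(\mathbf{x}_i) \{y_i - \widehat{m}_i(\mathbf{x}_i'\boldsymbol{\beta})\}^2$$

where $\widehat{m}_i(\mathbf{x}_i'\boldsymbol{\beta})$ is a kernel regression estimate of $E(y_i|\mathbf{x}_i'\boldsymbol{\beta})$ and $w(\mathbf{x}_i)$ is a
trimming parameter equal to one for most observations but equal to zero for
those with the largest and smallest values of $\mathbf{x}_i$.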


Not all elements of $\boldsymbol{\beta}$ are identified. In general, a function $m(a + bx)$
can always be simplified to a function $h(x)$, so the intercept is not identified
and $b$ is identified only up to a multiple. It follows that in the current setting,
the intercept is not identified, and slope parameters are identified only up to
scale. The standard procedure is to set the first slope parameter to one.
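
The identification argument can be written out explicitly. In the sketch below, $h$, $\tilde{m}$, and the constant $c$ are symbols introduced only for illustration:

```latex
m(a + b\,\mathbf{x}'\boldsymbol{\beta}) = h(\mathbf{x}'\boldsymbol{\beta}),
\qquad h(v) \equiv m(a + b\,v),
\\[4pt]
m(\mathbf{x}'\boldsymbol{\beta})
  = \tilde{m}\{\mathbf{x}'(c\,\boldsymbol{\beta})\},
\qquad \tilde{m}(v) \equiv m(v/c), \quad c \neq 0 .
```

Because $m(\cdot)$ is unrestricted, the pairs $(m, \boldsymbol{\beta})$ and $(\tilde{m}, c\boldsymbol{\beta})$ imply the same conditional mean, which is why a normalization such as fixing one slope at one is needed.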


The community-contributed sls command (Barker 2014) implements
Ichimura's estimator. We apply this command to the same generated data as
used in the preceding partial linear model example, where we use knowledge
that the regressors $\mathbf{x}$ are x1, x2, z, and zsq but do not use knowledge that the
DGP specified $m(\mathbf{x}'\boldsymbol{\beta}) = \mathbf{x}'\boldsymbol{\beta}$. We obtain


. * Ichimura semiparametric least squares for single-index model 
. sls y x1 x2 z zsq, trim(1,99) 


initial: SSq(b) = 1632.3579 
alternative: SSq(b) = 1418.7811 
rescale: SSq(b) = 1113.168 
SLS 0: SSq(b) = 1113.168 
SLS 1: SSq(b) = 832.81923 
SLS 2: SSq(b) = 832.51979 
SLS 3: SSq(b) = 832.51701 
SLS 4: SSq(b) = 832.51701 
pilot bandwidth 
1.302350774 
SLS 0: SSq(b) = 852.36723 
SLS 1: SSq(b) = 836.63241 
SLS 2: SSq(b) = 813.31622 
SLS 3: SSq(b) = 810.40664 (not concave) 
SLS 4: SSq(b) = 807.00822 
SLS 5: SSq(b) = 802.75174 
SLS 6: SSq(b) = 801.87467 
SLS 7: SSq(b) = 800.86586 
SLS 8: SSq(b) = 800.728 
SLS 9: SSq(b) = 800.69764 
SLS 10: SSq(b) = 800.69069 
SLS 11: SSq(b) = 800.68898 
SLS 12: SSq(b) = 800.68864 
SLS 13: SSq(b) = 800.68856 
SLS 14: SSq(b) = 800.68855 
                                                Number of obs =     200
                                                root MSE      = 2.00086

y        Coefficient   Std. err.      z    P>|z|    [95% conf. interval]
Index
   x2       1.603177    .3236122   4.95    0.000    .9689089    2.237445
   z        1.802992     .327212   5.51    0.000    1.161668    2.444316
   zsq      1.526091    .2310128   6.61    0.000    1.073314    1.978867
   x1              1    (offset)


The coefficient of variable x2, for example, implies that a 1-unit change in
x2 is associated with a change in $E(y|\mathbf{x})$ that is 1.603 times the change
associated with a 1-unit change in x1. The need to estimate $m(\cdot)$ leads to
considerable loss in efficiency, with z statistics that are 15% to 30% lower
than those obtained from OLS regression. For this DGP, with all $\beta_j = 1$, the
ratios of all the slope coefficients should equal 1, aside from imprecision in
estimation.


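The predictions $\widehat{m}(\mathbf{x}'\widehat{\boldsymbol{\beta}})$ can be obtained using the ey option of the
predict postestimation command, and the xb option yields $\mathbf{x}'\widehat{\boldsymbol{\beta}}$. The
$R^2 = 0.669$ is close to $R^2 = 0.686$ for OLS. The following code computes
these quantities and plots $\widehat{m}(\mathbf{x}'\widehat{\boldsymbol{\beta}})$ against $\mathbf{x}'\widehat{\boldsymbol{\beta}}$.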


. * Create plot of yhat against the index x'b
. predict yhat, ey

. predict Index, xb

. qui correlate y yhat

. display "R-squared = " r(rho)^2
R-squared = .66931948

. twoway (scatter y Index) (line yhat Index, sort lwidth(thick)),
> title("Single-index: yhat against x'b")
> xtitle("Index") ytitle("y and yhat") legend(off)


The plot given in the second panel of figure 27.2 suggests that $m(\cdot)$ is a
linear function, as expected given that the DGP had $m(\mathbf{x}'\boldsymbol{\beta}) = \mathbf{x}'\boldsymbol{\beta}$.


27.8 Generalized additive models 


The GAM specifies

$$y = g_1(x_1) + \cdots + g_K(x_K) + u$$

where $E(u|\mathbf{x}) = 0$. While the additivity may seem restrictive, the regressors
may themselves be interactions. For example, we may have
$y = g_1(x) + g_2(z) + g_3(x \times z) + u$.


We consider estimation when the functions $g_1(\cdot), \ldots, g_K(\cdot)$ are not
specified. Then the GAM has reduced a $K$-dimensional nonparametric model
to $K$ one-dimensional nonparametric models.


The model is fit using a backfitting algorithm. Let $\widehat{\alpha} = \bar{y}$, and initially
set all $\widehat{g}_k(x_{ik}) = 0$. Define the partial residual for the $j$th regressor to be

$$r_{ij} = y_i - \widehat{\alpha} - \sum_{l=1,\, l \neq j}^{K} \widehat{g}_l(x_{il})$$

For each $j = 1, \ldots, K$, perform univariate nonparametric regression of $r_{ij}$
on $x_{ij}$ and update $\widehat{g}_j(x_{ij})$ to be the new fitted values from this regression.
Repeat this cycle for as long as necessary until the estimates $\widehat{g}_j(x_{ij})$
stabilize.
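
The backfitting cycle can be sketched with official Stata commands. The sketch below makes several assumptions that are not part of the description above: it uses simulated data, it uses local linear lpoly with a Gaussian kernel and default bandwidth as the univariate smoother (rather than smoothing splines), it runs a fixed 20 cycles instead of checking convergence, and it recenters each fitted component to mean zero so that the intercept stays at the sample mean of y. All variable and scalar names are hypothetical.

```stata
* Illustrative data (assumed DGP): additive in x1, x2, and z
clear
set obs 200
set seed 10101
generate x1 = rnormal()
generate x2 = rnormal() + 0.5*x1
generate z  = rnormal() + 0.5*x1
generate y  = 1 + x1 + x2 + z + z^2 + 2*rnormal()

* Initialize: intercept at the mean of y, all components at zero
quietly summarize y
scalar ybar = r(mean)
foreach v in x1 x2 z {
    generate g_`v' = 0
}

* Backfitting: cycle over regressors, smoothing each partial residual
forvalues cycle = 1/20 {
    foreach v in x1 x2 z {
        capture drop r_tmp
        capture drop fit_tmp
        * Partial residual removes the intercept and all other components
        generate r_tmp = y - ybar - g_x1 - g_x2 - g_z + g_`v'
        quietly lpoly r_tmp `v', degree(1) kernel(gaussian) at(`v') generate(fit_tmp) nograph
        * Recenter the updated component (an added normalization)
        quietly summarize fit_tmp
        quietly replace g_`v' = fit_tmp - r(mean)
    }
}

* Fitted values and squared correlation between y and the fit
generate yhat = ybar + g_x1 + g_x2 + g_z
quietly correlate y yhat
display "R-squared = " r(rho)^2
```

Plotting g_z against z after the loop gives a rough visual check of whether the estimated component picks up the curvature built into the simulated design.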


The most commonly used nonparametric method to estimate the
components of the GAM is smoothing splines, rather than local constant or
local linear kernel-weighted regression. Smoothing splines are presented in
section 14.5.3. Briefly, a spline splits the data on the regressor x into
segments with the segment boundaries called knots. A cubic spline fits a
cubic polynomial between every pair of knots in such a way that the fitted
value of $\widehat{y}$ is continuous at each knot and has continuous first and second
derivatives at each knot. A cubic smoothing spline lets the knots be the
distinct values of x but then includes a penalty function to avoid overfitting.
The extent of the penalty varies with the so-called effective degrees of
freedom; higher effective degrees of freedom correspond to a less smooth fit,
while a value of one is equivalent to OLS linear regression; see section 14.5.3.


The community-contributed gam command (Royston and Ambler 1998)
implements estimation of the GAM using smoothing splines. It provides a
front end to a Windows implementation of a slightly modified version of the
Fortran program GAMFIT, written by Hastie and Tibshirani (1990). It
works only with Windows implementations of Stata.


We apply the gam command to the same generated data as used in the 
preceding Robinson estimator example, where we do not use knowledge that 
z enters quadratically or that x1 and x2 enter linearly. For a reasonably 
flexible model, we set the effective degrees of freedom for each regressor to 
3. We obtain 


. * Generalized additive model - requires Windows version of Stata 
. gam y x1 x2 z, df(3) 


200 records merged. 


Generalized Additive Model with family gauss, link ident. 


Model df     =    10.004                No. of obs =       200
Deviance     =   820.723                Dispersion =   4.31968

y        df   Lin. Coef.   Std. Err.        z      Gain    P>Gain
x1    3.000     .8474175    .1804832    4.695     1.998    0.3683
x2    3.001     .9830467    .1459322    6.736     2.134    0.3444
z     3.002     1.143934    .1438065    7.955   131.768    0.0000
_cons     1       2.1644     .146964   14.727

Total gain (nonlinearity chisquare) = 135.900 (6.003 df), P = 0.0000


The reported effective degrees of freedom are close to, but not exactly
identical to, those specified; this is typical and not a cause for concern. The
output decomposes the effect of each variable into linear and nonlinear
components. For the linear component, the reported coefficient of 0.847 for
x1, for example, is the slope coefficient from linear OLS regression of the
partial residual for x1 ($r_{i1}$) on x1. All three linear components are
statistically significant, indicating that there is a relationship. For the
nonlinear component, the Gain and P>Gain columns provide a test of
whether each $g_j(\cdot)$ is nonlinear. At level 0.05, we do not reject linearity in x1
and x2, while there is clearly nonlinearity in z. These results are all expected
given the DGP.


The model fit is good. The dispersion statistic is
$\sum_i (y_i - \widehat{y}_i)^2/(N - \mathrm{df})$, where here df = 10.004, so
$\sum_i (y_i - \widehat{y}_i)^2 = 4.31968 \times (200 - 10.004) = 820.72$. And
$\sum_i (y_i - \bar{y})^2 = 2584.86$. So $R^2 = 0.682$, very close to 0.686 from OLS
regression of y on x1, x2, z, and zsq.
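
The arithmetic can be checked directly. A minimal sketch that plugs in the dispersion, model degrees of freedom, and total sum of squares quoted in the text (the scalar names rss, tss, and r2 are chosen here for illustration):

```stata
* Residual sum of squares implied by the reported dispersion and model df
scalar rss = 4.31968*(200 - 10.004)

* Total sum of squares of y about its mean, as quoted in the text
scalar tss = 2584.86

* R-squared as one minus the ratio of residual to total sum of squares
scalar r2 = 1 - rss/tss
display "RSS = " %9.2f rss "   R2 = " %5.3f r2
```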


The gam command stores the fitted values $\widehat{y}_i$ in variable GAM_mu. Let
variable x1 be the first regressor, denoted $x_1$. Then the variable s_x1 stores
$\widehat{g}_1(x_{i1})$, the variable e_x1 stores the standard error of $\widehat{g}_1(x_{i1})$, and the
variable r_x1 stores the partial residual $r_{i1}$. So a plot of s_x1 on x1 is a plot
of $\widehat{g}_1(x_{i1})$ on $x_{i1}$.


The gamplot postestimation command plots, for a specified regressor,
say, $x_j$, the fitted smooth $\widehat{g}_j(x_{ij})$ against $x_{ij}$, along with a
95% confidence interval and a scatterplot of the partial residual $r_{ij}$ against
$x_{ij}$. Creating these plots separately for each of the three regressors, we have


. * Plot fitted smooth (and partial residual) against x for each regressor x
. qui gam y x1 x2 z, df(3)

. gamplot x1, saving(graph1.gph, replace)
(file graph1.gph saved)

. gamplot x2, saving(graph2.gph, replace)
(file graph2.gph saved)

. gamplot z, saving(graph3.gph, replace)
(file graph3.gph saved)

. graph combine graph1.gph graph2.gph graph3.gph,
> iscale(1.2) rows(1) ysize(2) xsize(6)


For the first panel in figure 27.3, the solid line is a plot of the fitted smooth
s_x1 on x1; the upper confidence band is a plot of s_x1 + 1.96 × e_x1 on
x1; and the scatterplot is of the partial residual r_x1 on x1. The panels
clearly show a linear relationship for x1 and x2 and a quadratic relationship
for z.
Figure 27.3. GAM: Partial residuals and fitted $g(\cdot)$ against
regressors x1, x2, and z


The gam command provides an alternative method for fitting the partial
linear model by setting the effective degrees of freedom to one for the
variables that enter linearly and using a cubic smoothing spline with
effective degrees of freedom three, for example, for the variables that enter
nonlinearly. For the current example, we have


. * Partial linear model y = b0+b1*x1+b2*x2+g(z) estimated using GAM
. gam y x1 x2 z, df(x1:1, x2:1, z:3)

200 records merged.

Generalized Additive Model with family gauss, link ident.

Model df     =     6.002                No. of obs =       200
Deviance     =    836.24                Dispersion =   4.31057

y        df   Lin. Coef.   Std. Err.        z      Gain    P>Gain
x1        1     .8482625    .1802929    4.705
x2        1     .9720385    .1457786    6.668
z     3.002     1.137985    .1436547    7.922   137.572    0.0000
_cons     1       2.1644     .146809   14.743

Total gain (nonlinearity chisquare) = 137.572 (2.002 df), P = 0.0000

. display "R2 = " 1 - e(disp)*e(tdf) / 2584.6
R2 = .67645209


The $R^2$ is 0.676 compared with 0.392 using Robinson's differencing
estimator and 0.686 using OLS.


27.9 Additional resources 


The standard econometrics references for nonparametric and 
semiparametric regression are Pagan and Ullah (1999) and Li and 
Racine (2007). Cameron and Trivedi (2005, chap. 9) and Hansen (2022, 
chaps. 19 and 20) provide a briefer treatment. 


The npregress kernel command computes kernel-weighted local 
constant and local linear regression with single or multiple regressors and is 
a much richer command than lpoly. The npregress series command 
implements nonparametric polynomial and spline regression. The most 
commonly used semiparametric models can be implemented using the 
following community-contributed commands detailed in the main text: 
semipar for the partial linear model, sls for the single-index model, and
gam for the GAM. The last command works only on Windows computers. 
See also chapter 28, which presents machine learning methods for 
prediction such as lasso, neural networks, regression trees, and random 
forests, and for estimation of the partial linear model. 


27.10 Exercises 


1. Consider data on (y, x) with sample values (3, 2), (2, 4), (6, 6), (10, 8),
and (8, 10). Using the npregress command, obtain the local constant
estimator of y given x using the epan2 kernel and option
meanbwidth(3, copy), which sets the bandwidth to 3. Use the
predict command to obtain the fitted values. Now, manually compute
the local constant estimate of y when x = 6, and verify that you obtain
the same estimate as that obtained from the npregress command.

2. Use the generated dataset of section 27.6, and perform nonparametric 
regression of y on z. First, perform local linear regression using the 
npregress command with epan2 kernel. Repeat using the lpoly 
command with the at(z) and degree(1) options. Compare the fitted
values and explain any difference. Now, again use the npregress 
command, but set the bandwidth to be the same as that used by the 
lpoly command. Compare the fitted values and explain any difference. 
Now, repeat the entire exercise but using local constant regression, 
rather than local linear regression. You will need to use the 
noderivative option of the npregress command. 

3. Use mus204mabeldecomp.dta, which has Australian data on log of
doctor annual earnings (logyearn) and on annual hours of work
(yhrs). Form variable logyhrs (= ln(yhrs)), and to speed
computation, obtain a 2% subsample using commands set seed
10101 and keep if runiform()<0.02. For visual analysis, provide a
graph with a scatterplot of logyearn on logyhrs, a fitted linear
regression, and a fitted local linear regression using the lpoly
command. Does the relationship appear to be linear? Next, obtain local
linear estimates using command npregress kernel logyearn
logyhrs, kernel(epan2) vce(bootstrap, seed(10101) reps(100)).
Is the estimated average effect consistent with an hours elasticity of
earnings equal to one? Compare the estimated average effect to that
obtained from OLS regression. Finally, following the local linear
regression, obtain predicted means of logyearn when variable
logyhrs is evaluated at values corresponding to annual hours of 1,000,
2,000, and 3,000.


4. Continue with the data of the previous question. We will add as 
controls experience (expr), experience-squared (exprsq), gender 
(female), and presence in household of child of age under five 
(childu5). Perform local linear regression of logyearn on logyhrs 
and these additional controls; note that the discrete regressor needs to
be included as i.female. Is the estimated average effect consistent
with an hours elasticity of earnings equal to one? Does gender make a 
big difference? Compare the estimated average effects of regressors 
with those obtained from OLS regression. Finally, following the local
linear regression, obtain predicted means of logyearn when variable
logyhrs is evaluated at values corresponding to annual hours of 1,000,
2,000, and 3,000. 

5. Implement the semipar example in section 27.6 manually as follows.
Use command lpoly with options at(z), degree(1), and
kernel(gaussian) to perform local linear regression of x1 on z.
Compute a variable u_x1 that is the residual from this regression.
Repeat for regression of x2 on z and y on z to yield residuals u_x2 and
u_y. OLS regress u_y on u_x1 and u_x2, without an intercept, and
obtain heteroskedastic-robust standard errors. Compare your answer
with that from command semipar y x1 x2, nonpar(z) nograph
robust. Next, we want to reproduce the graph produced by the
semipar command. Following the preceding OLS regression, generate
the variable ylessxb = y-_b[u_x1]*x1-_b[u_x2]*x2. Produce a
scatterplot of ylessxb on z, along with the fitted values from local
linear regression of ylessxb on z using command lpoly with options
at(z), degree(1), and kernel(gaussian).


Chapter 28 
Machine learning for prediction and inference 


28.1 Introduction 


Microeconometrics studies tend to focus on estimation of one or more
regression model parameters $\boldsymbol{\beta}$ and subsequent statistical inference on $\boldsymbol{\beta}$ or
relevant marginal effects that are a function of $\boldsymbol{\beta}$.


A quite different purpose of statistical analysis is prediction. For 
example, we may wish to predict the probability of 12-month survival 
following hip or knee replacement surgery. In that case, we are interested in 
obtaining a good prediction $\widehat{y}$ of the dependent variable $y$.


In principle, nonparametric methods such as kernel regression can 
provide a flexible way to obtain predictions. But these methods suffer from 
the curse of dimensionality if there are many potential regressors. The 
machine learning literature has proposed a wide range of alternative 
techniques for prediction, where the term “machine learning” is used 
because the machine, here the computer, selects the best predictor using 
only the data at hand, rather than via a model specified by the researcher, 
who has detailed knowledge of the specific application. 


Machine learning entails data mining that can lead to overfitting the 
sample at hand. To guard against overfitting, one assesses models on the 
basis of out-of-sample prediction using cross-validation (CV) or by 
penalizing model complexity using information criteria or other penalty 
measures. We begin by presenting CV and penalty measures. 


We then present various techniques for prediction that are used in the 
machine learning literature. We focus on shrinkage estimators, notably the 
lasso. Lasso, originally an acronym for “least absolute shrinkage and 
selection operator”, is now considered a word. Additionally, a brief review 
is provided of principal components for dimension reduction, as well as 
neural networks, regression trees, and random forests for flexible nonlinear 
models. There is no universal best method for prediction, though for some 
specific data types, such as images or text, one method may work 
particularly well. Often, an ensemble prediction that is a weighted average 
of predictions from different methods performs best. 


The machine learning literature has focused on prediction of y. The 
recent econometrics literature has developed methods that use machine 
learning methods as an intermediate input to the ultimate goal of estimating 
and performing inference on model parameters of interest. This is a very 
active area of research. 


A leading example of inference that uses machine learning methods is inference 
on $\alpha$ in the partial linear model $y = d'\alpha + g(x) + u$, where machine 
learning methods are used to determine the best function of controls $x$. A 
second leading example is instrumental-variables (IV) estimation with many 
potential instruments or controls, or both. Then we wish to select only a few 
variables to avoid the weak instrument problems that can arise with many 
instruments. These and related examples cover many applied 
microeconometrics applications, and the development of methods for valid 
inference with machine learning as an input promises to be revolutionary. 


The econometrics inference literature to date has emphasized use of the 
lasso, and Stata 16 has introduced a suite of commands for prediction and 
inference using the lasso. For these reasons, we emphasize the lasso. We 
expect that additional methods, notably, random forests and neural nets, will 
be increasingly used in microeconometric studies. 


28.2 Measuring the predictive ability of a model 


There are several ways to measure the predictive ability of a model. Ideally, 
such measures penalize model complexity and control for in-sample 
overfitting. 


Traditionally, econometricians have used penalty measures such as in-sample 
adjusted $R^2$ and in-sample information criteria. The machine 
learning literature instead emphasizes out-of-sample predictive ability using 
CV. An introductory treatment is given in James et al. (2021, chaps. 5, 6.1). 


28.2.1 Generated data example 


The example used in much of this chapter is one where the continuous 
dependent variable y is regressed on three correlated normally distributed 
regressors, denoted x1, x2, and x3. The actual data-generating process (DGP) 
for y is a linear model with an intercept and x1 alone. Many of the methods 
can be adapted for other types of data such as binary outcomes and counts. 


The data are generated using commands presented in chapter 5, notably, 
the command drawnorm to generate correlated normally distributed 
regressors with correlation 0.5 and the rnormal() function to obtain 
normally distributed errors. The DGP for y is a linear model with intercept 2 
and slope coefficient 1 for variable x1. We have 


. * Generate three correlated variables (rho = 0.5) and y linear only in x1 


. qui set obs 40 

. set seed 12345 

. matrix MU = (0,0,0) 

. scalar rho = 0.5 

. matrix SIGMA = (1,rho,rho \ rho,1,rho \ rho,rho,1) 
. drawnorm x1 x2 x3, means(MU) cov(SIGMA) 


. generate y = 2 + 1*x1 + rnormal(0,3) 
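For readers who want to see the structure of this DGP outside Stata, the following is a minimal Python sketch, not a replication of the Stata draws. It assumes a one-factor construction (a shared normal component plus an idiosyncratic one) to obtain unit-variance regressors with common correlation rho; the function names simulate and corr are hypothetical, and Python's random number stream differs from Stata's, so the numbers will not match the output in this section.

```python
import math
import random

def simulate(n, rho=0.5, seed=12345):
    """Return n draws of (x1, x2, x3, y); the x's have unit variance and correlation rho >= 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        common = rng.gauss(0.0, 1.0)  # shared factor that induces the common correlation
        x = [math.sqrt(rho) * common + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(3)]
        y = 2.0 + 1.0 * x[0] + rng.gauss(0.0, 3.0)  # y depends on x1 only
        rows.append((x[0], x[1], x[2], y))
    return rows

def corr(a, b):
    """Sample correlation of two equal-length lists."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    sab = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    saa = sum((ai - abar) ** 2 for ai in a)
    sbb = sum((bi - bbar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

data = simulate(40)
x1, x2, x3, y = (list(col) for col in zip(*data))
print(round(corr(x1, x2), 3), round(corr(x1, y), 3))
```

With only 40 observations the sample correlations can be far from 0.5, which is also visible in the Stata output; increasing n in simulate brings them close to rho.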


Summary statistics and correlations for the variables are 


. * Summarize data 
. summarize 

    Variable |   Obs        Mean    Std. dev.        Min        Max 
          x1 |    40    .3337951    .8986718   -1.099225   2.754746 
          x2 |    40    .1257017    .9422221   -2.081086   2.770161 
          x3 |    40    .0712341    1.034616   -1.676141   2.931045 
           y |    40    3.107987    3.400129   -3.542646   10.60979 

. correlate 
(obs=40) 

             |       x1       x2       x3        y 
          x1 |   1.0000 
          x2 |   0.5077   1.0000 
          x3 |   0.4281   0.2786   1.0000 
           y |   0.4740   0.3370   0.2046   1.0000 


Ordinary least-squares (OLS) estimation of y on x1, x2, and x3 yields the 
following: 


. * OLS regression of y on x1-x3 
. regress y x1 x2 x3, vce(robust) 


Linear regression                               Number of obs     =         40 
                                                F(3, 36)          =       4.91 
                                                Prob > F          =     0.0058 
                                                R-squared         =     0.2373 
                                                Root MSE          =     3.0907 

             |               Robust 
           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval] 
          x1 |   1.555582   .5006152     3.11   0.004     .5402873    2.570877 
          x2 |   .4707111   .5251826     0.90   0.376    -.5944086    1.535831 
          x3 |  -.0256025   .6009393    -0.04   0.966    -1.244364    1.193159 
       _cons |   2.531396   .5377607     4.71   0.000     1.440766    3.622025 


Only variable x1 is statistically significant at level 0.05. This is very likely 
given that the DGP depends only on x1, but it is by no means certain. Because of 
randomness, we expect each of the variables x2 and x3 to be statistically 
significant at level 0.05 in about 5% of similarly generated datasets. 


28.2.2 Mean squared error 


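In general, we predict at point $x_0$ using $\widehat{y}_0 = \widehat{g}(x_0)$, where for OLS 
$\widehat{g}(x_0) = x_0'\widehat{\beta}$. We wish to estimate the expected prediction error 
$E\{(y_0 - \widehat{y}_0)^2\}$. 


The standard criterion used for a continuous dependent variable is 
minimization of the mean squared error (MSE) of the predictor, 

$$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \widehat{y}_i)^2$$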


If the MSE is computed in sample, it underestimates the true prediction 
error. One way to see this underestimation is that with $N$ linearly independent 
regressors, including the intercept, OLS necessarily produces a perfect fit 
with $R^2 = 1$ and MSE = 0. By contrast, the true prediction error will 
generally be greater than 0. 


A second way to see this underestimation is to note that if $y = X\beta + u$, 
then the OLS residual vector $\widehat{u} = y - X\widehat{\beta} = (I - M)u$, where 
$M = X(X'X)^{-1}X'$. Because $(I - M) < I$ in the matrix sense, it follows 
that on average $|\widehat{u}_i| < |u_i|$, so the OLS residual on average is smaller than the 
true unknown error. For similar reasons, with independent homoskedastic 
errors, the unbiased estimator of $\sigma^2$ is $s^2 = \{1/(N-K)\}\sum_{i=1}^{N}(y_i - \widehat{y}_i)^2$ 
and not the smaller $(1/N)\sum_{i=1}^{N}(y_i - \widehat{y}_i)^2$. 
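
The degrees-of-freedom correction can be made explicit with a short derivation. This is a standard textbook argument, sketched here under the added assumptions that $E(uu'|X) = \sigma^2 I$ and that $X$ has full column rank $K$; it uses the fact that $M = X(X'X)^{-1}X'$ is symmetric and idempotent with trace $K$.

```latex
\begin{align*}
E(\widehat{u}'\widehat{u} \mid X)
  &= E\{u'(I - M)'(I - M)u \mid X\} \\
  &= E\{u'(I - M)u \mid X\}
     && \text{since } I - M \text{ is symmetric and idempotent} \\
  &= \sigma^2 \operatorname{tr}(I - M) \\
  &= \sigma^2 (N - K)
     && \text{since } \operatorname{tr}(M) = K
\end{align*}
% Hence E{(1/N) \sum_i \widehat{u}_i^2} = \sigma^2 (N-K)/N < \sigma^2,
% whereas dividing by N - K gives an unbiased estimator of \sigma^2.
```

So the in-sample MSE understates $\sigma^2$ on average by the factor $(N-K)/N$, and the understatement grows with the number of regressors $K$.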

Several methods seek to adjust MSE for model size. These include 
information criteria that penalize MSE for model size and CV measures that fit 
the model on a subsample and compute MSE for the remainder of the sample. 


We focus on using MSE as the measure of predictive ability. For 
continuous data, other measures can be used such as the mean absolute error 
$(1/N)\sum_{i=1}^{N}|y_i - \widehat{y}_i|$. For likelihood-based models, the log likelihood may 
be used. For generalized linear models, the deviance may be used. For 
binary outcomes, the number of incorrect classifications is commonly used if 
interest lies in predicting the actual outcome y, rather than Pr(y = 1). 


28.2.3 Information criteria and related penalty measures 


Two standard measures that penalize model fit for model size are Akaike’s 
information criterion (AIC) and the Bayesian information criterion (BIC). The 
general formulas for these measures were presented in section 13.8.2. 


Specializing to the classical linear regression model under independent 
and identically distributed normal errors, minus two times the fitted log 
likelihood equals $N\ln 2\pi + N + N\ln\text{MSE}$, leading to 

$$\text{AIC} = N\ln 2\pi + N + N\ln\text{MSE} + 2K$$
$$\text{BIC} = N\ln 2\pi + N + N\ln\text{MSE} + (\ln N)\times K$$


where K is the number of regressors, including the intercept. Models 
with smaller AIC and BIC are preferred, so AIC and BIC are penalized measures 
of MSE. BIC has a larger penalty for model size than AIC, so BIC leads to 
smaller models and might be preferred when more parsimonious models are 
desired. 
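
As a quick numerical check of these criteria, here is a small Python sketch (not Stata). It assumes that minus two times the fitted log likelihood equals $N\ln 2\pi + N + N\ln\text{MSE}$ with MSE = RSS/N, and it takes Root MSE = 3.0907, N = 40, and K = 4 from the OLS output in section 28.2.1; the function name aic_bic is hypothetical.

```python
import math

def aic_bic(mse, n, k):
    """AIC and BIC for a linear model with i.i.d. normal errors, where mse = RSS/n."""
    minus2ll = n * math.log(2.0 * math.pi) + n + n * math.log(mse)
    aic = minus2ll + 2 * k            # penalty of 2 per parameter
    bic = minus2ll + math.log(n) * k  # penalty of ln(n) per parameter
    return aic, bic

n, k = 40, 4
# Root MSE squared is RSS/(n - k); rescale to RSS/n to match the MSE definition.
mse_full = 3.0907 ** 2 * (n - k) / n
aic, bic = aic_bic(mse_full, n, k)
print(round(aic, 2), round(bic, 2))
```

Because $\ln 40 \approx 3.69 > 2$, the BIC penalty per regressor is larger than the AIC penalty here, which is why BIC tends to select smaller models.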


A related information measure is Mallows's $C_p$ measure, 

$$C_p = (N \times \text{MSE}/\widehat{\sigma}^2) - N + 2K$$

where $\widehat{\sigma}^2 = \sum_{i=1}^{N}(y_i - \widetilde{y}_i)^2/(N - p)$ and $\widetilde{y}_i$ is the OLS prediction from the 
largest model under consideration that has $p$ regressors, including the 
intercept. Models with smaller $C_p$ are preferred. Selecting a model on the 
basis of minimum $C_p$ is asymptotically equivalent to minimizing AIC. 


Another penalty measure that is often used is adjusted $R^2$, denoted $\bar{R}^2$, 
which can be expressed as 

$$\bar{R}^2 = 1 - \frac{\text{MSE}\times N/(N-K)}{\text{TSS}/(N-1)}$$

where $\text{TSS} = \sum_{i=1}^{N}(y_i - \bar{y})^2$. Again, MSE is penalized for model size 
because $\bar{R}^2$ is a decreasing function of $K$. However, this penalty is relatively 
small. In the classical linear regression model with homoskedastic normally 
distributed errors, it can be shown that, for nested models, choosing the 
larger model if it has higher $\bar{R}^2$ is equivalent to choosing the larger model if 
the F statistic for testing joint significance of the additional regressors 
exceeds one. The usual critical values used in such a test are substantially 
greater than one. 


To enable comparison of all eight possible models that are linear in 
parameters and regressors, we first define the regressor lists for each model. 


. * Regressor lists for all possible models 
. global xlist1 
. global xlist2 x1 
. global xlist3 x2 
. global xlist4 x3 
. global xlist5 x1 x2 
. global xlist6 x2 x3 
. global xlist7 x1 x3 
. global xlist8 x1 x2 x3 

The following code provides a loop that estimates each model and 
computes the various penalized fit measures. Note that the global macro 
defining the kth regressor list needs to be referenced as ${xlist`k'} rather 
than simply $xlist`k'. We have 


. * Full-sample estimates with AIC, BIC, Cp, R2adj penalties 
. qui regress y $xlist8 

. scalar s2full = e(rmse)^2 // Needed for Mallows Cp 

. forvalues k = 1/8 { 
    qui regress y ${xlist`k'} 
    scalar mse`k' = e(rss)/e(N) 
    scalar r2adj`k' = e(r2_a) 
    scalar aic`k' = -2*e(ll) + 2*e(rank) 
    scalar bic`k' = -2*e(ll) + e(rank)*ln(e(N)) 
    scalar cp`k' = e(rss)/s2full - e(N) + 2*e(rank) 
    display "Model " "${xlist`k'}" _col(15) " MSE=" %6.3f mse`k' 
>     " R2adj=" %6.3f r2adj`k' " AIC=" %7.2f aic`k' 
>     " BIC=" %7.2f bic`k' " Cp=" %6.3f cp`k' 
  } 
Model           MSE=11.272 R2adj= 0.000 AIC= 212.41 BIC= 214.10 Cp= 9.199 
Model x1        MSE= 8.739 R2adj= 0.204 AIC= 204.23 BIC= 207.60 Cp= 0.593 
Model x2        MSE= 9.992 R2adj= 0.090 AIC= 209.58 BIC= 212.96 Cp= 5.838 
Model x3        MSE=10.800 R2adj= 0.017 AIC= 212.70 BIC= 216.08 Cp= 9.224 
Model x1 x2     MSE= 8.598 R2adj= 0.196 AIC= 205.58 BIC= 210.64 Cp= 2.002 
Model x2 x3     MSE= 9.842 R2adj= 0.080 AIC= 210.98 BIC= 216.05 Cp= 7.211 
Model x1 x3     MSE= 8.739 R2adj= 0.183 AIC= 206.23 BIC= 211.29 Cp= 2.592 
Model x1 x2 x3  MSE= 8.597 R2adj= 0.174 AIC= 207.57 BIC= 214.33 Cp= 4.000 


The MSE, which does not penalize for model size, is smallest for the largest 
model with all three regressors. The penalized measures all favor the model 
with an intercept and x1 as the only regressors. More generally, different 
penalties may favor different models. 


28.2.4 The splitsample command 


The preceding example used the same sample for both estimation and 
measurement of model fit. CV instead fits the model on one sample, called a 
training sample, and measures predictive ability based on a different sample, 
called a test sample or holdout sample or validation sample. This approach 
can be applied to a range of models and to loss functions other than MSE. 


Mutually exclusive samples of prespecified size can be generated using 
the command splitsample, which creates a variable identifying the different 
samples. For example, the sample can be split into five equally sized 
mutually exclusive samples as follows: 


. * Split sample into five equal-size parts using splitsample command 
. splitsample, nsplit(5) generate(snum) rseed(10101) 


. tabulate snum 


snum Freq. Percent Cum. 

1 8 20.00 20.00 

2 8 20.00 40.00 

3 8 20.00 60.00 

4 8 20.00 80.00 

5 8 20.00 100.00 
Total 40 100.00 


For replicability, the rseed() option is used. The variable snum identifies the 
five samples. The alternative split() option allows splitting in specified 
ratios. For example, split(1 1 2) will split the sample into subsamples of, 
respectively, 25%, 25%, and 50% of the sample. The cluster() option 
enables sample splitting by cluster. If data are missing, then it is best to 
include as an argument a list of relevant variables to ensure that the splits are 
on the sample with nonmissing observations. This is especially important if, 
for example, observations with missing values appeared at the end of the 
dataset. 


28.2.5 Single-split cross-validation 


We begin with the simplest approach of single-split validation. We randomly 
divide the original sample into two parts: a training sample on which the 
model will be fit and a test sample that will be used to assess the fit of the 
model. 


It is common to use a larger part of the original sample for estimation 
and a smaller part for assessing predictive ability. The following code creates 
an indicator variable for a training sample of 80% of the data (dtrain==1) 
and a test sample of 20% of the data (dtrain==0). 


. * Form indicator for training data (80% of sample) and test data (20%) 
. splitsample, split(1 4) values(0 1) generate(dtrain) rseed(10101) 

. tabulate dtrain 

     dtrain |      Freq.     Percent        Cum. 
          0 |          8       20.00       20.00 
          1 |         32       80.00      100.00 
      Total |         40      100.00 


We fit each of the 8 potential regression models on the 32 observations in 
the training sample and compute the MSE separately for the 32 observations 
in the training sample (an in-sample MSE) and for the 8 observations in the 
test sample (an out-of-sample MSE). 


. * Single-split validation - training and test MSE for the 8 possible models 
. forvalues k = 1/8 { 
    qui reg y ${xlist`k'} if dtrain==1 
    qui predict y`k'hat 
    qui gen y`k'errorsq = (y`k'hat - y)^2 
    qui sum y`k'errorsq if dtrain == 1 
    scalar mse`k'train = r(mean) 
    qui sum y`k'errorsq if dtrain == 0 
    qui scalar mse`k'test = r(mean) 
    display "Model " "${xlist`k'}" _col(16) 
>     " Training MSE = " %7.3f mse`k'train " Test MSE = " %7.3f mse`k'test 
  } 
Model           Training MSE =  10.124 Test MSE =  16.280 
Model x1        Training MSE =   7.478 Test MSE =  13.871 
Model x2        Training MSE =   8.840 Test MSE =  14.803 
Model x3        Training MSE =   9.658 Test MSE =  15.565 
Model x1 x2     Training MSE =   7.288 Test MSE =  13.973 
Model x2 x3     Training MSE =   8.668 Test MSE =  14.674 
Model x1 x3     Training MSE =   7.474 Test MSE =  13.892 
Model x1 x2 x3  Training MSE =   7.288 Test MSE =  13.980 

. drop y*hat y*errorsq 


As expected, the in-sample MSE (where we normalize by N) decreases as 
regressors are added and is minimized at 7.288 when all 3 regressors are 
included. 


But when we instead consider the out-of-sample MSE, we find that this is 
minimized at 13.871 when only x1 is a regressor. Indeed, the model with all 
three regressors has the fifth-highest out-of-sample MSE, due to in-sample 
overfitting. 


28.2.6 K-fold cross-validation 


The results from single-split validation depend on how the sample is split. 
For example, in the current example, different sample splits due to different 
seeds can lead to out-of-sample MSE being minimized by models other than 
that with x1 alone as regressor. K-fold CV reduces this limitation by forming 
more than one split of the full sample. 


Specifically, the sample is randomly divided into K groups or folds of 
approximately equal size. In turn, one of the K folds is used as the test 
dataset, while the remaining K — 1 folds are used as the training set. Thus, 
when fold 1 is the test dataset, the model is fit on folds 2 to K; when fold 2 
is the test dataset, the model is fit on fold 1 and folds 3 to K; and so on. The 
following shows the case K = 5. 


        Fit on folds    Test on fold 
j=1     2,3,4,5         1 
j=2     1,3,4,5         2 
j=3     1,2,4,5         3 
j=4     1,2,3,5         4 
j=5     1,2,3,4         5 


Then the CV measure is the average of the $K$ MSEs, 

$$\text{CV}_K = \frac{1}{K}\sum_{j=1}^{K}\text{MSE}_{(j)}$$

where $\text{MSE}_{(j)}$ is the MSE for fold $j$ based on OLS estimates obtained by 
regression using all data except fold $j$. 
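
The fold-by-fold procedure can be sketched in a few lines of Python (not Stata). This is an illustrative sketch only: it handles a single regressor plus intercept, forms folds by shuffling indices rather than by Stata's splitsample, and uses freshly simulated data, so the function names fit_simple and kfold_cv are hypothetical and the numbers will not match the Stata output.

```python
import random

def fit_simple(x, y):
    """OLS of y on an intercept and one regressor; returns (intercept, slope)."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def kfold_cv(x, y, k=5, seed=10101):
    """Return (CV_K, list of the k fold MSEs) for the regression of y on x."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[j::k] for j in range(k)]  # k folds of near-equal size
    fold_mses = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        # Fit on all data except the current fold, then predict the held-out fold.
        a, b = fit_simple([x[i] for i in train], [y[i] for i in train])
        fold_mses.append(sum((y[i] - a - b * x[i]) ** 2 for i in test) / len(test))
    return sum(fold_mses) / k, fold_mses

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(40)]
y = [2 + xi + rng.gauss(0, 3) for xi in x]
cv5, mses = kfold_cv(x, y)
print(round(cv5, 3), [round(m, 3) for m in mses])
```

Each observation is used for testing exactly once and for training K - 1 times, which is what distinguishes K-fold CV from repeated single splits.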


As the number of folds increases, the training set size increases, so bias 
decreases. At the same time, the fitted models overlap, so test set predictions 
are more highly correlated, leading to greater variance in the estimate of the 
expected prediction error $E\{(y_0 - \widehat{y}_0)^2\}$. The consensus is that $K = 5$ or 
$K = 10$ provides a good balance between bias and variance; most common 
is to set $K = 10$. 


With so few observations, we use K = 5 and obtain the MSE for each 
fold. 


. * Five-fold CV example for model with all regressors 
. splitsample, nsplit(5) generate(foldnum) rseed(10101) 

. matrix allmses = J(5,1,.) 

. forvalues i = 1/5 { 
    qui reg y x1 x2 x3 if foldnum != `i' 
    qui predict y`i'hat 
    qui gen y`i'errorsq = (y`i'hat - y)^2 
    qui sum y`i'errorsq if foldnum == `i' 
    matrix allmses[`i',1] = r(mean) 
  } 

. matrix list allmses 


allmses[5,1] 
           c1 
r1  13.980321 
r2  6.4997357 
r3  9.3623792 
r4   6.413401 
r5   12.23958 


To obtain the CV5 measure, we convert the matrix allmses to a variable 
and obtain its mean. 


. * Compute the average MSE over the five folds and standard deviation 
. svmat allmses, names(vallmses) 


. qui sum vallmses1 


. display "CV5 = " %5.3f r(mean) " with st. dev. = " %5.3f r(sd) 
CV5 = 9.699 with st. dev. = 3.389 


The resulting CV5 measure is 9.699. The MSEs do vary considerably over the 
five folds with standard deviation 3.389, which is large relative to the 
average. 


The community-contributed crossfold command (Daniels 2012) 
performs K-fold CV. Applying this to all 8 potential models, using K = 5 
and the same split for each model, we obtain 


. * Five-fold CV measure for all possible models 
. forvalues k = 1/8 { 
    set seed 10101 
    qui crossfold regress y ${xlist`k'}, k(5) 
    matrix RMSEs`k' = r(est) 
    svmat RMSEs`k', names(rmse`k') 
    qui generate mse`k' = rmse`k'^2 
    qui sum mse`k' 
    scalar cv`k' = r(mean) 
    scalar sdcv`k' = r(sd) 
    display "Model " "${xlist`k'}" _col(16) " CV5 = " %7.3f cv`k' 
>     " with st. dev. = " %7.3f sdcv`k' 
  } 
Model           CV5 =  11.960 with st. dev. =   3.561 
Model x1        CV5 =   9.138 with st. dev. =   3.069 
Model x2        CV5 =  10.407 with st. dev. =   4.139 
Model x3        CV5 =  11.776 with st. dev. =   3.272 
Model x1 x2     CV5 =   9.173 with st. dev. =   3.367 
Model x2 x3     CV5 =  10.872 with st. dev. =   4.221 
Model x1 x3     CV5 =   9.639 with st. dev. =   2.985 
Model x1 x2 x3  CV5 =   9.699 with st. dev. =   3.389 


The crossfold command reports for each fold the root mean squared error 
(RMSE), the square root of the MSE, rather than the MSE. To compute the CV5 
measure, we retrieve the RMSEs stored in matrix r(est) and calculate the 
average of the squares of the RMSEs. 


The CV measure is lowest for the model with x1 the only regressor. 
However, it is only slightly higher in the model with both x1 and x2 
included. Recall that the folds are randomly chosen, so that with a different 
seed, we would obtain different folds, different CV values, and hence might 
find that, for example, the model with both x1 and x2 included has the 
minimum CV. 


Because of the randomness in K-fold CV, some studies use a one 
standard-error rule that chooses the smallest model with CV within one 
standard deviation of the model with minimum CV. Applying that rule in this 
particular example favors (marginally) an intercept-only model because 
9.138 + 3.069 = 12.207 > 11.960. 
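
The rule is easy to apply mechanically. The Python sketch below (not Stata) hard-codes the CV5 values and standard deviations from the crossfold output above; the function name one_se_rule is hypothetical, and breaking ties among equally sized models by the lower CV value is an assumption made here for illustration.

```python
# (model, number of regressors excluding the intercept, CV5, st. dev.)
results = [
    ("intercept only", 0, 11.960, 3.561),
    ("x1", 1, 9.138, 3.069),
    ("x2", 1, 10.407, 4.139),
    ("x3", 1, 11.776, 3.272),
    ("x1 x2", 2, 9.173, 3.367),
    ("x2 x3", 2, 10.872, 4.221),
    ("x1 x3", 2, 9.639, 2.985),
    ("x1 x2 x3", 3, 9.699, 3.389),
]

def one_se_rule(results):
    """Smallest model whose CV is within one st. dev. of the minimum-CV model."""
    best = min(results, key=lambda r: r[2])
    threshold = best[2] + best[3]
    eligible = [r for r in results if r[2] <= threshold]
    # Assumed tie-break: among models of equal size, prefer the lower CV.
    chosen = min(eligible, key=lambda r: (r[1], r[2]))
    return chosen, threshold

chosen, threshold = one_se_rule(results)
print(chosen[0], round(threshold, 3))
```

Here every model falls under the threshold, so the rule selects the smallest one; with a tighter spread across folds the rule would instead return the model with x1 alone.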


Once the preferred model is obtained by CV, it is fit using the entire 
sample or, potentially, a new sample. Note that the data mining to obtain a 
preferred model introduces issues similar to those raised by pretest bias; see 
section 11.3.8. Section 28.8 provides examples of special settings, methods, 
and assumptions for which it is possible to ignore the data mining. 


CV is easily adapted to estimators other than OLS and to other loss 
functions such as mean absolute error $(1/N)\sum_{i=1}^{N}|y_i - \widehat{y}_i|$. Information 
criteria have the advantage of being less computationally demanding and can 
yield results not too dissimilar from those obtained using CV. 


28.2.7 Leave-one-out cross-validation 


Leave-one-out cross-validation (LOOCV) is the special case of K-fold CV with 
K = N. Then N models are fit; in each model (N — 1) observations are 
used in training, and the remaining observation is used for validation. So we 
drop each observation in turn, fitting a model without that observation and 
then using the fitted model to predict the dropped observation. 


The community-contributed loocv command (Barron 2014) implements 
LOOCV. Note, however, that it is quite slow because it is written to apply to 
any Stata estimation command and does not take advantage of the great 
computational savings that are possible in the special case of OLS. 
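
One such saving for OLS is the well-known leverage identity: the leave-one-out prediction error for observation $i$ equals the full-sample residual divided by $1 - h_{ii}$, where $h_{ii}$ is the $i$th diagonal element of $X(X'X)^{-1}X'$. The Python sketch below (not Stata, and not how the loocv command works) compares brute-force LOOCV with this shortcut for a single regressor plus intercept, where $h_{ii} = 1/N + (x_i - \bar{x})^2/\sum_j (x_j - \bar{x})^2$. The function names and the simulated data are hypothetical.

```python
import random

def fit_simple(x, y):
    """OLS of y on an intercept and one regressor; returns (intercept, slope)."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def loocv_brute(x, y):
    """LOOCV MSE by refitting the model N times, dropping one observation each time."""
    total = 0.0
    for i in range(len(y)):
        a, b = fit_simple(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - a - b * x[i]) ** 2
    return total / len(y)

def loocv_shortcut(x, y):
    """LOOCV MSE from a single fit, using residual / (1 - leverage)."""
    n = len(y)
    a, b = fit_simple(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    total = 0.0
    for xi, yi in zip(x, y):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx  # leverage of this observation
        total += ((yi - a - b * xi) / (1.0 - h)) ** 2
    return total / n

rng = random.Random(3)
x = [rng.gauss(0, 1) for _ in range(40)]
y = [2 + xi + rng.gauss(0, 3) for xi in x]
print(loocv_brute(x, y), loocv_shortcut(x, y))
```

The two numbers agree up to rounding error, but the shortcut needs one regression rather than N.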


For the model with x1 the only regressor, we obtain 


. * LOOCV 
. loocv regress y x1 


Leave-One-Out Cross-Validation Results 

         Method                 Value 
Root Mean Squared Errors    3.0989007 
    Mean Absolute Errors    2.5242994 
               Pseudo-R2    .15585569 

. display "LOOCV MSE = " r(rmse)^2 
LOOCV MSE = 9.6031853 


The MSE from LOOCV is $3.0989^2 = 9.603$, compared with the preceding CV5 
measure of 9.138 in the model with just x1 as regressor. 


LOOCV is not as good for measuring global fit because the N folds are 
highly correlated with each other, leading to higher variance than if K = 5 
or K = 10. LOOCV is used especially for nonparametric regression where 
concern is with local fit; see section 27.2.4. Assuming a correctly specified 
likelihood model, model selection on the basis of LOOCV is asymptotically 
equivalent to using AIC. 


28.2.8 Best subsets selection and stepwise selection 


The best subsets method sequentially determines the best-fitting model for 
models with one regressor, with two regressors, and so on up to all p 
potential regressors. Then K-fold cross-validated MSE or a penalty measure 
such as AIC or BIC is computed for these p best-fitting models of different 
sizes. In theory, there are $2^p$ models to fit, but this is greatly reduced by 
using a method called the leaps-and-bounds algorithm. 


Stepwise selection methods, introduced in section 11.3.7, entail less 
computation than the best subsets method. For example, with p potential 
regressors, the stepwise forward procedure requires 
$p + (p-1) + \cdots + 1 = p(p+1)/2$ regressions. 


The community-contributed vselect command (Lindsey and 
Sheather 2010) implements best subsets and stepwise selection methods for 
OLS regression with predictive ability measured using any of adjusted $R^2$, 
AIC, BIC, or AICC, where AICC is a bias-corrected version of AIC that equals 
$\text{AIC} + 2(K+1)(K+2)/(N-K-2)$. The vselect command, however, 
does not cover K-fold cross-validated MSE. 


The default for the vselect command is to use the best subsets method. 
We obtain 


. * Best subset selection with community-contributed command vselect 
. vselect y x1 x2 x3, best 


Response :             y 
Selected predictors:   x1 x2 x3 

Optimal models: 

   # Preds     R2ADJ         C       AIC      AICC       BIC 
         1  .2043123  .5925225  204.2265  204.8932  207.6042 
         2  .1959877  2.002325  205.5761  206.7189  210.6427 
         3  .1737073         4  207.5735  209.3382   214.329 

predictors for each model: 

1 : x1 
2 : x1 x2 
3 : x1 x2 x3 


For models of a given size, all measures reduce to minimizing MSE, while the 
various measures give different penalties for increased model size. The 
best-fitting models with one, two, and three regressors are those with, 
respectively, regressors x1, (x1,x2), and (x1,x2,x3). All the penalized 
measures favor the model with just x1 and an intercept as regressor. 


The forward (or backward) option of the vselect command implements 
forward (or backward) selection. Then one additionally needs to specify 
which of the various penalty measures is used as model-selection criterion. 
The fix() option of the vselect command enables specifying regressors 
that should be included in all models, and the command permits weighted 
regression. 


The community-contributed gvselect command (Lindsey and 
Sheather 2015) implements best subsets selection for any Stata command 
that reports a fitted log likelihood. Then the best-fitting model of a given size 
is that with the highest-fitted log likelihood, and the best model overall is 
that with the smallest AIC or BIC. 


28.3 Shrinkage estimators 


The linear model can be made quite flexible by including as regressors 
transformations of underlying variables such as polynomials and 
interactions. Nonetheless, the machine learning literature has introduced 
other models and other estimation methods that can predict better than OLS. 


In this section, we present shrinkage estimators, most notably, the lasso, 
that shrink parameter estimates toward zero. The resultant reduction in 
variability may be large enough to offset the induced bias, 
leading to lower MSE. 


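To see this potential gain, consider a scalar unbiased estimator $\widehat{\theta}$ with 
$E(\widehat{\theta}) = \theta$ and $\text{Var}(\widehat{\theta}) = v$. Then $\text{MSE}(\widehat{\theta}) = v$ because the MSE equals 
variance plus squared bias. 


Now, define the shrinkage estimator $\widetilde{\theta} = a\widehat{\theta}$, where $0 \le a \le 1$. Then 
$\text{Var}(\widetilde{\theta}) = \text{Var}(a\widehat{\theta}) = a^2 v$ and $\text{Bias}(\widetilde{\theta}) = E(\widetilde{\theta}) - \theta = (a-1)\theta$, so 
$\text{MSE}(\widetilde{\theta}) = a^2 v + (a-1)^2\theta^2$. In the case that $\widetilde{\theta}$ shrinks all the way to 0 ($a = 0$), 
$\widetilde{\theta}$ has lower MSE than $\widehat{\theta}$ if $\theta^2 < v$. And if $a = 0.9$, then $\widetilde{\theta}$ has lower MSE 
than $\widehat{\theta}$ for $\theta^2 < 19v$. The potential reduction in MSE of $\widetilde{\theta}$ carries over 
directly to the predictor $\widetilde{y} = x'\widetilde{\theta}$. 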
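
The variance-bias tradeoff for the scaled estimator can be checked numerically with a few lines of Python (not Stata). The sketch assumes an unbiased estimator with variance $v$ that is multiplied by a constant $a$, so its MSE is $a^2 v + (a-1)^2\theta^2$; the function name mse_shrunk is hypothetical.

```python
def mse_shrunk(a, theta_sq, v):
    """MSE of a * theta_hat when theta_hat is unbiased for theta with variance v."""
    variance = a ** 2 * v
    squared_bias = (a - 1.0) ** 2 * theta_sq
    return variance + squared_bias

v = 1.0  # MSE of the unshrunk estimator (a = 1)
for theta_sq in (0.5, 18.0, 20.0):
    print(theta_sq, mse_shrunk(0.9, theta_sq, v), mse_shrunk(0.0, theta_sq, v))
```

With $v = 1$, mild shrinkage ($a = 0.9$) beats the unbiased estimator when $\theta^2 = 18$ but not when $\theta^2 = 20$, in line with the $\theta^2 < 19v$ threshold, while shrinking all the way to zero only helps when $\theta^2 < v$.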


Shrinkage methods are also called penalized or regularized methods. 
Many shrinkage estimators can also be interpreted in a Bayesian framework 
as weighted sums of a specified prior and sample maximum-likelihood 
estimator. In some other cases, the shrinkage estimator may be a limiting 
form of such an estimator. 


We focus on shrinkage for linear regression with MSE loss. But 
shrinkage estimators can be applied to other settings with different loss 
functions, such as maximum likelihood estimation for the logit model. 


We present the leading shrinkage estimators: ridge regression shrinks all 
parameters toward zero, the lasso sets some parameters to zero, while other 
parameters are shrunk toward zero, and the elastic net combines ridge 
regression and lasso. An introductory treatment is given in James 
et al. (2021, chap. 6.2). 


28.3.1 Ridge regression 


The ridge estimator of Hoerl and Kennard (1970), also known as Tikhonov 
regularization, is a biased estimator that reduces MSE by retaining all 
regressors but shrinking parameter estimates toward zero. 


The ridge estimator $\widehat{\beta}_\lambda$ of $\beta$ minimizes 

$$Q_\lambda(\beta) = \frac{1}{2N}\sum_{i=1}^{N}(y_i - x_i'\beta)^2 + \frac{\lambda}{2}\sum_{j=1}^{p}\kappa_j\beta_j^2 \qquad (28.1)$$

where $\lambda \ge 0$ is a tuning parameter that needs to be provided and $p$ is the 
number of regressors. Different values of the penalty parameter $\lambda$ lead to 
different ridge estimators. The regressors $x$ are standardized, and some 
simpler methods set $\kappa_j = 1$ for all $j$. 


The first term in the objective function is proportional to the sum of squared 
residuals minimized by the OLS estimator. The second term is a penalty that for 
given $\lambda$ is likely to increase with the number of regressors $p$. 


The resulting ridge estimator when all $\kappa_j = 1$ can be expressed as 

$$\widehat{\beta}_\lambda = (X'X + \lambda N I_p)^{-1}X'y$$

where $p$ is the number of regressors. This estimator is a shrinkage estimator 
because it shrinks the OLS estimator $\widehat{\beta} = (X'X)^{-1}X'y$, the special case 
$\lambda = 0$, toward 0. In the simplest case that the regressor matrix $X$ is 
orthonormalized so that $X'X = I_p$, the OLS estimator is $X'y$, and the ridge 
estimator is $X'y/(1 + \lambda N)$, which shrinks all coefficients toward 0 by the 
same multiplicative factor. For a given specification, a shrinkage factor that 
would lower the MSE of the prediction can be shown to exist; the practical 
task is to estimate it. 


When the basic ridge regression is used, it is customary to standardize 
the regressors to have zero mean and unit variance. Some references and 
ridge regression programs assume that the dependent variable has been 
demeaned to have zero mean. In that case, $y_i$ is replaced with $y_i - \bar{y}$, and, 
without loss of generality, the intercept can be dropped. The following code 
standardizes the regressors and demeans the dependent variable. 


. * Standardize regressors and demean y 
. foreach var of varlist x1 x2 x3 { 
    qui egen double z`var' = std(`var') 
  } 

. qui summarize y 

. qui generate double ydemeaned = y - r(mean) 


. summarize ydemeaned z* 


    Variable |   Obs        Mean    Std. dev.        Min        Max 
   ydemeaned |    40   -3.33e-17    3.400129   -6.650633   7.501798 
         zx1 |    40    2.63e-17           1   -1.594598   2.693921 
         zx2 |    40    2.62e-17           1    -2.34211    2.80662 
         zx3 |    40   -2.98e-17           1   -1.688912   2.764129 


The Stata commands presented below for shrinkage estimation 
automatically standardize regressors. Then the preceding code is 
unnecessary. 


There are several ways to choose the value of the penalty parameter $\lambda$. 
It can be determined by CV, and algorithms exist to quickly compute $\widehat{\beta}_\lambda$ for 
many values of $\lambda$. Alternatively, penalty measures such as BIC may be used. 
Then the penalty based on the number of regressors $p$ may be replaced by 
the effective degrees of freedom, which for ridge regression can be shown 
to equal $\sum_{j=1}^{p}\lambda_j/(\lambda_j + \lambda)$, where $\lambda_j$ are the eigenvalues of $X'X$. 


28.3.2 Lasso 


The lasso estimator, due to Tibshirani (1996), reduces MSE by setting some 
coefficients to zero. Additionally, the coefficients of retained variables are 
shrunk toward zero. Unlike ridge regression, the lasso can be used for 
variable selection. 


The lasso estimator $\widehat{\beta}_\lambda$ of $\beta$ minimizes 

$$Q_\lambda(\beta) = \frac{1}{2N}\sum_{i=1}^{N}(y_i - x_i'\beta)^2 + \lambda\sum_{j=1}^{p}\kappa_j|\beta_j| \qquad (28.2)$$

where $\lambda \ge 0$ is a tuning parameter that needs to be provided and $p$ is the 
number of potential regressors. Different values of the penalty parameter $\lambda$ 
lead to different lasso estimators. The regressors $x$ are standardized, and 
some simpler implementations set $\kappa_j = 1$ for all $j$. 


The first term in the objective function is the sum of squared residuals 
minimized by the OLS estimator. The second term is a penalty measure 
based on the absolute value of the parameters, unlike the ridge estimator, 
which uses the square of the parameters. 


There is no explicit solution for the resulting lasso estimator. It can be 
shown that the lasso estimator sets all but $k \le p$ of the $\beta_j$ coefficients to 
zero, where $k$ is a decreasing function of $\lambda$, while also shrinking nonzero 
coefficients toward zero. 
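
Although there is no closed form in general, one special case shows clearly how coefficients are set exactly to zero. The Python sketch below (not Stata) assumes orthogonal standardized regressors with $X'X = N I_p$, all $\kappa_j = 1$, and an objective scaled as $(1/2N)$ times the sum of squared residuals plus the penalty. Under those assumptions the problem separates by coefficient: the lasso solution is the soft-thresholded OLS coefficient, whereas a ridge penalty of $(\lambda/2)\beta_j^2$ rescales it. The function names are hypothetical.

```python
def soft_threshold(b, lam):
    """Lasso solution for one coefficient: move the OLS value b toward zero by lam,
    and set it exactly to zero when |b| <= lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def ridge_shrink(b, lam):
    """Ridge solution for one coefficient: proportional shrinkage, never exactly zero
    unless b itself is zero."""
    return b / (1.0 + lam)

lam = 0.5
for b_ols in (1.5, 0.3, -1.5):
    print(b_ols, soft_threshold(b_ols, lam), ridge_shrink(b_ols, lam))
```

The small OLS coefficient 0.3 is dropped by the lasso but only scaled down by ridge, which is the selection-versus-shrinkage distinction in its simplest form.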


To see that some $\beta_j$ may equal zero, suppose there are two regressors. 
The combinations of $\beta_1$ and $\beta_2$ for which the sum of squared residuals 
$\sum_{i=1}^{N}(y_i - \beta_1 x_{1i} - \beta_2 x_{2i})^2$ is constant define an ellipse. Different values 
of the sum of squared residuals correspond to different ellipses, given in 
figure 28.1, and the OLS estimator is the centroid of the ellipses. In general, 
the lasso can be shown to equivalently minimize $\sum_{i=1}^{N}(y_i - x_i'\beta)^2$ subject 
to the constraint that $\sum_{j=1}^{p}\kappa_j|\beta_j| \le s$, where higher values of $s$ correspond 
to lower values of $\lambda$. Specializing to the case $p = 2$, and letting 
$\kappa_1 = \kappa_2 = 1$, we see the lasso constraint $|\beta_1| + |\beta_2| \le s$ defines the 
diamond-shaped region in the left panel of figure 28.1, and we are likely to 
wind up at one of the corners where $\beta_1 = 0$ or $\beta_2 = 0$. By contrast, the 
ridge estimator constraint $\beta_1^2 + \beta_2^2 \le s$ defines a circle, and a corner 
solution is very unlikely. 


Figure 28.1. Lasso versus ridge 


Note that lasso picks the best-fitting linear combination $x'\beta$ subject to 
the lasso constraint, rather than the best variables x. This is especially the 
case when variables are correlated, such as when potential regressors 
include powers and interactions of underlying variables. Thus, the exact 
variables selected will vary in repeated samples or if there is a different 
partition of the data into K folds. 


For prediction, lasso works best when a few of the potential regressors
have $\beta_j \neq 0$, while most $\beta_j = 0$. By comparison, ridge regression works
best when many predictors are important and have coefficients of
standardized regressors that are of similar size. The lasso is suited to
variable selection, whereas ridge regression is not.


Hastie, Tibshirani, and Friedman (2009, chap. 3.8) discuss several 
penalized or regularized estimators that can be viewed as variations of the 
lasso, including the grouped lasso, the smoothly clipped absolute deviation 
penalty, and the Dantzig selector. The lasso estimator is a special case of a 
more general method called least-angle regression. The community-contributed
lars command (Mander 2006) implements least-angle
regression; the a(lasso) option obtains lasso estimates.


A thresholded or relaxed lasso performs an additional modified lasso 
using only those variables chosen by the initial lasso. The adaptive lasso 
presented in section 28.4.4 is an example. 


28.3.3 Elastic net 


In many applications, variables can be highly correlated with each other. 
The lasso penalty will drop many of these correlated variables, while the 
ridge penalty shrinks the coefficients of correlated variables toward each 
other. 


The elastic net combines ridge regression and lasso. This can improve 
the MSE, but will retain more variables than the lasso. The elasticnet 
command has objective function 


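$$Q_{\lambda,\alpha}(\boldsymbol{\beta}) = \frac{1}{2N}\sum_{i=1}^{N}(y_i - \mathbf{x}_i'\boldsymbol{\beta})^2 + \lambda\sum_{j=1}^{p}\left\{\alpha|\beta_j| + \frac{1-\alpha}{2}\beta_j^2\right\} \qquad (28.3)$$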


The ridge penalty averages correlated variables, while the lasso penalty
leads to sparsity. Ridge is the special case $\alpha = 0$, and lasso is the special
case $\alpha = 1$.
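A small numerical sketch may help fix how the mixing parameter interpolates between the two penalties. It is written in Python rather than Stata and assumes the penalty parameterization $\lambda\sum_j\{\alpha|\beta_j| + (1-\alpha)\beta_j^2/2\}$, which is how the elastic net penalty is usually written for Stata's elasticnet command. If a different scaling of the ridge term is used, the numbers change but the special cases do not. The function name and the coefficient values are hypothetical.

```python
def elastic_net_penalty(beta, lam, alpha):
    """lam * sum_j { alpha*|b_j| + (1 - alpha)/2 * b_j**2 } (assumed form)."""
    return lam * sum(alpha * abs(b) + 0.5 * (1.0 - alpha) * b * b for b in beta)


beta = [1.2, -0.3, 0.0]  # hypothetical coefficients, for illustration only

lasso_part = elastic_net_penalty(beta, lam=0.25, alpha=1.0)  # pure lasso penalty
ridge_part = elastic_net_penalty(beta, lam=0.25, alpha=0.0)  # pure ridge penalty
mixed = elastic_net_penalty(beta, lam=0.25, alpha=0.5)       # equal mix of the two

print(lasso_part, ridge_part, mixed)
```

Because the penalty is linear in the mixing parameter, the value at 0.5 is the simple average of the pure lasso and pure ridge penalties.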


28.3.4 Finite sample distribution of lasso-related estimators 


A model-selection method is consistent if asymptotically it correctly selects 
the correct model from a selection of candidate models. A model-selection 
method is conservative if asymptotically it always selects a model that nests 
the correct model. Selecting a model on the basis of minimum BIC is a 
consistent model-selection procedure, while selecting a model on the basis 
of minimum AIC is conservative (Leeb and Pötscher 2005).


A statistical model-selection and estimation method is said to have an 
oracle property if it leads to consistent model selection and a subsequent 
estimator that is asymptotically equivalent to the estimator that could be 
obtained if the true model was known so that model selection was 
unnecessary. 


For example, suppose $y_i = \alpha x_{1i} + \beta x_{2i} + u_i$ and the true model is one
with either $\beta = 0$ or $\beta \neq 0$. A consistent model-selection method correctly
determines whether $\beta = 0$. Let $\hat{\alpha}$ be the estimator of $\alpha$ that first uses a
model-selection method to determine whether $\beta = 0$ and then estimates
whichever model is selected. Then $\hat{\alpha}$ has the oracle property if its
asymptotic distribution is the same as that for the infeasible estimator of $\alpha$
that directly fits the true model without initial model selection.


The lasso is a consistent model-selection procedure but does not have
the oracle property because of its bias. The adaptive lasso presented in
section 28.4.4 is one of several variations of lasso that does have the oracle
property.


Unfortunately, the oracle property is an asymptotic property that, while
potentially useful in some settings such as recognizing numbers on a license
plate, does not carry over to the finite sample settings that economists
encounter. Our models do not fit perfectly, and we expect that with more
observations, we can detect more variables that predict the outcome of
interest. Leeb and Pötscher (2005), for example, consider the preceding
example where $\beta$ is of order $O(1/\sqrt{N})$. Then even though asymptotically
the oracle property may still hold, $\hat{\alpha}$ has a complicated finite sample
distribution that is affected by the first-stage determination of whether
$\beta = 0$. In fact, $\hat{\alpha}$ has MSE that can be very large and even larger than that if
we simply estimated both $\alpha$ and $\beta$ without first determining whether $\beta = 0$.
Mathematically, this difference between finite sample and asymptotic
performance is due to the asymptotic convergence not being uniform with
respect to parameters.


Thus, we cannot perform standard inference on lasso or postlasso OLS 
coefficient estimates. Instead, if inference on parameters is desired, some 
model structure is required, and more complicated estimation methods need 
to be used. These are presented in sections 28.8 and 28.9. 


28.4 Prediction using lasso, ridge, and elasticnet 


We present an application of prediction for linear models using the lasso and
the related shrinkage estimators, ridge and elastic net. These methods can
be adapted to binary outcome models (logit and probit) and exponential
mean models (Poisson), and we provide a logit example.


28.4.1 The lasso command 


Lasso estimates can be obtained using the lasso command, which has 
syntax 


lasso model depvar [(alwaysvars)] othervars [if] [in] [weight] [, options]


The model is one of linear, logit, probit, or poisson. The variables
alwaysvars are variables to be always included, while the lasso selects
among the othervars variables. The penalty $\lambda$ can be determined by CV
(option selection(cv)), adaptive CV (option selection(adaptive)), BIC
(option selection(bic)), or a plugin formula (option selection(plugin)).
For CV, the options include folds(#) for the number of folds. The plugin
methods are intended for nonprediction use of the lasso; see section 28.8.
Other options set tolerances for optimization.


The lasso command actually uses 1/2N rather than 1/N in (28.2). For 
clustered data, the option cluster(clustvar) defines the objective function 
to be the average over clusters of the within-cluster sums. Then (28.2) 
becomes 


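$$Q_{\lambda}(\boldsymbol{\beta}) = \frac{1}{N_{clust}}\sum_{c=1}^{N_{clust}}\left\{\frac{1}{T_c}\sum_{t=1}^{T_c}(y_{ct} - \mathbf{x}_{ct}'\boldsymbol{\beta})^2\right\} + \lambda\sum_{j=1}^{p}\kappa_j|\beta_j|$$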


CV then selects folds at the cluster level, which requires a considerable
number of clusters.
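What selecting folds at the cluster level means can be sketched as follows. The example is Python rather than Stata and is not a description of Stata's internal algorithm or random-number generator. It only illustrates the idea that every observation in a cluster is assigned to the same fold. The function name, the cluster structure, and the seed are hypothetical.

```python
import random


def assign_cluster_folds(cluster_ids, n_folds, seed):
    """Map each observation to a fold so that a whole cluster shares one fold."""
    clusters = sorted(set(cluster_ids))
    rng = random.Random(seed)          # fixed seed makes the partition reproducible
    rng.shuffle(clusters)
    # Deal the shuffled clusters into folds 1..n_folds in turn.
    fold_of_cluster = {c: i % n_folds + 1 for i, c in enumerate(clusters)}
    return [fold_of_cluster[c] for c in cluster_ids]


# Hypothetical data: 10 clusters with 4 observations each (40 observations).
cluster_ids = [c for c in range(1, 11) for _ in range(4)]
folds = assign_cluster_folds(cluster_ids, n_folds=5, seed=10101)
print(list(zip(cluster_ids, folds))[:8])
```

With only 10 clusters, each fold holds just 2 clusters, which is why cluster-level CV needs a considerable number of clusters to be reliable.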


The lasso command output focuses on determination of the penalty $\lambda$.
Postestimation commands lassoinfo, lassoknots, lassoselect, cvplot, 
lassocoef, coefpath, lassogof, and bicplot provide additional 
information. These commands are illustrated below. 


The elasticnet command has syntax, options, and postestimation 
commands similar to those for the lasso command. Ridge estimates can be 
obtained using the elasticnet command with option alpha(0). Stata also
includes a sqrtlasso command, which is seldom used and is not covered 
here. 


28.4.2 Lasso linear regression example 


We apply the lasso linear command to the current data example. Because 
there are only 40 observations, we use 5-fold CV rather than the default of 10
folds. The five folds are determined by a random-number generator, so for 
replicability, we need to set the seed. 


. * Lasso linear using 5-fold CV
. lasso linear y x1 x2 x3, selection(cv, folds(5)) rseed(10101)

5-fold cross-validation with 100 lambdas ...
Grid value 1:     lambda = 1.591525   no. of nonzero coef. =       0
Folds: 1...5   CVF = 11.85738
Grid value 2:     lambda = 1.450138   no. of nonzero coef. =       1
Folds: 1...5   CVF = 11.60145
Grid value 3:     lambda = 1.321312   no. of nonzero coef. =       1
Folds: 1...5   CVF = 11.2296
Grid value 4:     lambda = 1.20393    no. of nonzero coef. =       1
Folds: 1...5   CVF = 10.87719
Grid value 5:     lambda = 1.096976   no. of nonzero coef. =       1
Folds: 1...5   CVF = 10.60149
Grid value 6:     lambda = .9995238   no. of nonzero coef. =       1
Folds: 1...5   CVF = 10.38463
Grid value 7:     lambda = .9107289   no. of nonzero coef. =       1
Folds: 1...5   CVF = 10.20522
Grid value 8:     lambda = .8298222   no. of nonzero coef. =       1
Folds: 1...5   CVF = 10.05685
Grid value 9:     lambda = .7561031   no. of nonzero coef. =       1
Folds: 1...5   CVF = 9.934201
Grid value 10:    lambda = .688933    no. of nonzero coef. =       1
Folds: 1...5   CVF = 9.829713
Grid value 11:    lambda = .6277301   no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.739804
Grid value 12:    lambda = .5719643   no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.666469
Grid value 13:    lambda = .5211525   no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.606777
Grid value 14:    lambda = .4748548   no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.562824
Grid value 15:    lambda = .43267     no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.525748
Grid value 16:    lambda = .3942328   no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.493472
Grid value 17:    lambda = .3592102   no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.460115
Grid value 18:    lambda = .327299    no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.43311
Grid value 19:    lambda = .2982226   no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.411316
Grid value 20:    lambda = .2717294   no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.393794
Grid value 21:    lambda = .2475897   no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.393523
Grid value 22:    lambda = .2255945   no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.40661
Grid value 23:    lambda = .2055533   no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.420332
Grid value 24:    lambda = .1872925   no. of nonzero coef. =       2
Folds: 1...5   CVF = 9.434326
... cross-validation complete ... minimum found

Lasso linear model                          No. of obs        =         40
                                            No. of covariates =          3
Selection: Cross-validation                 No. of CV folds   =          5

         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |     Description      lambda     coef.    R-squared        error
---------+----------------------------------------------------------------
       1 |    first lambda    1.591525         0      -0.0519     11.85738
      20 |   lambda before    .2717294         2       0.1666     9.393794
    * 21 | selected lambda    .2475897         2       0.1666     9.393523
      22 |    lambda after    .2255945         2       0.1655      9.40661
      24 |     last lambda    .1872925         2       0.1630     9.434326
* lambda selected by cross-validation.


The default grid for $\lambda$ is a decreasing logarithmic grid of 100 values with
$\lambda_j = \lambda_1 \times 10^{-4(j-1)/99}$, j = 2,...,100, where $\lambda_1$ is the smallest value at
which no variables are selected. Here $\lambda_1 = 1.591525$, and, for example,
$\lambda_2 = \lambda_1 \times 10^{-4/99} = 1.591525 \times 0.9111628 = 1.450138$.
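The grid formula is easy to check against the printed output. The following Python sketch (not Stata code) rebuilds the 100-point grid from the first grid value 1.591525 reported in the lasso output and compares a few entries with the listed values. The function name and its default arguments are illustrative choices, with the ratio 1e-4 between the last and first grid values taken from the formula above.

```python
def lambda_grid(lambda_max, n_grid=100, ratio=1e-4):
    """Decreasing log-spaced grid from lambda_max down to ratio*lambda_max."""
    # ratio ** (j/(n_grid-1)) equals 10**(-4*j/99) for the default settings.
    return [lambda_max * ratio ** (j / (n_grid - 1)) for j in range(n_grid)]


grid = lambda_grid(1.591525)

# Python lists are zero-indexed, so grid[j-1] is the j-th grid value.
lambda_2, lambda_11, lambda_21 = grid[1], grid[10], grid[20]
print(round(lambda_2, 6), round(lambda_11, 7), round(lambda_21, 7))
```

The three values agree with grid values 2, 11, and 21 in the lasso output up to rounding of the first grid value.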


The output shows that reducing the penalty to $\lambda_2 = 1.450$ led to the
inclusion of a regressor and that reducing the penalty to $\lambda_{11} = 0.628$ led to
the inclusion of a second regressor. The CV objective function continued to
decline to a minimum value of 9.394 at $\lambda_{21} = 0.248$. The results are listed
only to the 24th largest grid value of $\lambda$, rather than all 100 grid-point values,
because the minimum CV value has already been attained by then.


28.4.3 Lasso postestimation commands example 


The lassoknots command provides a summary of the values of $\lambda$ at which
variables are selected or deselected. Additionally, it lists which variables
were selected or deselected.


. * List the values of lambda at which variables are added or removed
. lassoknots

       |              No. of   CV mean |
       |             nonzero     pred. | Variables (A)dded, (R)emoved,
    ID |    lambda     coef.     error |  or left (U)nchanged
-------+-------------------------------+------------------------------
     2 |  1.450138         1  11.60145 | A x1
    11 |  .6277301         2  9.739804 | A x2
  * 21 |  .2475897         2  9.393523 | U
    24 |  .1872925         2  9.434326 | U
* lambda selected by cross-validation.


In this example, once a variable is selected, it remains selected. The option 
alllambdas gives results for all knots, and the option display() can 
produce additional statistics at each listed knot such as the number of 
nonzero coefficients and BIC. 


The lassoselect command enables specifying a particular value for the
optimal value $\lambda^*$ following lasso with the option selection(cv). The
command lassoselect ID=11 will change the selected $\lambda^*$ to that with
ID=11 (here $\lambda^* = 0.6277301$). And lassoselect lambda=0.50 will set $\lambda^*$ to
the grid value closest to 0.50 (here $\lambda_{13} = 0.5211525$).
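The choice of the closest grid value can be checked by hand. The Python sketch below (not Stata code) uses the grid values with IDs 11 to 15 copied from the lasso output in section 28.4.2 and picks the one nearest to 0.50. Measuring closeness by absolute distance is an assumption made for illustration; the text does not say which distance lassoselect uses. The function name is hypothetical.

```python
# Grid values copied from the lasso output (grid IDs 11 to 15).
grid = {11: 0.6277301, 12: 0.5719643, 13: 0.5211525, 14: 0.4748548, 15: 0.43267}


def closest_grid_id(grid, target):
    """ID of the grid value with the smallest absolute distance to target."""
    return min(grid, key=lambda grid_id: abs(grid[grid_id] - target))


chosen = closest_grid_id(grid, 0.50)
print(chosen, grid[chosen])
```

Grid value 13 lies about 0.021 above the target and grid value 14 about 0.025 below it, so ID 13 is selected, in line with the text.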


The cvplot command plots the CV objective function against $\lambda$ on a
logarithmic scale or reverse logarithmic scale (the default) or plots the CV
objective function against $\sum_j |\beta_j|$.


. * Plot the change in the penalized objective function as lambda changes 
. cvplot 


The plot is given in the first panel of figure 28.2 and shows that CV decreases
as the penalty $\lambda$ decreases until $\lambda = 0.248$, at which point CV begins to
increase.


Figure 28.2. Plots of CV values and coefficients as $\lambda$ changes


The coefpath command provides a similar plot for the standardized 
coefficients of each selected variable. 


. * Plot how estimated coefficients change with lambda
. coefpath, xunits(rlnlambda)


The plot is given in the second panel of figure 28.2. In this example, once a 
variable is selected, it remains selected. 


The lassoinfo command provides a summary of the lasso estimation. 


. * Provide a summary of the lasso
. lassoinfo

    Estimate: active
     Command: lasso

            |                                                  No. of
  Dependent |            Selection   Selection                selected
   variable |    Model      method   criterion      lambda   variables
------------+---------------------------------------------------------
          y |   linear          cv     CV min.    .2475897           2


The lassocoef command with the display(coef, ...) option can provide
three different sets of estimates for in-sample regression of y on the
lasso-selected regressors.


Standardized coefficients (the default) are the estimates from lasso of y 
on the standardized regressors. These are the estimates directly obtained by 
the lasso at \*. 


. * Lasso coefficients for the standardized regressors
. lassocoef, display(coef, standardized)

             |    active
-------------+-----------
          x1 |  1.206056
          x2 |  .2715635
       _cons |         0
Legend:
  b - base level
  e - empty cell
  o - omitted


Penalized coefficients are the preceding standardized lasso estimates 
rescaled so that the standardization of variables is removed. These estimates 
can be interpreted in terms of the original data before standardization. 


. * Lasso coefficients for the unstandardized regressors
. lassocoef, display(coef, penalized) nolegend

             |    active
-------------+-----------
          x1 |   1.35914
          x2 |  .2918877
       _cons |  2.617622


The penalized coefficients are similar to the standardized coefficients 
because in this example the variables x1 and x2 had variances close to one. 


Postselection coefficients are obtained by OLS regression of y on the
lasso-selected regressors, here x1 and x2. These are sometimes referred to as
postlasso OLS estimates. Belloni and Chernozhukov (2013) find that the
postlasso OLS estimator has lower bias and rate of convergence at least as
good as the lasso estimator, and this is the case even if the lasso fails to
include some relevant variables.


. * Postselection estimated coefficients for the unstandardized regressors
. lassocoef, display(coef, postselection) nolegend

             |    active
-------------+-----------
          x1 |  1.544198
          x2 |  .4683922
       _cons |  2.533663


As expected, the lasso-penalized estimates of the coefficients of the selected 
variables (1.359 and 0.292) were smaller than these OLS estimates. 


The lassogof command provides the goodness of fit with penalized 
coefficients (the default) or with postselection coefficients. We have 


. * Goodness of fit with penalized coefficients and postselection coefficients 
. lassogof, penalized 


Penalized coefficients 


MSE R-squared Obs 


8.679274 0.2300 40 


. lassogof, postselection 


Postselection coefficients 


MSE R-squared Obs 


8.597958 0.2372 40 


The postselection estimator is OLS, which maximizes $R^2$ because it
minimizes the sum of squared residuals. The lasso added a penalty, which
necessarily leads to smaller in-sample $R^2$. The difference here between
0.2372 and 0.2300 is not great.


Finally, we verify that the postselection estimates are indeed obtained by
OLS of y on the lasso-selected variables.


. * Compare with OLS with the lasso-selected regressors
. regress y x1 x2, noheader

           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x1 |   1.544198   .6305617     2.45   0.019     .2665582    2.821837
          x2 |   .4683922   .6014166     0.78   0.441    -.7501936    1.686978
       _cons |   2.533663   .5159805     4.91   0.000     1.488188    3.579139


28.4.4 Adaptive lasso


The adaptive lasso (Zou 2006) is a multistep lasso method that usually leads
to fewer variables being selected compared with the basic CV method.


The preceding analysis set $\kappa_j = 1$ in (28.2). Adaptive lasso also begins
with regular CV lasso (or CV ridge) with $\kappa_j = 1$. Adaptive lasso then does a
second lasso that excludes variables with $\hat{\beta}_j = 0$ and for the remainder sets
$\kappa_j = 1/|\hat{\beta}_j|^{\delta}$ with default $\delta = 1$, which favors variables with a larger
coefficient because they receive a smaller penalty. The default is to have one
adaptive step, but additional adaptive steps can be requested.
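The second-step penalty weights are simple to compute by hand. The Python sketch below (not Stata code) applies the weight rule with exponent 1 to the first-step standardized lasso coefficients reported in section 28.4.3 (1.206056 for x1, 0.2715635 for x2, and zero for x3). Treating exactly these numbers as the inputs Stata uses internally is an assumption made for illustration, and the function name is hypothetical.

```python
def adaptive_weights(first_step, delta=1.0):
    """kappa_j = 1/|b_j|**delta for variables with a nonzero first-step coefficient."""
    # Variables with a zero first-step coefficient are excluded from the second lasso.
    return {name: 1.0 / abs(b) ** delta
            for name, b in first_step.items() if b != 0.0}


# First-step standardized lasso coefficients from section 28.4.3; x3 was not selected.
first_step = {"x1": 1.206056, "x2": 0.2715635, "x3": 0.0}
weights = adaptive_weights(first_step)
print(weights)
```

The weight on x2 is roughly 4.4 times the weight on x1, so x2 is penalized much more heavily in the second step, which is consistent with the adaptive lasso selecting fewer variables.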


For the current example with one adaptive step, we obtain 


. * Lasso linear using 5-fold adaptive CV 
. qui lasso linear y x1 x2 x3, selection(adaptive, folds(5)) rseed(10101) 


. lassoknots

       |              No. of   CV mean |
       |             nonzero     pred. | Variables (A)dded, (R)emoved,
    ID |    lambda     coef.     error |  or left (U)nchanged
-------+-------------------------------+------------------------------
    26 |  3.945214         1  11.60145 | A x1
  * 52 |  .3512089         1  9.160539 | U
    57 |  .2205694         2  9.210699 | A x2
    95 |  .0064297         2  9.172378 | U
* lambda selected by cross-validation in final adaptive step.


Now the optimal choice of $\lambda$ leads to only x1 being selected.


The selection(none) option fits at each value of $\lambda$ on the grid but does
not select an optimal value of $\lambda$.


. qui lasso linear y x1 x2 x3, selection(none) 


. lassoknots 
No. of 
nonzero In-sample Variables (A)dded, (R)emoved, 
ID lambda coef. R-squared or left (U)nchanged 
2 1.450138 1 0.0382 A x1 
11 .6277301 2 0.1908 A x2 
52 .0138423 3 0.2372 A x3 
62 .0054597 3 0.2373 U 


Note: No lambda selected. lassoselect can be used to select lambda. 


For example, all three variables are selected if $\lambda \le 0.0138423$.


The selection(bic) option uses the BIC, computationally faster than 
using CV, with the number of parameters set to the number of nonzero 
coefficients. The default is to evaluate the BIC at the penalized coefficients;
the selection(bic, postselection) option instead evaluates at the 
postselection coefficients. The BIC is computed using a quasilikelihood 
function that assumes independence of observations, so care is needed in 
using it with clustered data. 


The selection(plugin) option uses a plugin iterative procedure to 
determine $\lambda^*$. This option is intended for use in estimation, rather than
prediction, and is presented in section 28.8. 


28.4.5 elasticnet command and ridge regression 


The elasticnet command for ridge and elastic net estimation has syntax, 
options, and postestimation commands similar to those for the lasso 
command. 


For the elastic net objective function given in (28.3), ridge regression is
the special case $\alpha = 0$, and lasso is the special case $\alpha = 1$. Similarly, for the
elasticnet command, the option alpha(0) implements ridge regression,
and the option alpha(1) implements lasso.


We begin with ridge regression, using the option alpha(0). Using five-fold
CV to obtain the optimal $\lambda$, we have


. * Ridge estimation using the elasticnet command and selected results 
. qui elasticnet linear y x1 x2 x3, alpha(0) rseed(10101) selection(cv, folds(5)) 


. lassoknots

         |                No. of   CV mean |
         |               nonzero     pred. | Variables (A)dded, (R)emoved,
alpha ID |     lambda      coef.     error |  or left (U)nchanged
---------+---------------------------------+------------------------------
0.000    |
       1 |   1591.525          3   11.9595 | A x1 x2 x3
    * 93 |   .3052401          3   9.54017 | U
     100 |   .1591525          3  9.566065 | U
* alpha and lambda selected by cross-validation.

. lassocoef, display(coef, penalized) nolegend

             |    active
-------------+-----------
          x1 |  1.139476
          x2 |  .4865453
          x3 |  .0958546
       _cons |  2.659647

. lassogof, penalized

Penalized coefficients

         MSE    R-squared        Obs
     8.70562       0.2277         40


The ridge coefficient estimates are on average shrunken toward 0 compared
with the OLS slope estimates of, respectively, 1.555, 0.471, and -0.026,
given in section 28.2. And $R^2$ has fallen from 0.2373 to 0.2277.


For elastic net regression, the elasticnet command performs a two-dimensional
grid search over both $\lambda$ and $\alpha$. The default for $\lambda$ is the same
logarithmic grid with 100 points as used by lasso, while $\alpha$ = 0.5, 0.75, 1.0.
For this example, the defaults led to $\alpha = 1$, so elastic net reduced to lasso.
We specify a narrower grid that leads to $\alpha = 0.95$.


. * Elastic net estimation and selected results 
. qui elasticnet linear y x1 x2 x3, alpha(0.9(0.05)1) rseed(10101) 
> selection(cv, folds(5)) 


. lassoknots

         |                No. of   CV mean |
         |               nonzero     pred. | Variables (A)dded, (R)emoved,
alpha ID |     lambda      coef.     error |  or left (U)nchanged
---------+---------------------------------+------------------------------
1.000    |
       4 |   1.450138          1  11.60145 | A x1
      13 |   .6277301          2  9.739804 | A x2
      26 |   .1872925          2  9.434326 | U
0.950    |
      29 |   1.591525          1  11.73019 | A x1
      38 |    .688933          2   9.81611 | A x2
    * 48 |   .2717294          2    9.3884 | U
      51 |   .2055533          2  9.425887 | U
0.900    |
      53 |   1.675289          1  11.74015 | A x1
      62 |   .7561031          2  9.900317 | A x2
      76 |   .2055533          2  9.431641 | U
* alpha and lambda selected by cross-validation.

. lassocoef, display(coef, penalized) nolegend

             |    active
-------------+-----------
          x1 |  1.329744
          x2 |  .2908281
       _cons |  2.627567

. lassogof, penalized

Penalized coefficients

         MSE    R-squared        Obs
    8.693386       0.2288         40


The optimal values of $\alpha = 0.95$ and $\lambda = 0.2717$ lead to selection of x1 and
x2. The penalized coefficient estimates and MSE are close to the lasso
estimates, the case $\alpha = 1$. In real-life examples with many more regressors,
we expect a bigger difference.


28.4.6 In-sample comparison of shrinkage estimators 


The results across several shrinkage model estimates can be compared using 
the postestimation commands lassocoef, lassogof, and lassoinfo. 


First, save model results using the estimates store command. 


. * Fit various models and store results
. qui regress y x1 x2 x3

. estimates store OLS

. qui lasso linear y x1 x2 x3, selection(cv, folds(5)) rseed(10101)

. estimates store LASCV

. qui lasso linear y x1 x2 x3, selection(adaptive, folds(5)) rseed(10101)

. estimates store LASADAPT

. qui lasso linear y x1 x2 x3, selection(plugin)

. estimates store LASPLUG

. qui elasticnet linear y x1 x2 x3, alpha(0) selection(cv, folds(5))
>     rseed(10101)

. estimates store RIDGECV

. qui elasticnet linear y x1 x2 x3, alpha(0.9(0.05)1) rseed(10101)
>     selection(cv, folds(5))

. estimates store ELASTIC


We compare in-sample model fit and the specific variables selected. The
comparison below uses penalized coefficient estimates for standardized
variables. For unpenalized postselection estimates of unstandardized
variables, use lassogof option postselection and lassocoef option
display(coef, postselection).


. * Compare in-sample fit and selected coefficients of various models
. lassogof OLS LASCV LASADAPT LASPLUG RIDGECV ELASTIC

Penalized coefficients

        Name |       MSE    R-squared        Obs
-------------+-----------------------------------
         OLS |  8.597403       0.2373         40
       LASCV |  8.679274       0.2300         40
    LASADAPT |  8.755573       0.2232         40
     LASPLUG |  10.23264       0.0922         40
     RIDGECV |   8.70562       0.2277         40
     ELASTIC |  8.693386       0.2288         40

. lassocoef OLS LASCV LASADAPT LASPLUG RIDGECV ELASTIC, display(coef) nolegend

             |       OLS      LASCV   LASADAPT    LASPLUG    RIDGECV    ELASTIC
-------------+------------------------------------------------------------------
          x1 |  1.555582   1.206056   1.462431   .3693423   1.011134   1.179972
          x2 |  .4707111   .2715635                          .452667   .2705777
          x3 | -.0256025                                    .0979251
       _cons |  2.531396          0          0          0          0          0


CV lasso selects x1 and x2, while adaptive lasso, which provides a bigger
penalty than CV lasso, selects only x1. Ridge by construction retains all
variables, while the elastic net in this example selected x1 and x2.


All methods have similar in-sample MSE and $R^2$, aside from plugin lasso.


If we had additionally made out-of-sample predictions, then we expect that 
the shrinkage estimators would predict better than OLS with regressors x1, x2, 
and x3. 


28.4.7 Shrinkage for logit, probit, and Poisson models 


In principle, the lasso, ridge, and elastic net penalties can be applied to 
objective functions other than the sum of squared residuals used for linear 
regression. 


In particular, for generalized linear models, the objective function uses 
the sum of squared deviance residuals, defined in section 13.8.3, rather than 
the sum of squared residuals. The lasso and elasticnet commands can 
also be applied to logit, probit, and Poisson models. 


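The squared residual $(y_i - \mathbf{x}_i'\boldsymbol{\beta})^2$ in (28.1), (28.2), and (28.3) is replaced
by the squared deviance residual. For logit, this term is
$-2[y_i \ln \Lambda(\mathbf{x}_i'\boldsymbol{\beta}) + (1 - y_i) \ln\{1 - \Lambda(\mathbf{x}_i'\boldsymbol{\beta})\}]$, where $\Lambda(z) = e^z/(1 + e^z)$. For
probit, we use $-2[y_i \ln \Phi(\mathbf{x}_i'\boldsymbol{\beta}) + (1 - y_i) \ln\{1 - \Phi(\mathbf{x}_i'\boldsymbol{\beta})\}]$, where $\Phi(\cdot)$ is the
standard normal cumulative distribution function. For Poisson, we use
$-2\{y_i \mathbf{x}_i'\boldsymbol{\beta} - \exp(\mathbf{x}_i'\boldsymbol{\beta}) - v_i\}$, where $v_i = 0$ if $y_i = 0$ and $v_i = y_i \ln(y_i) - y_i$
otherwise.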
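The squared deviance residuals can be evaluated directly to see that they behave like a squared residual, being zero at a perfect fit and positive otherwise. The Python sketch below (not Stata code) codes the logit and Poisson terms with the leading minus sign, so that each term is minus twice the log-likelihood contribution, measured for Poisson relative to the saturated model. The function names are hypothetical, and xb stands for the linear index.

```python
import math


def logit_dev2(y, xb):
    """Squared deviance residual for a logit model with outcome y in {0, 1}."""
    p = 1.0 / (1.0 + math.exp(-xb))  # Lambda(xb) = e^xb / (1 + e^xb)
    return -2.0 * (y * math.log(p) + (1 - y) * math.log(1.0 - p))


def poisson_dev2(y, xb):
    """Squared deviance residual for a Poisson model with count outcome y."""
    v = 0.0 if y == 0 else y * math.log(y) - y  # saturated-model term
    return -2.0 * (y * xb - math.exp(xb) - v)


print(logit_dev2(1, 0.0))             # fitted probability 0.5 when y = 1
print(poisson_dev2(0, 0.0))           # fitted mean 1 when y = 0
print(poisson_dev2(3, math.log(3)))   # fitted mean equals y, so essentially zero
```

In the Poisson case the term is zero, up to floating-point error, when the fitted mean equals the observed count, which is the analogue of a zero residual in the linear model.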


The related Stata commands in the case of lasso are, respectively, lasso 
logit, lasso probit, and lasso poisson. 


To illustrate the method for a binary variable, we convert y to a variable
dy that takes value 1 if y > 3 and implement lasso for a logit model with $\lambda$
determined by five-fold CV.


. * Lasso for logit example 
. qui generate dy = y > 3 


. qui lasso logit dy x1 x2 x3, rseed(10101) selection(cv, folds(5)) 


. lassoknots

       |              No. of
       |             nonzero    CV mean | Variables (A)dded, (R)emoved,
    ID |    lambda     coef.   deviance |  or left (U)nchanged
-------+--------------------------------+------------------------------
     2 |  .2065674         1     407613 | A x1
  * 24 |  .0266792         1     192646 | U
    26 |  .0221495         2    .192865 | A x2
    30 |  .0152668         3    .194545 | A x3
    31 |  .0139106         3    .195055 | U
* lambda selected by cross-validation.


The optimal $\lambda$ leads to selection of only x1.


For a count example, we create a Poisson variable ycount that takes 
values between 0 and 7 and whose mean depends on only x1. lasso with 
five-fold cv yields 


. * Lasso for count data example
. qui generate ycount = rpoisson(exp(-1 + x1))

. qui lasso poisson ycount x1 x2 x3, rseed(10101) selection(cv, folds(5))

. lassoknots

       |              No. of
       |             nonzero    CV mean | Variables (A)dded, (R)emoved,
    ID |    lambda     coef.   deviance |  or left (U)nchanged
-------+--------------------------------+------------------------------
     2 |  1.012329         1   2.191141 | A x1
  * 25 |   .119132         1   .8257619 | U
    29 |  .0821131         2   .8334985 | A x3
* lambda selected by cross-validation.


Again, the optimal $\lambda$ leads to selection of only x1.


28.5 Dimension reduction 


Dimension-reduction methods reduce the number of regressors from p to
m < p linear combinations of regressors. Thus, given initial model
$\mathbf{y} = \beta_0 + \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, where X is N x p, we form matrix Z = XA, where A
is p x m and Z is N x m. Then, we fit the model $\mathbf{y} = \gamma_0 + \mathbf{Z}\boldsymbol{\gamma} + \mathbf{v}$.


Here we present principal components, a long-standing method that uses 
only X to form A (unsupervised learning). A related method is partial least
squares, which additionally uses the relationship between y and X to form 
A (supervised learning). Principal components is the method most often 
used in econometrics studies and can be used in a very wide range of 
applications. 


28.5.1 Principal components 


The principal components method selects the linear combinations of 
regressors, called principal components, as follows. The first principal 
component has the largest sample variance among all normalized linear 
combinations of the columns of X. The second principal component has the 
largest sample variance subject to being orthogonal to the first, and so on. 
More formally, the jth principal component is the N x 1 vector $\mathbf{X}\mathbf{h}_j$, where
$\mathbf{h}_j$ is the eigenvector corresponding to $\lambda_j$, the jth largest eigenvalue of $\mathbf{X}'\mathbf{X}$.


The principal components are not invariant to the scaling of X, and it is 
common practice to apply principal components to data that have been 
standardized to have mean zero and variance one. Let X* denote the 
regressor matrix after this standardization. The Stata pca command 
computes the principal components. The default option is the correlation 
option that is equivalent to automatically standardizing the data before 
analysis. So with this default option, there is no need to first standardize the 
regressors. We obtain 


. * Principal components using default option that first standardizes the data
. pca x1 x2 x3

Principal components/correlation                 Number of obs    =         40
                                                 Number of comp.  =          3
                                                 Trace            =          3
    Rotation: (unrotated = principal)            Rho              =     1.0000

       Component |   Eigenvalue   Difference         Proportion   Cumulative
    -------------+------------------------------------------------------------
           Comp1 |      1.81668      1.08919             0.6056       0.6056
           Comp2 |      .727486       .27165             0.2425       0.8481
           Comp3 |      .455836            .             0.1519       1.0000

Principal components (eigenvectors)

        Variable |    Comp1     Comp2     Comp3 | Unexplained
    -------------+------------------------------+-------------
              x1 |   0.6306   -0.1063   -0.7688 |           0
              x2 |   0.5712   -0.6070    0.5525 |           0
              x3 |   0.5254    0.7876    0.3220 |           0

The output includes the three eigenvalues and three eigenvectors. The data
are standardized automatically, so each of the three variables has variance
one, and the sum of the variances is three. The variance of each principal
component equals the corresponding eigenvalue, so the first principal
component has variance 1.81668 and explains a fraction
1.81668/3 = 0.6056 of the total variance.
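The first eigenvalue and eigenvector can be reproduced approximately from the pairwise correlations of x1, x2, and x3 alone. The Python sketch below (not Stata code) uses the correlations 0.5077, 0.4281, and 0.2786 that appear in the correlate output at the end of this section and finds the leading eigenpair by power iteration. Power iteration is simply a convenient stdlib-only technique here, not a claim about how pca computes its results, and the function name is hypothetical. Because the correlations are rounded to four decimals, agreement is only to about three decimals.

```python
def leading_component(corr, n_iter=200):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    k = len(corr)
    v = [1.0 / k ** 0.5] * k  # start from equal positive weights
    for _ in range(n_iter):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v'Rv gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(k)) for i in range(k))
    return eigenvalue, v


# Correlation matrix of x1, x2, x3, rounded to four decimals.
corr = [[1.0, 0.5077, 0.4281],
        [0.5077, 1.0, 0.2786],
        [0.4281, 0.2786, 1.0]]

eigenvalue, loadings = leading_component(corr)
share = eigenvalue / len(corr)  # fraction of total variance explained
print(round(eigenvalue, 3), [round(h, 3) for h in loadings], round(share, 3))
```

The results match the Comp1 row and column of the pca output up to the rounding of the input correlations.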


Note that the same results are obtained if we use the previously
standardized variables zx1-zx3 and the covariance option, specifically,
command pca zx1 zx2 zx3, covariance.


The predict postestimation command constructs variables equal to the 
three principal components. We obtain 


. * Compute the three principal components and their means, st.devs., correlations
. predict pc1 pc2 pc3
(score assumed)

Scoring coefficients
    sum of squares(column-loading) = 1

        Variable |    Comp1     Comp2     Comp3
    -------------+------------------------------
             zx1 |   0.6306   -0.1063   -0.7688
             zx2 |   0.5712   -0.6070    0.5525
             zx3 |   0.5254    0.7876    0.3220

. summarize pc1 pc2 pc3

    Variable |        Obs        Mean    Std. dev.        Min        Max
-------------+----------------------------------------------------------
         pc1 |         40   -3.35e-09    1.347842    -2.52927   2.925341
         pc2 |         40   -3.63e-09    .8529281   -1.854475    1.98207
         pc3 |         40    2.08e-09    .6751564   -1.504279   1.520466

. correlate pc1 pc2 pc3
(obs=40)

             |      pc1      pc2      pc3
-------------+---------------------------
         pc1 |   1.0000
         pc2 |   0.0000   1.0000
         pc3 |  -0.0000  -0.0000   1.0000


The principal components have mean zero, standard deviation equal to the
square root of the corresponding eigenvalue, for example,
$\sqrt{1.81668} = 1.3478$, and are uncorrelated.


The principal components are computed applying the relevant 
eigenvectors to the standardized variables. For example, the first principal 
component is computed as follows: 


. * Manually compute the first principal component and compare to pc1
. generate double pc1manual = 0.6306*zx1 + 0.5712*zx2 + 0.5254*zx3

. summarize pc1 pc1manual

    Variable |        Obs        Mean    Std. dev.        Min        Max
-------------+----------------------------------------------------------
         pc1 |         40   -3.35e-09    1.347842    -2.52927   2.925341
   pc1manual |         40   -9.02e-18    1.347822   -2.529204   2.925356


The principal components are obtained without any consideration of
regression on a variable y. If we regress y on all p principal components, we
necessarily get the same predicted values of y and the same $R^2$ as if we
regress y on all the original p regressors. The hope is that if we regress y on
just, say, the first m < p principal components, then we obtain fit better than
that obtained by arbitrarily picking m regressors and not much worse than if
we used all p regressors. There is no guarantee this will happen, but in
practice it often does.


The following example gives correlations of the dependent variable with 
fitted values from regression on, respectively, all three regressors, the first 
principal component, x1, x2, and x3. Recall that the square of these
correlations equals $R^2$ from the corresponding OLS regression.


. * Compare R from OLS on all three regressors, on pc1, on x1, on x2, on x3
. qui regress y x1 x2 x3

. predict yhat
(option xb assumed; fitted values)

. correlate y yhat pc1 x1 x2 x3
(obs=40)

             |        y     yhat      pc1       x1       x2       x3
-------------+------------------------------------------------------
           y |   1.0000
        yhat |   0.4871   1.0000
         pc1 |   0.4444   0.9122   1.0000
          x1 |   0.4740   0.9732   0.8499   1.0000
          x2 |   0.3370   0.6919   0.7700   0.5077   1.0000
          x3 |   0.2046   0.4200   0.7082   0.4281   0.2786   1.0000


There is some loss in fit because of using only the first principal component.
The correlation has fallen from 0.4871 to 0.4444, corresponding to a fall in
$R^2$ from 0.237 to 0.197. Regression on x1 alone has better fit, as expected
because the DGP in this example depended on x1 alone, while regressions on
x2 alone and on x3 alone do not fit nearly as well as regression on the first
principal component.
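The $R^2$ values quoted in this comparison follow from squaring the first column of the correlation table. The short Python sketch below (not Stata code) does the arithmetic for all five regressions, using the correlations of y with yhat, pc1, x1, x2, and x3 copied from the correlate output. The dictionary labels are illustrative.

```python
# Correlations with y, copied from the first column of the correlate output.
corr_with_y = {
    "all three regressors": 0.4871,  # correlation of y with yhat
    "pc1": 0.4444,
    "x1": 0.4740,
    "x2": 0.3370,
    "x3": 0.2046,
}

# R-squared of each regression is the squared correlation.
r_squared = {name: round(r * r, 3) for name, r in corr_with_y.items()}

for name, value in r_squared.items():
    print(f"{name:>22}: {value:.3f}")
```

The ranking that results is x1 alone above the first principal component, which in turn is well above x2 alone and x3 alone, as described in the text.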


28.6 Machine learning methods for prediction 


Machine learning methods are algorithms that determine a predictor using 
only the data at hand, rather than by using a researcher-specified model. 


The terms “machine learning”, “statistical learning”, and “data science” are 
to some extent interchangeable. 


The lasso and other shrinkage estimators are leading examples of 
methods used in machine learning. In this section, we present additional 
machine learning methods. 


The machine learning literature distinguishes between supervised 
learning, where an outcome variable y is observed, and unsupervised 
learning, where no outcome variable is observed. Within supervised 
learning, distinction is made between an outcome measured on a cardinal 
scale, most often continuous, and an outcome that is categorical. The latter 
case is referred to as classification. 


Machine learning methods are often applied to big data, where the term 
“big data” can mean either many observations or many variables. It 
includes the case where the number of variables exceeds the number of 
observations, even if there are relatively few observations. 


By allowing potential regressors to include powers and interactions of 
underlying variables, a linear (in parameters) model used by shrinkage 
estimators such as the lasso may actually explain the outcome sufficiently 
well. Other machine learning methods, such as neural networks and 
regression trees, do not transform the underlying variables but instead fit 
models that can be very nonlinear in these underlying variables. 


The following overview summarizes some additional methods for 
prediction, many from the machine learning literature. Some of these 
methods are illustrated in the subsequent application section. The 
presentation is very dense and more advanced than much of the other 
material in this book. Little detail is provided on these methods, such as 
determination of necessary tuning parameters akin to $\lambda$ for the lasso, though 
see chapter 27 for nonparametric and semiparametric methods. For more 
details see, for example, James et al. (2021) or Hastie, Tibshirani, and 
Friedman (2009). 


28.6.1 Supervised learning for continuous outcome 


We have a continuous outcome y that, given predictors x, is predicted by 
the function $g(\mathbf{x})$. OLS uses $g(\mathbf{x}) = \mathbf{x}'\boldsymbol{\beta}$ or, more precisely, 
$g(\mathbf{x}) = \mathbf{z}'\boldsymbol{\beta}$, where the regressors z are specified functions of x such as 
transformations and interactions. For simplicity, we do not distinguish 
between the underlying variables x and the regressors z formed from x. 


A quite general model for $g(\mathbf{x})$ is a fully nonparametric model such as 
kernel regression or local polynomial regression, with the function $g(\cdot)$ 
unspecified. Such models can be fit using the npregress command; see 
section 27.2.5. But this yields an imprecise estimate of $g(\cdot)$ for 
high-dimensional x, a problem referred to as the curse of dimensionality, and 
the method is not suited to prediction outside the domain of x. 


The econometrics literature has sought to overcome the curse of 
dimensionality by fitting semiparametric models that reduce the 
dimensionality of the nonparametric component, enabling estimation and 
inference on the parametric component. The leading examples (partial 
linear, single-index, and generalized additive models) were presented in 
sections 27.6-27.8. These semiparametric models are used to obtain 
estimates of parameters or partial effects, rather than for prediction per se. 
In section 28.8, we present estimation of key parameters in a partial linear 
model using lasso to select control variables. 


28.6.2 Neural networks 


Neural networks lead to quite flexible nonlinear models for g(x). These 
models introduce a series of hidden layers between the outcome y and the 
regressors x. Deep learning methods use neural networks. 


For example, a neural network with two layers introduces an 
intermediate layer between input variables x and the output y. The 
intermediate layer is composed of M intermediate units or hidden variables 
$z_m$, $m = 1, \dots, M$, that are each a nonlinear transformation of a linear 
combination of the inputs x, so $z_m = g(\alpha_{0m} + \mathbf{x}'\boldsymbol{\alpha}_m)$ for specified 
function $g(\cdot)$. 


Initial research often used the sigmoid function $g(v) = 1/(1 + e^{-v})$. 
More recently, it is common to use rectified linear units with 
$g(v) = \max(0, v)$. The output is then a linear combination of the M hidden 
units, or a transformation of this linear combination, so $E(y|\mathbf{x}) = h(t)$, 
where $t = \beta_0 + \mathbf{z}'\boldsymbol{\beta}$ and usually $h(t) = t$. Given $g(v) = 1/(1 + e^{-v})$ and 
$h(t) = t$, a two-layer neural network reduces to the nonlinear model 
$E(y|\mathbf{x}) = \beta_0 + \sum_{m=1}^{M} \beta_m / \{1 + e^{-(\alpha_{0m} + \mathbf{x}'\boldsymbol{\alpha}_m)}\}$. If MSE is the loss function, 
then estimation of the various $\alpha$ and $\beta$ parameters is a nonlinear least 
squares problem. 
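To make the two-layer formula concrete, the following Python sketch evaluates the prediction for given parameter values. It is an illustration only: the function names and all parameter values are hypothetical, and no estimation is performed.

```python
import math

def sigmoid(v):
    """Sigmoid activation g(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def predict_two_layer(x, alpha0, alpha, beta0, beta):
    """E(y|x) = beta0 + sum_m beta[m] * g(alpha0[m] + x'alpha[m]) with h(t) = t."""
    # Hidden units z_m: sigmoid of a linear combination of the inputs
    z = [sigmoid(a0 + sum(xj * aj for xj, aj in zip(x, a)))
         for a0, a in zip(alpha0, alpha)]
    # Output: linear combination of the hidden units
    return beta0 + sum(bm * zm for bm, zm in zip(beta, z))

# Hypothetical values: 2 inputs and M = 2 hidden units
x = [1.0, 2.0]
alpha0 = [0.0, 0.0]
alpha = [[0.0, 0.0], [2.0, -1.0]]   # both linear indexes equal 0 at this x
beta0 = 1.0
beta = [2.0, 4.0]

pred = predict_two_layer(x, alpha0, alpha, beta0, beta)
# Each hidden unit equals g(0) = 0.5, so pred = 1 + 2(0.5) + 4(0.5) = 4
```

Replacing sigmoid with max(0, v) would give the rectified linear unit version of the same model.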


More complicated neural net models add additional hidden layers. 
There is a tendency to overfit, and a ridge-regression-type penalty may be 
used. There is an art to estimation of neural network models because they 
entail several tuning parameters: the number of layers, the number of 
hidden variables in each layer, the function $g(\cdot)$, and a penalty term for 
overfitting. 


Neural network models are highly nonlinear. Estimation uses stochastic 
gradient descent, a variation of gradient-based methods that at each step 
calculates the gradient at a randomly chosen subset of the observations. 
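The idea of stochastic gradient descent can be sketched in Python for the simplest case of a linear model with squared-error loss, rather than for a neural network. The function names, step size, batch size, and data below are all assumptions chosen for illustration.

```python
import random

def sgd_linear(x, y, batch_size=4, steps=2000, lr=0.05, seed=10101):
    """Minimize mean squared error of y = b0 + b1*x by minibatch gradient steps."""
    rng = random.Random(seed)
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        # Gradient is computed on a randomly chosen subset of the observations
        batch = rng.sample(range(n), batch_size)
        g0 = sum(2 * (b0 + b1 * x[i] - y[i]) for i in batch) / batch_size
        g1 = sum(2 * (b0 + b1 * x[i] - y[i]) * x[i] for i in batch) / batch_size
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

def mse(x, y, b0, b1):
    return sum((b0 + b1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

# Made-up noiseless data generated from y = 1 + 2x
x = [i / 10 for i in range(20)]
y = [1.0 + 2.0 * xi for xi in x]

initial_mse = mse(x, y, 0.0, 0.0)
b0_hat, b1_hat = sgd_linear(x, y)
final_mse = mse(x, y, b0_hat, b1_hat)
```

Each step uses only batch_size observations, so a step is cheap even when the sample is very large; the cost is a noisier path toward the minimum than full gradient descent.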


28.6.3 Regression trees 


Regression trees sequentially split regressors x into regions that best predict 
y. The prediction of y at a given value $\mathbf{x}_0$ that falls in a region $R_k$ is then 
the average of y across all observations for which $\mathbf{x} \in R_k$. This is 
equivalent to regression of y on a set of mutually exclusive indicator 
variables where each indicator variable corresponds to a given region of x 
values. 


Suppose we first split on the jth variable $x_j$ at point s. Defining regions 
$R_1(j,s) = \{\mathbf{x} | x_j \le s\}$ and $R_2(j,s) = \{\mathbf{x} | x_j > s\}$, the MSE is 

$$(1/N)\sum_{i:\mathbf{x}_i \in R_1(j,s)} (y_i - \bar{y}_{R_1})^2 + (1/N)\sum_{i:\mathbf{x}_i \in R_2(j,s)} (y_i - \bar{y}_{R_2})^2.$$

The first split is based on a search over regressors $x_j$, $j = 1, \dots, p$, and split 
points s to obtain (j, s) that minimizes this MSE. We then next search over 
possible splits of $R_1$ and $R_2$, with possible split on any of the p regressors, 
and choose the additional split that minimizes MSE, and so on. 
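The search for the first split can be sketched in Python as follows. This is an illustrative brute-force search, not the algorithm of any particular package; the function names and data are made up, and assigning observations with $x_j = s$ to the first region is an assumption.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_first_split(X, y):
    """Search over regressors j and split points s; return (mse, j, s)."""
    n, p = len(y), len(X[0])
    best = None
    for j in range(p):
        for s in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            if not left or not right:
                continue   # a split must create two nonempty regions
            mse = (sse(left) + sse(right)) / n
            if best is None or mse < best[0]:
                best = (mse, j, s)
    return best

# Made-up data: y jumps when the first regressor exceeds 3
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0], [5.0, 3.0], [6.0, 6.0]]
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]

mse, j, s = best_first_split(X, y)   # selects regressor index 0 at s = 3.0
```

Growing the tree further amounts to applying the same search within each of the two resulting regions and keeping the split that lowers the overall MSE the most.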


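After K splits, the MSE is $(1/N)\sum_{k}\sum_{i:\mathbf{x}_i \in R_k} (y_i - \bar{y}_{R_k})^2$, where 
$R_k$ denotes the kth terminal node and the outer sum is over all terminal 
nodes. The prediction for $\mathbf{x}_0 \in R_k$ is then the average of y over all 
$\mathbf{x}_i \in R_k$. So 

$$\hat{g}(\mathbf{x}_0) = \Big\{\sum_{i=1}^{N} \mathbf{1}(\mathbf{x}_i \in R_k)\, y_i\Big\} \Big/ \Big\{\sum_{i=1}^{N} \mathbf{1}(\mathbf{x}_i \in R_k)\Big\},$$

where $\mathbf{1}(A)$ is an indicator function equal to one if event A occurs and 
equal to zero otherwise. 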


Implementation requires specification of the depth of the tree and the 
minimum number of observations in the terminal nodes of the tree. The 
method takes a so-called greedy approach that determines the best split at 
each step without looking ahead and picking a split that could lead to a 
better tree in some future step. Thus, a change in the residual sum of squares 
is not used as a stopping criterion because better splits may still be possible. 
Instead, it is best to overfit with more splits than may be ideal and then 
prune back using a penalty function such as $\lambda|T|$, where $|T|$ is the number 
of terminal nodes. 


Simple regression trees have the advantage of interpretability if there 
are few regressors. However, predictions from a single regression tree have 
high variance. For example, splitting the sample into two can lead to two 
quite different trees. 


28.6.4 Bagging 


Bagging and boosting are general methods for improving prediction that 
work especially well for regression trees. 


Bagging, a shortening of “bootstrap aggregating”, reduces prediction 
variance by obtaining predictions for several different samples and 
averaging these predictions. The different samples are obtained by bootstrap 
that randomly chooses N observations with replacement from the original 
sample of N observations. 


Specifically, for each of the $b = 1, \dots, B$ bootstrap samples, we obtain 
a large tree and prediction $\hat{g}_b(\mathbf{x})$ and then use the average prediction 
$\hat{g}_{\text{bag}}(\mathbf{x}) = (1/B)\sum_{b=1}^{B} \hat{g}_b(\mathbf{x})$. Because sampling is with replacement, some 
observations will appear in the bootstrap sample multiple times, while others 
will not appear at all. The observations not in a bootstrap sample can be used 
as a test sample; this replaces CV. 
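A minimal Python sketch of bagging follows. To keep it short, each bootstrap sample is fit with a one-split tree at the median of x instead of a large tree; that simplification, the function names, and the data are all assumptions made for illustration.

```python
import random

def fit_stump(sample):
    """One-split tree: split at the median x and predict the mean y on each side."""
    xs = sorted(x for x, _ in sample)
    s = xs[len(xs) // 2]
    left = [y for x, y in sample if x <= s]
    right = [y for x, y in sample if x > s]
    overall = sum(y for _, y in sample) / len(sample)
    mean_left = sum(left) / len(left) if left else overall
    mean_right = sum(right) / len(right) if right else overall
    return lambda x0: mean_left if x0 <= s else mean_right

def bag(data, B=200, seed=10101):
    """Return the bagged predictor and the out-of-bag index set of each replication."""
    rng = random.Random(seed)
    n = len(data)
    predictors, oob_sets = [], []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # N draws with replacement
        predictors.append(fit_stump([data[i] for i in idx]))
        oob_sets.append(set(range(n)) - set(idx))    # observations never drawn
    def g_bag(x0):
        return sum(g(x0) for g in predictors) / B    # average of the B predictions
    return g_bag, oob_sets

# Made-up data: y switches from 0 to 1 halfway through the range of x
data = [(i / 10, 0.0 if i < 25 else 1.0) for i in range(50)]
g_bag, oob_sets = bag(data)
oob_share = sum(len(s) for s in oob_sets) / (len(oob_sets) * len(data))
```

On average a little over one-third of the observations are left out of each bootstrap sample, which is why the out-of-bag observations can serve as a test sample.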


28.6.5 Random forests 


The B bagging estimates will be correlated because the bootstrap samples 
have considerable overlap. This is especially the case for regression trees 
because if a regressor is especially important, it will appear near the top of 
the tree in every bootstrap sample. 


A random forest adjusts bagging for regression trees as follows: within 
each bootstrap sample, each time a split is considered, only a random 
sample of m < p predictors is used in deciding the next split. Compared 
with a single regression tree, this adds m as an additional tuning parameter; 
often, m is set to the first integer greater than $\sqrt{p}$. 
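The rule for m and the random choice of candidate predictors at a split can be sketched in Python. The function names are hypothetical, and reading "first integer greater than" as the integer part of the square root plus one is our assumption.

```python
import math
import random

def n_candidates(p):
    """First integer strictly greater than sqrt(p), for a positive integer p."""
    return math.isqrt(p) + 1

def candidate_predictors(p, rng):
    """Indexes of the m predictors considered at one split."""
    return sorted(rng.sample(range(p), n_candidates(p)))

rng = random.Random(10101)
m = n_candidates(19)                         # sqrt(19) is about 4.36, so m = 5
candidates = candidate_predictors(19, rng)   # a fresh draw is made at every split
```

Because a new subset is drawn at every split, an especially strong predictor is excluded from many splits, which lowers the correlation between trees.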


Random forests are related to kernel and k-nearest neighbors because 
they use a weighted average of nearby observations. Random forests can 
predict better because they have a data-driven way of determining which 
nearby observations get weight; see Lin and Jeon (2012). 


28.6.6 Boosting 


Boosting methods construct multiple predictions from reweighted data 
using the original sample, rather than by bootstrap resampling, and use as 
predictor a combination of these predictions. There are many boosting 
algorithms. 


A common boosting method for regression trees for continuous 
outcomes sequentially updates the initial tree by applying a regression tree 
to residuals obtained from the previous stage. Specifically, given the bth 
stage model with predictions $g^b(\mathbf{x})$, fit a decision tree $h^b(\mathbf{x})$ to residuals $\rho^b$, 
defined below, rather than to the outcome y. Then, update 
$g^{b+1}(\mathbf{x}) = g^b(\mathbf{x}) + \lambda h^b(\mathbf{x})$, where $\lambda$ is a penalty parameter, and update the 
residuals $\rho^{b+1} = \rho^b - \lambda h^b(\mathbf{x})$. The boosted prediction is 


$$\hat{g}_{\text{boost}}(\mathbf{x}) = (1/B)\sum_{b=1}^{B} \hat{g}^b(\mathbf{x}).$$


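The stage-by-stage update of predictions and residuals can be sketched in Python. The sketch uses one-split trees, starts from a zero prediction, and sets the penalty parameter to 0.1; these choices, the function names, and the data are assumptions for illustration and are not the defaults of any Stata command.

```python
def fit_stump(x, r):
    """Best single split of x (by sum of squared errors) fitted to residuals r."""
    best = None
    for s in sorted(set(x))[:-1]:          # exclude the maximum so both sides are nonempty
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    _, s, ml, mr = best
    return lambda x0: ml if x0 <= s else mr

def boost(x, y, B=100, lam=0.1):
    """Repeatedly fit a tree to the current residuals and update predictions."""
    g = [0.0] * len(y)      # stage-0 prediction (assumed to be zero)
    r = list(y)             # stage-0 residuals equal the outcome
    for _ in range(B):
        h = fit_stump(x, r)
        g = [gi + lam * h(xi) for gi, xi in zip(g, x)]   # update the prediction
        r = [ri - lam * h(xi) for ri, xi in zip(r, x)]   # update the residuals
    return g, r

# Made-up data with a nonlinear relationship
x = [float(i) for i in range(10)]
y = [xi ** 2 for xi in x]

g, r = boost(x, y)
sse_before = sum(yi ** 2 for yi in y)
sse_after = sum(ri ** 2 for ri in r)
```

A small value of the penalty parameter makes each stage learn slowly, so many stages are needed, but the resulting predictor is usually less prone to overfitting.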
28.6.7 Supervised learning for categorical outcome (classification) 


Digital license plate recognition provides an example of categorical 
classification. Given a digital image of a number or letter, we aim to 
correctly categorize it. 


With K categories, let y take values 1, 2, ..., K. The standard loss 
function used is the error rate that counts up the number of wrong 
classifications. Then, 

$$\text{Error rate} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(y_i \neq \hat{y}_i),$$

where the indicator function $\mathbf{1}(A) = 1$ if event A happens and equals 0 
otherwise. 


One method to predict y is to apply a standard parametric model for 
categorical data such as binary logit in the case of two categories, or more 
generally, multinomial logit, that yields predicted probabilities of being in 
each category. Then, allocate the ith observation to the category with the 
highest predicted probability. In the case of binary logit with outcome y 
taking values 1 or 2, we let $\hat{y}_i = 2$ if the predicted probability 
$\widehat{\Pr}(y_i = 2) > 0.5$ and let $\hat{y}_i = 1$ otherwise. 
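The allocation rule and the resulting error rate can be sketched in Python. The predicted probabilities and outcomes below are made up, and the function names are hypothetical; in practice the probabilities would come from a fitted model such as a logit.

```python
def classify(prob_2, threshold=0.5):
    """Assign category 2 if the predicted probability of y = 2 exceeds the threshold."""
    return [2 if p > threshold else 1 for p in prob_2]

def error_rate(y, yhat):
    """Share of observations whose predicted category differs from the actual one."""
    return sum(1 for actual, pred in zip(y, yhat) if actual != pred) / len(y)

# Made-up predicted probabilities of y = 2 and actual outcomes
prob_2 = [0.9, 0.2, 0.6, 0.4, 0.55]
y = [2, 1, 1, 1, 2]

yhat = classify(prob_2)     # third observation is misclassified
err = error_rate(y, yhat)   # one error in five observations
```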


The following methods are felt to lead to classification with lower error 
rate than methods based on directly modeling Pr(y = k|x). 


Discriminant analysis specifies a joint distribution for (y, x). For linear 
discriminant analysis with K categories, in the kth category, we suppose 
$\mathbf{x}|y = k \sim N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$ and define $\pi_k = \Pr(y = k)$. Then, we obtain an 
expression for $\Pr(y = k|\mathbf{x})$ from Bayes theorem, evaluate this at sample 
estimates for $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}$, and $\pi_k$, $k = 1, 2, \dots, K$, and assign the ith observation 
to category k with the largest estimated $\Pr(y_i = k|\mathbf{x}_i)$. The procedure is 
called linear discriminant analysis because the resulting classification rule 
can be shown to be a linear function of x. 


Quadratic discriminant analysis amends linear discriminant analysis by 
supposing $\mathbf{x}|y = k \sim N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, so additionally the variance of x varies 
across categories. The procedure is called quadratic discriminant analysis 
because the resulting classification rule can be shown to be a quadratic 
function of x. 


The preceding classifiers are restrictive because they define a boundary 
that is linear or quadratic. For example, if K = 2, then a linear classifier 
predicts y = 2 according to whether x’a > b for model-determined 
coefficients a and b. This rules out more flexible classifiers such as 
predicting y = 2 if x lies in a closed region and predicting y = 1 if x lies 
outside this closed region. A support vector machine allows such nonlinear 
boundaries; leading examples use what is called a polynomial kernel or a 
radial kernel. 


The discrim command includes linear, quadratic, and k-nearest- 
neighbor discriminant analysis. For support vector machines, see the 
community-contributed svmachines command (Guenther and 
Schonlau 2016). 


28.6.8 Unsupervised learning (cluster analysis) 


In unsupervised learning, there is no observed outcome y, only predictors x. 
The goal is to form one or more groups or clusters for which the predictors 
x take similar values. An example is determining several types of 
individual personalities on the basis of a range of psychometric measures. 


Principal components, introduced in section 28.5.1, provide one 
method. Then the first principal component defines the first group, the 
second principal component defines the second group, and so on. 


K-means clustering forms K distinct clusters for which the sum of 
within-cluster variation is minimized. A common measure of variation is 
squared Euclidean distance, in which case we choose clusters $C_1, \dots, C_K$ to 
minimize $\sum_{k=1}^{K} W(C_k)$, where 
$W(C_k) = (1/N_k)\sum_{i,i' \in C_k}\sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$ and $N_k$ is the number of 
observations in cluster $C_k$. 
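The within-cluster variation measure can be computed directly, as in the following Python sketch with made-up points and hypothetical function names. The sketch also checks the standard identity that this pairwise measure equals twice the sum of squared deviations from the cluster centroid, which is why K-means algorithms work with cluster means.

```python
def within_cluster_variation(cluster):
    """W(C_k): sum over all ordered pairs of squared Euclidean distances, divided by N_k."""
    n_k = len(cluster)
    total = 0.0
    for xi in cluster:
        for xj in cluster:
            total += sum((a - b) ** 2 for a, b in zip(xi, xj))
    return total / n_k

def twice_ss_about_centroid(cluster):
    """Two times the sum of squared deviations from the cluster centroid."""
    n_k, p = len(cluster), len(cluster[0])
    centroid = [sum(x[j] for x in cluster) / n_k for j in range(p)]
    return 2 * sum((x[j] - centroid[j]) ** 2 for x in cluster for j in range(p))

# Made-up cluster: the four corners of a square with side 2
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

w = within_cluster_variation(cluster)      # 64 / 4 = 16
w_check = twice_ss_about_centroid(cluster) # 2 * 8 = 16
```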


Hierarchical clustering methods are sequential methods that start with 
many clusters that are combined to form fewer clusters, or that start with 
one cluster and then split clusters to form more clusters, until an optimal 
number of clusters is obtained. 


The cluster kmeans command implements k-means clustering for 
continuous, binary, and mixed data using many different measures of 
distance. 


28.7 Prediction application 


We compare various methods of prediction using the chapter 3 data on the 
natural logarithm of health expenditures. Several of the methods illustrated 
employ, with little explanation, community-contributed programs whose use 
requires additional reading. 


28.7.1 Training and holdout samples 


The sample is split into two parts. A training sample is used to select and 
estimate the preferred predictor within a class of predictors, such as lasso. A 
test sample or hold-out sample of the remaining observations is then used to 
compare the out-of-sample predictive ability of these various predictors. 


The basic variables are 5 continuous variables and 14 binary variables. 
From these, we can create 188 interacted variables—20 from continuous 
variables and their own second-order interactions, 28 from the 14 binary 
variables, and 140 from the interactions of the binary and continuous 
variables. 
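The count of 188 can be reproduced with a few lines of Python. The breakdown assumes, as the counts above imply, that each binary variable contributes both its 0 and its 1 level; the variable names in the code are ours.

```python
from math import comb

n_cont, n_bin = 5, 14

# Continuous variables: 5 levels, 5 squares, and comb(5, 2) = 10 pairwise products
cont_terms = n_cont + n_cont + comb(n_cont, 2)
# Binary variables: two levels each
bin_terms = 2 * n_bin
# Each continuous variable interacted with each level of each binary variable
cross_terms = n_cont * 2 * n_bin

total_terms = cont_terms + bin_terms + cross_terms   # 20 + 28 + 140
```

Many of these terms are collinear with one another, so far fewer than 188 coefficients can be identified by OLS.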


. * Data for prediction example: 5 continuous and 14 binary variables 
. qui use mus203mepsmedexp, clear 


. keep if !missing(ltotexp) 
(109 observations deleted) 


. global xlist income educyr age famsze totchr 


. global dlist suppins female white hisp marry northe mwest south 
> msa phylim actlim injury priolist hvgg 


. global rlist c.($xlist)##c.($xlist) i.($dlist) c.($xlist)#i.($dlist) 


The sample is split into a training and fitting sample that uses 80% of the 
observations (train==1) and a holdout sample of 20% of the observations 
(train==0). 


. splitsample ltotexp, generate(train) split(1 4) values(0 1) rseed(10101) 
. tabulate train 


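      train |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        591       20.00       20.00
          1 |      2,364       80.00      100.00
------------+-----------------------------------
      Total |      2,955      100.00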


Note that here the sample is completely observed because a preceding 
command dropped observations with missing values of ltotexp, the only 
variable with missing values. So including a list of variables such as ltotexp 
in the splitsample command is actually unnecessary. 


28.7.2 Various predictors 


We obtain estimates using only the training sample (train==1). The 
subsequent predict command provides predictions for the entire sample, so 
it provides predictions for both train==1 and train==0. 


The first predictor is obtained by OLS regression of ltotexp on the 19 
basic variables using the training sample. Most variables are statistically 
significant at the 5% level. 


. * OLS with 19 regressors 


. regress ltotexp $xlist $dlist if train==1, noheader vce(robust) 
------------------------------------------------------------------------------
             |               Robust
     ltotexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      income |   .0010653   .0010664     1.00   0.318    -.0010259    .0031565
      educyr |   .0431495   .0081645     5.29   0.000      .027139    .0591599
         age |   .0025177   .0040582     0.62   0.535    -.0054403    .0104757
      famsze |  -.0635828   .0285771    -2.22   0.026    -.1196218   -.0075437
      totchr |   .3220218   .0208646    15.43   0.000     .2811068    .3629368
     suppins |   .1547863   .0523682     2.96   0.003     .0520934    .2574791
      female |  -.0643839    .052321    -1.23   0.219    -.1669842    .0382164
       white |   .1773761   .1474569     1.20   0.229    -.1117833    .4665356
        hisp |  -.1031283   .1030525    -1.00   0.317    -.3052118    .0989552
       marry |   .1491644   .0571793     2.61   0.009     .0370372    .2612917
      northe |   .2805731   .0794206     3.53   0.000     .1248312     .436315
       mwest |   .3296948   .0760097     4.34   0.000     .1806417     .478748
       south |   .1997139   .0670176     2.98   0.003      .068294    .3311338
         msa |   .0677191   .0572256     1.18   0.237     -.044499    .1799372
      phylim |   .2661041   .0627222     4.24   0.000     .1431074    .3891008
      actlim |     .39576   .0698797     5.66   0.000     .2587277    .5327924
      injury |   .1305469   .0607895     2.15   0.032     .0113402    .2497537
    priolist |   .3835745    .077633     4.94   0.000     .2313381     .535811
        hvgg |  -.0965534   .0505962    -1.91   0.056    -.1957713    .0026646
       _cons |   5.823748   .3754025    15.51   0.000     5.087593    6.559903
------------------------------------------------------------------------------


. qui predict y_small 


A second predictor fits an OLS regression on the full set of 188 interacted 
variables. 


. * OLS with 188 potential regressors and 104 estimated 
. qui regress ltotexp $rlist if train==1 


. qui predict y_full 


From suppressed output, the coefficients of 104 of the 188 variables are 
identified. 


The third and fourth predictors we consider are penalized and 
postselection estimates of the coefficients from a lasso with tuning parameter 
$\lambda$ determined by adaptive 10-fold CV. Note that for each of the 10 folds, this 
uses 90% of the 2,364 observations in the training sample for fitting and the 
remaining 10% for determining the best value of $\lambda$ based on predictive 
ability. 
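The 90%/10% division in 10-fold CV can be illustrated with a short Python sketch that randomly assigns the 2,364 training observations to folds. This is not the fold-assignment algorithm used by the lasso command; the function name and the seed are assumptions.

```python
import random

def kfold_indices(n, k=10, seed=10101):
    """Randomly partition observation indexes 0..n-1 into k folds of near-equal size."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]

folds = kfold_indices(2364)
sizes = [len(f) for f in folds]   # four folds of 237 and six folds of 236

# For each fold in turn: fit on the other nine folds, predict on this one
fit_shares = [(2364 - size) / 2364 for size in sizes]
```

Each observation is used for out-of-sample prediction exactly once, and the CV criterion for a given penalty value combines the prediction errors across the 10 folds.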


. * LASSO with 188 potential regressors leads to 32 selected 
. qui lasso linear ltotexp $rlist if train==1, selection(adaptive) 
>     rseed(10101) nolog 


. lassoknots 

-------------------------------------------------------------------------------
       |              No. of   CV mean |
       |             nonzero     pred. | Variables (A)dded, (R)emoved,
    ID |    lambda     coef.     error |      or left (U)nchanged
-------+-------------------------------+---------------------------------------
    51 |  17.76327         1   1.76889 | A totchr
    59 |  8.438993         2   1.55272 | A 0.actlim
    66 |  4.400098         3  1.473914 | A 1.priolist#c.educyr
    71 |   2.76339         4  1.430335 | A 0.phylim#c.famsze
    73 |  2.294215         6  1.415289 | A 1.marry#c.educyr
       |                               |   1.suppins#c.age
    78 |  1.440834         7  1.386716 | A 0.hvgg#c.totchr
    80 |  1.196205         9  1.380092 | A 1.mwest#c.totchr
       |                               |   1.injury#c.educyr
    84 |   .824498        10  1.369338 | A 1.mwest#c.famsze
    85 |  .7512519        11  1.367485 | A 0.female#c.totchr
    87 |  .6237025        12  1.364392 | A 0.priolist#c.totchr
    89 |  .5178088        13  1.361144 | A 0.marry#c.totchr
    90 |  .4718081        14  1.359738 | A 1.northe#c.educyr
    91 |  .4298939        15   1.35839 | A 0.actlim#c.totchr
    92 |  .3917033        16  1.356668 | A 0.priolist#c.famsze
    95 |  .2963092        17  1.352067 | A 1.south#c.educyr
    96 |  .2699859        18  1.350489 | A 0.white#c.famsze
    99 |  .2042345        20  1.346719 | A 1.female#c.income
       |                               |   1.phylim#c.educyr
   100 |  .1860908        21  1.346044 | A 0.actlim#c.famsze
   101 |   .169559        23  1.345632 | A 1.actlim#c.famsze
       |                               |   1.northe#c.totchr
   103 |  .1407709        25  1.344879 | A 0.south#c.famsze
       |                               |   0.injury#c.totchr
   104 |  .1282652        26  1.344431 | A 0.suppins#c.income
   105 |  .1168705        27  1.344094 | A 1.hvgg#c.educyr
   106 |   .106488        28  1.343763 | A 0.priolist
   107 |  .0970279        29  1.343447 | A 1.hisp#c.income
   108 |  .0884082        30  1.343113 | A 0.suppins#c.totchr
   110 |  .0733981        31  1.342763 | A 1.mwest#c.income
   112 |  .0609364        32  1.341704 | A 1.msa#c.educyr
 * 120 |  .0289497        32  1.339496 | U
   121 |  .0263779        33  1.339525 | A 1.hvgg#c.famsze
   128 |  .0137535        34  1.340677 | A 0.suppins#c.famsze
   130 |  .0114184        35  1.341051 | A 1.actlim#c.income
   132 |  .0094797        36  1.341602 | A 0.msa
   135 |  .0071711        38  1.342449 | A 1.hisp#c.age
       |                               |   1.south#c.totchr
   136 |   .006534        37  1.342685 | R 1.msa#c.educyr
   138 |  .0054246        36  1.343111 | R 0.actlim#c.famsze
   139 |  .0049427        37  1.343368 | A 1.mwest#c.age
   143 |  .0034068        39   1.34451 | A 1.msa#c.educyr
       |                               |   1.northe#c.income
   144 |  .0031042        38  1.344751 | R totchr
   145 |  .0028284        39  1.344983 | A 0.actlim#c.famsze
   147 |  .0023482        40  1.345372 | A totchr
   149 |  .0019495        40  1.345694 | U
-------------------------------------------------------------------------------

* lambda selected by cross-validation in final adaptive step. 
. qui predict y_laspen // Use penalized coefficients 


. qui predict y_laspost, postselection // Use post selection OLS coeffs 


The adaptive lasso leads to 32 selected variables, and many are interactions 
such as 0.actlim#c.totchr, the number of chronic conditions for 
individuals without an activity limitation. We then calculate two predictors. 
The first uses the penalized coefficients that are the lasso coefficient 
estimates. The second uses coefficients obtained by OLS regression on the 32 
regressors selected by the lasso. 


By comparison, the option selection(cv) leads to 42 selected variables, 
and the option selection(bic) leads to 17 selected variables. 


As a clustered example, for illustrative purposes, we add the option 
cluster(age). Then CV selects 46 variables, adaptive lasso selects 34 
variables, and BIC selects 0 variables. 


A fifth predictor regresses ltotexp on the first 5 principal components of 
the 19 underlying variables. 


. * Principal components using the first 5 principal components of 19 variables 
. qui pca $xlist $dlist if train==1 


. qui predict pc* 

. qui regress ltotexp pc1-pc5 if train==1 


. qui predict y_pca 


A sixth predictor is a neural network on the 19 underlying variables, 
computed using the community-contributed brain command (Doherr 2018). 
The option hidden(10 10), for example, specifies 2 hidden layers with 10 
hidden units in each layer. We fit a neural network with just 1 hidden layer 
and 10 units in that layer. Using program defaults, we obtain 


. * Neural network with 19 variables and 1 hidden layer with 10 units 
. brain define, input($xlist $dlist) output(ltotexp) hidden(10) 
Defined matrices: 
input [4,19] 
output [4,1] 
neuron[1,30] 
layer [1,3] 
brain[1,211] 


. qui brain train if train==1, iter(500) eta(2) 


. brain think y_neural 


A seventh predictor is a random forest. The community-contributed 
rforest command (Schonlau and Zou 2020) estimates random forests for 
regression and classification. The type(reg) option specifies that the tree is 
for regression and not for classification; the depth(10) option limits the 
depth of the tree to be no more than 10; and the lsize(5) option sets the 
minimum number of observations per leaf to 5. Other options are set at their 
default values. This includes setting numvars(), the number of variables 
randomly selected for each tree, to the square root of the number of 
predictors, here 5 because it is the first integer to exceed $\sqrt{19}$. 


. * Random forest with 19 variables 
. qui rforest ltotexp $xlist $dlist if train==1, 
> type(reg) iter(200) depth(10) lsize(5) 


. qui predict y_ranfor 


Finally, the community-contributed boost command (Schonlau 2005) 
accommodates boosting and bagging regression trees for linear, logistic, and 
Poisson regression. The command is a C++ plugin that must first be loaded 
into Stata. For details on the method and program, see Schonlau (2005). 


We use the program defaults for boosted regression trees. 


. * Boosting linear regression with 19 variables 
. program boost_plugin, plugin using("C:\ado\personal\boost64.dll") 


. qui boost ltotexp $xlist $dlist if train==1, 
>     distribution(normal) trainfraction(0.8) maxiter(100) predict(y_boost) 


28.7.3 Comparison of predictors 


We compare the prediction performance of the various predictors both in 
sample and out of sample. 


. * Training MSE and test MSE for the various methods 
. qui regress ltotexp 


. qui predict y_noreg 


. foreach var of varlist y_noreg y_small y_full y_laspen y_laspost y_pca 
>     y_neural y_ranfor y_boost { 
  2.   qui gen `var'errorsq = (`var' - ltotexp)^2 
  3.   qui sum `var'errorsq if train == 1 
  4.   scalar mse`var'train = r(mean) 
  5.   qui sum `var'errorsq if train == 0 
  6.   qui scalar mse`var'test = r(mean) 
  7.   display "Predictor: " "`var'" _col(21) 
>     "Train MSE = " %5.3f mse`var'train "  Test MSE = " %5.3f mse`var'test 
  8. } 
Predictor: y_noreg Train MSE = 1.821 Test MSE = 2.063 
Predictor: y_small Train MSE = 1.339 Test MSE = 1.492 
Predictor: y_full Train MSE = 1.262 Test MSE = 1.509 
Predictor: y_laspen Train MSE = 1.298 Test MSE = 1.491 
Predictor: y_laspost Train MSE = 1.297 Test MSE = 1.493 
Predictor: y_pca Train MSE = 1.397 Test MSE = 1.545 
Predictor: y_neural Train MSE = 1.211 Test MSE = 1.808 
Predictor: y_ranfor Train MSE = 1.047 Test MSE = 1.574 
Predictor: y_boost Train MSE = 1.459 Test MSE = 1.664 
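
The calculation performed inside the loop is simply an average of squared prediction errors within each subsample. A minimal Python sketch with made-up numbers and a hypothetical function name shows the same calculation:

```python
def mse_by_sample(y, yhat, train):
    """Return (training MSE, test MSE) given a 0/1 training-sample indicator."""
    def mse(flag):
        errs = [(pred - actual) ** 2
                for actual, pred, t in zip(y, yhat, train) if t == flag]
        return sum(errs) / len(errs)
    return mse(1), mse(0)

# Made-up outcomes, predictions, and training indicator
y = [1.0, 2.0, 3.0, 4.0, 5.0]
yhat = [1.5, 2.0, 2.0, 5.0, 3.0]
train = [1, 1, 1, 0, 0]

train_mse, test_mse = mse_by_sample(y, yhat, train)
# Training errors squared: 0.25, 0, 1; test errors squared: 1, 4
```

Only the test MSE measures out-of-sample performance, because the training observations were used to fit the predictor.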


The in-sample MSE is smallest for the most flexible models, notably, 
random forests and neural networks. 


By comparison, the out-of-sample MSE in this example is lowest for the 
simpler models, notably, OLS with just 19 regressors and the lasso estimators. 


The results here for neural networks, random forest, and boosting are 
based mainly on use of default options. More careful determination of tuning 
parameters for these methods could be expected to improve their predictive 
ability. 


28.8 Machine learning for inference in partial linear model 


Machine learning methods can lead to better prediction than the regression 
methods historically employed in applied microeconometrics. But much 
microeconometric research is instead aimed at estimating the partial effect of 
a single variable, or estimation of one or a few parameters, after controlling 
for the effect of many other variables. 


Machine learning methods have the potential to control for these other 
nuisance variables, but any consequent statistical inference on the 
parameters or partial effects of interest needs to control for the data mining 
of machine learning. As noted in section 28.3.4, we cannot directly perform 
inference on lasso and related estimators; we cannot naively use the lasso. 


Instead, a semiparametric approach is taken where a model depends in 
part on parameters of interest and in part on “nuisance” functions of other 
variables. If estimation is based on a moment condition that satisfies an 
orthogonalization property defined in section 28.8.8, then inference on the 
parameters of interest may be possible. 


The leading example to date is the estimation of parameters in a partial 
linear model, using lasso under the sparsity assumption that only a few of the 
many potential control variables are relevant. We focus on this case and the 
associated Stata commands introduced in version 16. 


28.8.1 Partial effects in the partial linear model 


We consider a setting where interest lies in measuring the partial effect on y 
of a change in variables d, controlling for additional control variables. 


A partial linear model for linear regression specifies 


$$y = \mathbf{d}'\boldsymbol{\alpha} + g(\mathbf{x}_c) + u$$


where $\mathbf{x}_c$ denotes selected control variables and $g(\cdot)$ is a flexible function of 
$\mathbf{x}_c$. The parameter $\boldsymbol{\alpha}$ can be given a causal interpretation with the 
selection-on-observables-only assumption that $E(u|\mathbf{d}, \mathbf{x}_c) = 0$. The goal is to 
obtain a root-N consistent and asymptotically normal estimator of the partial 
effect $\boldsymbol{\alpha}$. 


The partial linear model was introduced in section 27.6. There $g(\cdot)$ was 
unspecified, and estimation was by semiparametric methods that required 
that there be few controls $\mathbf{x}_c$ to avoid the curse of dimensionality. 


Lasso methods due to Belloni, Chernozhukov, and Hansen (2014) and 
related articles instead allow for complexity in $g(\cdot)$ by specifying 
$g(\mathbf{x}_c) = \mathbf{x}'\boldsymbol{\gamma} + r$, where x consists of $\mathbf{x}_c$ and flexible transformations of $\mathbf{x}_c$ 
such as polynomials and interactions and r is an approximation error. The 
starting point is then that 


$$y = \mathbf{d}'\boldsymbol{\alpha} + \mathbf{x}'\boldsymbol{\gamma} + r + u \qquad (28.4)$$


The lasso is used in creative ways, detailed in this section, to select a 
small subset of the variables in the high-dimensional x and to construct 
regressors in a consequent regression that yields an estimate of $\boldsymbol{\alpha}$ that is 
root-N consistent and asymptotically normal, despite the data mining. 


The estimate of $\boldsymbol{\alpha}$ is often called a causal estimate because if the model 
is well specified with a good set of controls, then an assumption of selection 
on observables only may be reasonable. But the method is also applicable 
without a causal interpretation. 


A key assumption, called the sparsity assumption, is that only a small 
fraction of the x variables are relevant. Let p be the number of potential 
control variables (x) and s be the number of variables in the true model, 
where p and s may grow with N, though at rates considerably less than N. 
The precise sparsity assumption varies with the model and estimation 
method. For the partialing-out estimator, for example, the sparsity 
assumption is that $s/(\sqrt{N}/\ln p)$ is small. Additionally, the approximation 
error r is assumed to satisfy $\sqrt{(1/N)\sum_{i=1}^{N} r_i^2} \le c\sqrt{s/N}$ for some $c > 0$. 




In this section, we present Stata commands that yield three different 
lasso-based estimators of $\boldsymbol{\alpha}$. 


28.8.2 Partial linear model application 


We consider the same example as in section 28.7, with the change that we 
are interested in estimating the partial effect of having supplementary health 
insurance, so d in (28.4) is the single binary variable suppins. 


. * Data for inference on suppins example: 5 continuous and 13 binary variables 
. qui use mus203mepsmedexp, clear 


. keep if ltotexp !=. 
(109 observations deleted) 


. global xlist2 income educyr age famsze totchr 


. global dlist2 female white hisp marry northe mwest south 
> msa phylim actlim injury priolist hvgg 


. global rlist2 c.($xlist2)##c.($xlist2) i.($dlist2) c.($xlist2)#i.($dlist2) 


For later comparison, we fit by OLS models without and with all the 
interactions terms. 


. * OLS on small model and full model 
. qui regress ltotexp suppins $xlist2 $dlist2, vce(robust) 


. estimates store OLSSMALL 

. qui regress ltotexp suppins $rlist2, vce(robust) 

. estimates store OLSFULL 

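. estimates table OLSSMALL OLSFULL, keep(suppins) b(%9.4f) se stats(N df_m r2) 

----------------------------------------
    Variable |  OLSSMALL     OLSFULL
-------------+--------------------------
     suppins |    0.1706      0.1868
             |    0.0469      0.0478
-------------+--------------------------
           N |      2955        2955
        df_m |   19.0000     99.0000
          r2 |    0.2682      0.3028
----------------------------------------
                            Legend: b/se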


In this example, there is little change in the coefficient of suppins going 
from a model with 19 regressors to a model with 99 identified regressors, 
and the gain in $R^2$ is modest. Supplementary insurance is associated with a 
17%-19% increase in health spending, and the estimates are highly 
statistically significant. 
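The 17%-19% figure reads the coefficients of the log-linear model directly as proportionate changes. A short Python calculation, offered only as a rough check, applies the exact transformation exp(b) - 1 to the two point estimates; it ignores sampling error and any retransformation issues.

```python
import math

# Coefficients of suppins from the table above
coefs = {"OLSSMALL": 0.1706, "OLSFULL": 0.1868}

# Exact proportionate change implied by a coefficient b in a log-linear model
pct_change = {name: 100 * (math.exp(b) - 1) for name, b in coefs.items()}
# About 18.6 for OLSSMALL and about 20.5 for OLSFULL
```

The exact transformation gives somewhat larger effects than the direct reading, as is always the case for positive coefficients.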


28.8.3 Partialing-out estimator 


The partialing-out method is obtained in several steps. First, for scalar 
regressor d, perform a lasso of d on x, and obtain a residual $\hat{u}_d$ from OLS 
regression of d on the selected variables. Second, perform a lasso of y on x, 
and obtain a residual $\hat{u}_y$ from OLS regression of y on the selected variables. 
Finally, obtain $\hat{\alpha}$ by OLS regression of $\hat{u}_y$ on $\hat{u}_d$. 
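The residual-on-residual logic of the final step can be sketched in Python. The sketch skips the lasso selection steps entirely and assumes a single control x has already been selected; the data are made up and noiseless so that the true coefficient of d, 0.5, is recovered exactly. All function names are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)

def residuals(y, x):
    """Residuals from OLS of y on a constant and a single control x."""
    xbar, ybar = mean(x), mean(y)
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return [yi - ybar - slope * (xi - xbar) for xi, yi in zip(x, y)]

def slope_through_origin(y, x):
    """OLS slope from regressing y on x without a constant."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Made-up data: y = 1 + 0.5*d + 2*x with no error term
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
d = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
y = [1.0 + 0.5 * di + 2.0 * xi for di, xi in zip(d, x)]

u_d = residuals(d, x)   # part of d not explained by the control
u_y = residuals(y, x)   # part of y not explained by the control
alpha_hat = slope_through_origin(u_y, u_d)
```

Because both residual series have mean zero, omitting the constant in the last regression does not change the slope. In the lasso version, the controls used to form each residual are those selected by the corresponding lasso.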


More generally, if there are K key regressors of interest, then perform K 
separate lassos of each $d_j$ on x and K subsequent OLS regressions on the 
selected variables to obtain K separate residuals. The estimates $\hat{\alpha}_1, \dots, \hat{\alpha}_K$ 
are obtained by OLS regression of $\hat{u}_y$ on all K residuals. 


The partialing-out method is qualitatively similar to the Robinson 
differencing estimator presented in section 27.6, which instead used 
residuals from kernel regression. The method here requires a sparsity 
assumption that the number of nonzero coefficients in the true model is 
small relative to the sample size N and grows at a rate no more than $\sqrt{N}$. 
More precisely, $s/(\sqrt{N}/\ln p)$ should be small, where p is the number of 
potential control variables and s is the number of variables in the true model. 


The preceding algorithm for the partialing-out estimator does not extend 
to nonlinear models. Instead, the semi option provides a variation, termed 
“semipartialing-out”. First, $\hat{u}_d$ is obtained as for partialing out. Second, 
perform a lasso of y on d and x, and obtain a residual $\hat{u}_y$ from OLS regression 
of y on d and the selected x variables. Finally, obtain $\hat{\alpha}$ by IV regression of 
$\hat{u}_y$ on d with “instruments” $\hat{u}_d$. For further details, see the Methods and 
formulas section of [LASSO] poregress. 


28.8.4 The poregress command and related commands 


Partialing-out lasso estimates can be obtained using the poregress 
command, which has syntax 


poregress depvar varsofinterest [if] [in], controls([(alwaysvars)] othervars) [options]


The control variables are specified with the option
controls([(alwaysvars)] othervars), where alwaysvars are controls to always
be included and othervars are variables selected or deselected by the lasso.
The penalty $\lambda$ can be determined by a plugin formula (the default
option selection(plugin)), by CV (option selection(cv)), by adaptive CV
(option selection(adaptive)), or by BIC (option selection(bic)). For CV
methods, the rseed(#) option should be used.
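
As a minimal sketch of this syntax, the following code applies poregress to
simulated data. The data-generating process and the variable names y, d, and
x1-x20 are our own illustrative assumptions and are not part of the
application in this chapter; selection(bic) assumes Stata 17 or later.

```stata
* Simulated data: 20 candidate controls, of which only x1-x3 matter
clear
set seed 10101
set obs 500
forvalues j = 1/20 {
    generate x`j' = rnormal()
}
generate d = 0.5*x1 + 0.5*x2 + rnormal()
generate y = 1*d + x1 + x3 + rnormal()

* Default: penalty set by the plugin formula
poregress y d, controls(x1-x20)

* Penalty chosen by CV; rseed() makes the random folds reproducible
poregress y d, controls(x1-x20) selection(cv) rseed(10101)

* x1 is always included; x2-x20 are selected using a BIC-chosen penalty
poregress y d, controls((x1) x2-x20) selection(bic)

* Semipartialing-out variant
poregress y d, controls(x1-x20) semi
```

In each case, the reported coefficient of d can be compared with the value 1
used to generate the data.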


The vce(cluster clustvar) option provides cluster-robust standard
errors. Additionally, if the selection(cv) or selection(adaptive) option
is used, then the CV determines folds at the cluster level, and the lasso
objective functions use the average of within-cluster averages.


The following related commands have similar syntax though use 
different algorithms: xporegress and dsregress for the linear model; 
pologit, xpologit, and dslogit for binary outcomes; popoisson, 
xpopoisson, and dspoisson for count data; and poivregress and 
xpoivregress for IV estimation in the linear model.


The lassoinfo, lassoknots, cvplot, bicplot, lassocoef, and
coefpath commands provide additional information on the fitted models. 


28.8.5 Plugin penalty parameter 


The default is to use the plugin formula for $\lambda$, developed for use of
the lasso in the current inference setting, rather than for prediction. The
plugin value of $\lambda$ leads to selection of fewer variables than the other
methods. A good
exposition of the plugin formula, and relevant references, is given in Ahrens, 
Hansen, and Schaffer (2018). 


For the linear model with independent heteroskedastic errors, Stata sets
the penalty parameter $\lambda = (c/\sqrt{N})\,\Phi^{-1}\{1 - \gamma/(2p)\}$,
where $c = 1.1$ and $\gamma = 0.1/\ln\{\max(p, N)\}$. Several studies find
that these are good values for $c$ and $\gamma$. The individual loadings for
each regressor are
$\kappa_j = \sqrt{(1/N)\sum_{i=1}^{N}(x_{ij}\widehat{\varepsilon}_i)^2}$,
where $x_j$ has been normalized to have mean 0 and variance 1 and
$\widehat{\varepsilon}_i$ is a residual obtained by a sequence of first-stage
lassos that is detailed in the Stata documentation. These settings are based
on linear regression with heteroskedastic errors. The option
selection(plugin, homoskedastic) is used for homoskedastic errors.


If the option vce(cluster clustvar) is used, then the plugin value is
the same as for heteroskedastic errors. 
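
The plugin value can be checked by hand. The sketch below assumes that the
formula takes the form $\lambda = (c/\sqrt{N})\,\Phi^{-1}\{1 - \gamma/(2p)\}$
with $c = 1.1$ and $\gamma = 0.1/\ln\{\max(p, N)\}$, and it uses $N = 2{,}955$
and $p = 176$ from the application in section 28.8.6. The scalar names are
hypothetical.

```stata
* Plugin lambda computed by hand (scalar names are illustrative)
scalar nobs  = 2955                      // sample size N
scalar pctrl = 176                       // number of potential controls p
scalar cplug = 1.1                       // constant c
scalar gam   = 0.1/ln(max(pctrl, nobs))  // gamma
scalar lam   = (cplug/sqrt(nobs))*invnormal(1 - gam/(2*pctrl))
display "gamma = " %8.6f gam "    lambda = " %8.6f lam
```

The result should be close to 0.08, which can be compared with the value of
lambda that lassoinfo reports after poregress in section 28.8.6.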


28.8.6 Partialing-out application 


In the current application, the partialing-out lasso selects 21 variables and 
yields coefficient and standard error of suppins quite similar to the OLS 
results. 


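. * Partialing-out partial linear model using default plugin lambda
. poregress ltotexp suppins, controls($rlist2)

Estimating lasso for ltotexp using plugin
Estimating lasso for suppins using plugin

Partialing-out linear model          Number of obs               =      2,955
                                     Number of controls          =        176
                                     Number of selected controls =         21
                                     Wald chi2(1)                =      15.43
                                     Prob > chi2                 =     0.0001

------------------------------------------------------------------------------
             |               Robust
     ltotexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     suppins |   .1839193   .0468223     3.93   0.000     .0921493    .2756892
------------------------------------------------------------------------------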


Note: Chi-squared test is a Wald test of the coefficients of the variables 
of interest jointly equal to zero. Lassos select controls for model 
estimation. Type lassoinfo to see number of selected variables in each 
lasso. 


The lassoinfo command lists the number of variables selected by the 
separate lassos for the dependent variable and the single regressor of interest. 


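. * Lasso information
. lassoinfo

    Estimate: active
     Command: poregress

------------------------------------------------------
             |                                  No. of
             |           Selection            selected
    Variable |    Model     method    lambda variables
-------------+----------------------------------------
     ltotexp |   linear     plugin   .080387        12
     suppins |   linear     plugin   .080387         9
------------------------------------------------------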


In total, 21 variables were selected, and this exactly equals the sum of 
variables selected by the two distinct lassos (12 + 9 = 21). So this example 
is unusual in that the two sets of selected variables are disjoint. 


Because the lasso is applied to more than one variable, several of the
postestimation commands need to name the variable of interest. For the lasso
for the dependent variable, relevant commands are lassoknots, for(ltotexp);
lassocoef (., for(ltotexp)); bicplot, for(ltotexp); cvplot, for(ltotexp);
and coefpath, for(ltotexp). The last two commands are applicable only if
selection is by CV. For variable suppins, use instead for(suppins).
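
The for() option can be illustrated on simulated data. In this sketch, the
data-generating process and the names y, d, and x1-x20 are hypothetical, and
the two-specification form of lassocoef is used to place the variables
selected by the two lassos side by side.

```stata
* Simulated data (illustrative only)
clear
set seed 10101
set obs 500
forvalues j = 1/20 {
    generate x`j' = rnormal()
}
generate d = 0.5*x1 + 0.5*x2 + rnormal()
generate y = 1*d + x1 + x3 + rnormal()

quietly poregress y d, controls(x1-x20)

* Number of variables selected by each lasso
lassoinfo

* Variables selected by the lasso for y and by the lasso for d
lassocoef (., for(y)) (., for(d))
```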


The partialing-out estimated coefficient of suppins can also be obtained 
by manually performing each step of the algorithm. We have 


. * Partialing out done manually
. qui lasso linear suppins $rlist2, selection(plugin)
. qui predict suppins_lasso, postselection
. qui generate u_suppins = suppins - suppins_lasso
. qui lasso linear ltotexp $rlist2, selection(plugin)
. qui predict ltotexp_lasso, postselection
. qui generate u_ltotexp = ltotexp - ltotexp_lasso
. regress u_ltotexp u_suppins, vce(robust) noconstant noheader

------------------------------------------------------------------------------
             |               Robust
   u_ltotexp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   u_suppins |   .1839193   .0468223     3.93   0.000     .0921117    .2757268
------------------------------------------------------------------------------


The postselection option of the lasso predict postestimation command is 
used here because it predicts by OLS regression on the variables selected by 
the preceding lasso command. 


28.8.7 Clustered errors application 


As an example of clustered data, we refit the previous model with the 
vce(cluster age) option. Using the default plugin method to determine the 
penalty, we obtain 


. * Cluster-robust partialing-out partial linear model using default plugin lambda 
. poregress ltotexp suppins, controls($rlist2) vce(cluster age) 


Estimating lasso for ltotexp using plugin 
Estimating lasso for suppins using plugin 


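Partialing-out linear model          Number of obs               =      2,955
                                     Number of controls          =        176
                                     Number of selected controls =         15
                                     Wald chi2(1)                =       8.57
                                     Prob > chi2                 =     0.0034

                                   (Std. err. adjusted for 26 clusters in age)
------------------------------------------------------------------------------
             |               Robust
     ltotexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     suppins |   .1686531   .0576049     2.93   0.003     .0557496    .2815566
------------------------------------------------------------------------------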


Note: Chi-squared test is a Wald test of the coefficients of the variables 
of interest jointly equal to zero. Lassos select controls for model 
estimation. Type lassoinfo to see number of selected variables in each 
lasso. 

Note: Lassos are performed accounting for clusters in age. 


The estimated coefficient is then 0.1687 with cluster-robust standard error
0.0576.


. * Lasso information 
. lassoinfo 


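    Estimate: active
     Command: poregress

------------------------------------------------------
             |                                  No. of
             |           Selection            selected
    Variable |    Model     method    lambda variables
-------------+----------------------------------------
     ltotexp |   linear     plugin  .8343593         8
     suppins |   linear     plugin  .8343593         7
------------------------------------------------------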


In total, 15 variables were selected, with 8 of these variables coinciding with
the 21 selected in the preceding independence case; 60 variables were
selected using CV; and 41 were selected using adaptive CV.


28.8.8 Orthogonalization 


The partialing-out estimator of $\alpha$ in the partial linear model is a
two-step estimator. Unlike many two-step estimators, the asymptotic
distribution of the second-step estimator of $\alpha$ is not changed by the
first-step estimation. This happens because the second-step estimation is
based on a moment condition that satisfies a special orthogonalization
condition.


Define $\alpha$ as parameters of interest and $\eta$ as nuisance parameters,
and consider a two-step estimator that at the first step estimates $\eta$ and
at the second step estimates $\alpha$ by solving

$$\sum_{i=1}^{N} \psi(\mathbf{w}_i, \alpha, \widehat{\eta}) = 0$$

where $\mathbf{w}_i$ denotes all variables. Then the asymptotic distribution
of $\widehat{\alpha}$ is unaffected by first-step estimation of $\eta$ if the
function $\psi(\cdot)$ satisfies the orthogonalization condition that

$$E\{\partial\psi(\mathbf{w}_i, \alpha, \eta)/\partial\eta\} = 0 \tag{28.5}$$


See Cameron and Trivedi (2005, 201) or Wooldridge (2010, 410). The
intuition is that if changing $\eta$ does not in expectation change
$\psi(\cdot)$, then noise in $\widehat{\eta}$ will not affect the distribution
of $\widehat{\alpha}$, at least asymptotically.


Now consider the partial linear model $y = \alpha d + g(\mathbf{x}) + u$, and
define $\eta_1 = E(d|\mathbf{x})$ and $\eta_2 = E(y|\mathbf{x})$. Estimates
$\widehat{\eta}_1$ (and $\widehat{\eta}_2$) are obtained by OLS regression of
$d$ (and $y$) on the components of $\mathbf{x}$ selected by the lasso. The
partialing-out estimator of $\alpha$ is then obtained by OLS regression of
$(y - \widehat{\eta}_2)$ on $(d - \widehat{\eta}_1)$. This corresponds to
solving the population moment condition
$E\{\psi(\mathbf{w}, \alpha, \eta_1, \eta_2)\} = 0$, where
$\psi(\mathbf{w}, \alpha, \eta_1, \eta_2) = (d - \eta_1)\{(y - \eta_2) - \alpha(d - \eta_1)\}$.
To see this, recall that the OLS estimator for regression of $y$ on scalar $x$
solves $\sum_i x_i \widehat{u}_i = \sum_i x_i (y_i - \widehat{\beta} x_i) = 0$
with corresponding population moment condition $E\{x(y - \beta x)\} = 0$.


The orthogonalization condition (28.5) is satisfied because 


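$$E\{\partial\psi(\mathbf{w}, \alpha, \eta_1, \eta_2)/\partial\eta_1\} = E\{-(y - \eta_2) + 2\alpha(d - \eta_1)\}$$
$$E\{\partial\psi(\mathbf{w}, \alpha, \eta_1, \eta_2)/\partial\eta_2\} = E\{-(d - \eta_1)\}$$

and these expectations equal zero using $\eta_1 = E(d|\mathbf{x})$ and
$\eta_2 = E(y|\mathbf{x})$.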
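
The final step can be spelled out using the law of iterated expectations.
The following lines are a short worked version of that argument, writing
$\eta_1 = E(d|\mathbf{x})$ and $\eta_2 = E(y|\mathbf{x})$ and using only the
definitions given in this section.

```latex
\begin{align*}
E\{\partial\psi/\partial\eta_1\}
  &= -E\{y - \eta_2\} + 2\alpha\,E\{d - \eta_1\} \\
  &= -E\bigl[E\{y - E(y|\mathbf{x}) \mid \mathbf{x}\}\bigr]
     + 2\alpha\,E\bigl[E\{d - E(d|\mathbf{x}) \mid \mathbf{x}\}\bigr]
   = 0 , \\
E\{\partial\psi/\partial\eta_2\}
  &= -E\bigl[E\{d - E(d|\mathbf{x}) \mid \mathbf{x}\}\bigr]
   = 0 .
\end{align*}
```

Each inner conditional expectation is zero because a variable minus its own
conditional mean has conditional mean zero.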


The orthogonalization result is extraordinarily powerful. The two-step 
partialing-out approach can in principle be applied for any two-step 
estimator satisfying the orthogonalization condition, also called Neyman 
orthogonalization, and in principle, the first step may use machine learners 
other than the lasso. 


28.8.9 Cross-fit partialing-out estimator 


The cross-fit partialing-out estimator is an adaptation of the partialing-out
method that reduces bias by separating the sample used for lasso predictions
of $y$ and the components of $\mathbf{d}$ from the sample used for subsequent
estimation of $\alpha$. The combination of an orthogonalized moment and
cross-fitting is called double machine learning or debiased machine learning
and leads to methods requiring weaker assumptions.


The sample is split into a larger part for the estimation of nuisance 
components and a smaller part for estimation of the parameters of interest. 
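For simplicity, consider scalar $d$ and $\alpha$. The larger sample is used
for lasso of $d$ on $\mathbf{x}$ (and $y$ on $\mathbf{x}$) and for subsequent
postselection OLS regression of $d$ (and $y$) on the selected variables that
yields predictions $\widehat{d} = \mathbf{x}'\widehat{\pi}_d$ (and
$\widehat{y} = \mathbf{x}'\widehat{\pi}_y$). The smaller sample is then used to
compute residuals $\widehat{u}_d = d - \mathbf{x}'\widehat{\pi}_d$ and
$\widehat{u}_y = y - \mathbf{x}'\widehat{\pi}_y$, and subsequent OLS
regression of $\widehat{u}_y$ on $\widehat{u}_d$ yields estimate
$\widehat{\alpha}$.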


This method reduces the complications of data mining by using one
sample to obtain the coefficients for $\widehat{\pi}_d$ and $\widehat{\pi}_y$
and using a separate sample for estimating $\alpha$. Such sample splitting
leads to a more relaxed sparsity assumption that the number of nonzero
coefficients grows at a rate no more than $N$ rather than $\sqrt{N}$. More
precisely, $s/(N/\ln p)$ should be small, where $p$ is the number of
potential control variables and $s$ is the number of variables in the true
model.
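
For a sense of how much weaker this requirement is, the following rounded
calculation compares the two denominators for the application in this
section, taking $N = 2{,}955$ and $p = 176$ from the reported output. It is an
illustration only.

```latex
\[
\frac{N}{\ln p} = \frac{2955}{\ln 176} \approx \frac{2955}{5.17} \approx 571 ,
\qquad
\frac{\sqrt{N}}{\ln p} = \frac{\sqrt{2955}}{\ln 176} \approx 10.5 .
\]
```

With cross-fitting, $s$ needs to be small relative to a quantity of the order
of 570 rather than of the order of 10.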


The preceding algorithm leads to efficiency loss because only part of the
original sample is used at the second step to estimate $\alpha$. So $K$-fold
cross-fitting is used. For clarity, set $K = 10$. Then, for each
$k = 1,\ldots,10$, we obtain estimates $\widehat{\pi}_{d,k}$ and
$\widehat{\pi}_{y,k}$ using 90% of the data and apply these estimates to the
remaining 10% of the sample to form residuals $\widehat{u}_{d,k}$ and
$\widehat{u}_{y,k}$. This yields 10 sets of residuals, each using 10% of the
sample. The default method for the xporegress command stacks these residuals
to form $N$ residuals $\widehat{u}_d$ and $\widehat{u}_y$ for the full sample;
subsequent OLS regression of $\widehat{u}_y$ on $\widehat{u}_d$ yields
estimate $\widehat{\alpha}$. The option technique(dml1) leads to an
alternative estimator
$\widehat{\alpha} = (1/10)\sum_{k=1}^{10}\widehat{\alpha}_k$, where
$\widehat{\alpha}_k$ is obtained by regression of $\widehat{u}_{y,k}$ on
$\widehat{u}_{d,k}$ in the $k$th fold.
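
The stacking version of the algorithm can be sketched by hand on simulated
data. The data-generating process, the variable names, and the way folds are
assigned are all our own assumptions, so the result will not reproduce
xporegress exactly, which forms its own random folds.

```stata
* Simulated data (illustrative only)
clear
set seed 10101
set obs 1000
forvalues j = 1/20 {
    generate x`j' = rnormal()
}
generate d = 0.5*x1 + 0.5*x2 + rnormal()
generate y = 1*d + x1 + x3 + rnormal()

* Assign observations at random to K = 10 folds
generate double u = runiform()
sort u
generate fold = mod(_n, 10) + 1

generate double ud = .
generate double uy = .
forvalues k = 1/10 {
    * Lasso and postselection OLS using the other nine folds;
    * residuals are then formed in the held-out fold k
    quietly lasso linear d x1-x20 if fold != `k', selection(plugin)
    quietly predict double dhat, postselection
    quietly replace ud = d - dhat if fold == `k'
    drop dhat
    quietly lasso linear y x1-x20 if fold != `k', selection(plugin)
    quietly predict double yhat, postselection
    quietly replace uy = y - yhat if fold == `k'
    drop yhat
}

* Stack the ten sets of residuals and run a single OLS regression
regress uy ud, vce(robust) noconstant

* Official command, default technique and the dml1 alternative
xporegress y d, controls(x1-x20) rseed(10101)
xporegress y d, controls(x1-x20) rseed(10101) technique(dml1)
```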


The xporegress command for cross-fit partialing-out has syntax similar 
to the poregress command. We obtain 


. * Cross-fit partialing-out (double/debiased) using default plugin
. xporegress ltotexp suppins, controls($rlist2) rseed(10101) nolog

Cross-fit partialing-out             Number of obs                =      2,955
linear model                         Number of controls           =        176
                                     Number of selected controls  =         31
                                     Number of folds in cross-fit =         10
                                     Number of resamples          =          1
                                     Wald chi2(1)                 =      15.66
                                     Prob > chi2                  =     0.0001

------------------------------------------------------------------------------
             |               Robust
     ltotexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     suppins |   .1856171   .0469096     3.96   0.000      .093676    .2775582
------------------------------------------------------------------------------


Note: Chi-squared test is a Wald test of the coefficients of the variables 
of interest jointly equal to zero. Lassos select controls for model 
estimation. Type lassoinfo to see number of selected variables in each 
lasso. 


Across the 10 folds, the number of selected variables ranged from 11 to 
14 for ltotexp and from 7 to 11 for suppins.


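. * Summarize the number of selected variables across the ten folds
. lassoinfo

    Estimate: active
     Command: xporegress

-------------------------------------------------------
             |                No. of selected variables
             |           Selection
    Variable |    Model     method     min  median  max
-------------+-----------------------------------------
     ltotexp |   linear     plugin      11      13   14
     suppins |   linear     plugin       7       9   11
-------------------------------------------------------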


28.8.10 Double-selection estimator 


The double-selection method performs a lasso of $y$ on $\mathbf{x}$ and
separate lassos of each component of $\mathbf{d}$ on $\mathbf{x}$. Then
$\widehat{\alpha}$ is the coefficient of $\mathbf{d}$ from OLS regression of
$y$ on $\mathbf{d}$ and the union of all components of $\mathbf{x}$ selected
by the various lassos.


The method has the advantage of simplicity. It requires a sparsity
assumption similar to that for the partialing-out estimator, and it is
asymptotically equivalent to the partialing-out estimator.


The dsregress command for double-selection estimation has syntax 
similar to the xporegress command. We obtain 


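. * Double-selection partial linear model using default plugin
. dsregress ltotexp suppins, controls($rlist2)

Estimating lasso for ltotexp using plugin
Estimating lasso for suppins using plugin

Double-selection linear model        Number of obs               =      2,955
                                     Number of controls          =        176
                                     Number of selected controls =         21
                                     Wald chi2(1)                =      15.30
                                     Prob > chi2                 =     0.0001

------------------------------------------------------------------------------
             |               Robust
     ltotexp | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     suppins |   .1836224   .0469429     3.91   0.000      .091616    .2756289
------------------------------------------------------------------------------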


Note: Chi-squared test is a Wald test of the coefficients of the variables 
of interest jointly equal to zero. Lassos select controls for model 
estimation. Type lassoinfo to see number of selected variables in each 
lasso. 


The coefficient of 0.1836 is very close to the partialing-out estimate of 
0.1839. 
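
The double-selection steps can also be sketched manually. The code below uses
simulated data with hypothetical names y, d, and x1-x20, and it assumes that
the lasso command stores the names of the selected variables in the macro
e(allvars_sel); this should be checked against the stored results listed in
[LASSO] lasso.

```stata
* Simulated data (illustrative only)
clear
set seed 10101
set obs 500
forvalues j = 1/20 {
    generate x`j' = rnormal()
}
generate d = 0.5*x1 + 0.5*x2 + rnormal()
generate y = 1*d + x1 + x3 + rnormal()

* Lasso of y on x and lasso of d on x, saving the selected variables
quietly lasso linear y x1-x20, selection(plugin)
local sel_y `e(allvars_sel)'
quietly lasso linear d x1-x20, selection(plugin)
local sel_d `e(allvars_sel)'

* Union of the two selected sets, then OLS of y on d and the union
local sel_union : list sel_y | sel_d
regress y d `sel_union', vce(robust)

* Official command for comparison
dsregress y d, controls(x1-x20)
```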


28.9 Machine learning for inference in other models 


Belloni, Chernozhukov, and Wei (2016) extend the methods for the partial
linear model to a generalized partial linear model, in which case (28.4)
becomes

$$E(y|\mathbf{d},\mathbf{x}) = f(\mathbf{d}'\boldsymbol{\alpha} + \mathbf{x}'\boldsymbol{\gamma})$$

where the function $f(\cdot)$ is specified. Now the partial effect of a change
in $\mathbf{d}$ is more complicated, being
$\boldsymbol{\alpha} \times f'(\mathbf{d}'\boldsymbol{\alpha} + \mathbf{x}'\boldsymbol{\gamma})$.
Partial effects in this single-index model can be interpreted as in
section 13.7.3. In the special case that $f(\cdot) = \exp(\cdot)$, the partial
effects can be interpreted as semielasticities, and for a logit model with
$f(z) = e^z/(1 + e^z)$, the partial effects can be interpreted in terms of the
log-odds ratio; see section 10.5.


In this section, we illustrate these extensions. Additionally, we present
an extension of the partial linear model to the case where the regressors
$\mathbf{d}$ are
endogenous. 


28.9.1 Estimators for exponential conditional mean models 


An exponential conditional mean variant of the partial linear model specifies
$E(y|\mathbf{d},\mathbf{x}) = \exp(\mathbf{d}'\boldsymbol{\alpha} + \mathbf{x}'\boldsymbol{\gamma})$.
Because only $\boldsymbol{\alpha}$ is consistently estimated, we cannot
compute a marginal effect of a component of $\mathbf{d}$ changing, but, given
the exponential conditional mean specification, we can interpret each
component of $\boldsymbol{\alpha}$ as a semielasticity; see section 13.7.3.


Commands popoisson, xpopoisson, and dspoisson have syntax similar 
to command poregress. The method is detailed in Methods and formulas in 
[LASSO] popoisson. 


These commands are applicable to any nonnegative dependent variable 
with exponential conditional mean and are not restricted to count data. So 
we could apply this method with dependent variable totexp, the level of 
health expenditures. 


Nonetheless, we illustrate the method for a count, converting totexp to a
count that takes values between 0 and 15. We compare standard Poisson 
regression estimates with the partialing-out estimate. We obtain 


. * Exponential variant of partial linear model and partialing-out estimator
. generate ycount = floor(sqrt(totexp/500))

. summarize ycount

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      ycount |      2,955    2.633841    2.202957          0         15

. qui poisson ycount suppins $xlist2 $dlist2, vce(robust)
. estimates store PSMALL
. qui poisson ycount suppins $rlist2, vce(robust)
. estimates store PFULL
. qui popoisson ycount suppins, controls($rlist2) coef
. estimates store PPOLASSO
. estimates table PSMALL PFULL PPOLASSO, keep(suppins) b(%9.4f) se
> stats(N df_m k_controls_sel)

-------------------------------------------------
    Variable |   PSMALL      PFULL     PPOLASSO
-------------+-----------------------------------
     suppins |    0.0602     0.0645      0.0666
             |    0.0298     0.0299      0.0304
-------------+-----------------------------------
           N |      2955       2955        2955
        df_m |   19.0000    99.0000
k_controls~l |                          22.0000
-------------------------------------------------
                                     Legend: b/se


The partialing-out estimator with the option coef yields
$\widehat{\alpha} = 0.0666$, so private insurance is associated with 6.66%
higher outcome $y$. The default is to instead report exponentiated
coefficients such as $\exp(0.0666) = 1.0689$, in which case $y$ is viewed as
1.0689 times higher.
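
The link between the two ways of reporting the estimate follows directly from
the exponential conditional mean. The following lines work through this for a
scalar binary regressor $d$; they restate the model above rather than adding
anything new.

```latex
\begin{align*}
E(y \mid d, \mathbf{x}) &= \exp(\alpha d + \mathbf{x}'\boldsymbol{\gamma})
  \quad\Longrightarrow\quad
  \frac{\partial \ln E(y \mid d, \mathbf{x})}{\partial d} = \alpha , \\
\frac{E(y \mid d = 1, \mathbf{x})}{E(y \mid d = 0, \mathbf{x})}
  &= \exp(\alpha) , \qquad
  \exp(0.0666) \approx 1.0689 .
\end{align*}
```

The coefficient 0.0666 is therefore the approximate proportionate change, and
the exponentiated value corresponds to an exact ratio of conditional means of
1.0689, that is, a 6.89% increase.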


28.9.2 Estimators for the logit model 


A logistic variant of the partial linear model specifies
$E(y|\mathbf{d},\mathbf{x}) = \Lambda(\mathbf{d}'\boldsymbol{\alpha} + \mathbf{x}'\boldsymbol{\gamma})$,
where $\Lambda(z) = e^z/(1 + e^z)$. Because only $\boldsymbol{\alpha}$ is
consistently estimated, we cannot compute a marginal effect of a component
of $\mathbf{d}$ changing, but, given the logistic specification, we can
interpret each component of $\boldsymbol{\alpha}$ as the impact on the
log-odds ratio or, equivalently, each exponentiated component of
$\boldsymbol{\alpha}$ as the impact on the odds ratio; see
section 10.5.
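
The odds interpretation can be derived in two lines for a scalar regressor
$d$. This is a standard property of the logistic function, written here in
the notation of this section.

```latex
\begin{align*}
\Pr(y = 1 \mid d, \mathbf{x}) = \Lambda(\alpha d + \mathbf{x}'\boldsymbol{\gamma})
  \quad&\Longrightarrow\quad
  \frac{\Pr(y = 1 \mid d, \mathbf{x})}{\Pr(y = 0 \mid d, \mathbf{x})}
  = \exp(\alpha d + \mathbf{x}'\boldsymbol{\gamma}) , \\
\frac{\text{odds at } d + 1}{\text{odds at } d} &= \exp(\alpha) .
\end{align*}
```

A one-unit increase in $d$ thus adds $\alpha$ to the log odds and multiplies
the odds by $\exp(\alpha)$, whatever the value of
$\mathbf{x}'\boldsymbol{\gamma}$.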


Commands pologit, xpologit, and dslogit have syntax similar to 
command poregress. For details, see Methods and formulas in 
[LASSO] pologit. 


These commands are applicable to any dependent variable that takes a 
value between 0 and 1. To illustrate the method for a binary variable, we 
convert totexp to a variable that takes value 1 if expenditures exceed 
$4,000. The standard logit regression estimates for the odds ratio (option or) 
are compared with the partialing-out estimate. We obtain 


. * Logit variant of partial linear model and partialing-out estimator
. generate dy = totexp > 4000

. tabulate dy

         dy |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,661       56.21       56.21
          1 |      1,294       43.79      100.00
------------+-----------------------------------
      Total |      2,955      100.00

. qui logit dy suppins $xlist2 $dlist2, or vce(robust)
. estimates store LSMALL
. qui logit dy suppins $rlist2, or vce(robust)
. estimates store LFULL
. qui pologit dy suppins, controls($rlist2) coef
. estimates store LPOLLASSO
. estimates table LSMALL LFULL LPOLLASSO, keep(suppins) b(%9.4f) se
> stats(N df_m k_controls_sel)

-------------------------------------------------
    Variable |   LSMALL      LFULL    LPOLLASSO
-------------+-----------------------------------
     suppins |    0.2498     0.2792      0.2632
             |    0.0898     0.0936      0.0892
-------------+-----------------------------------
           N |      2955       2955        2955
        df_m |   19.0000    99.0000
k_controls~l |                          19.0000
-------------------------------------------------
                                     Legend: b/se


The partialing-out estimator with the option coef yields estimate
$\widehat{\alpha} = 0.2632$, so the odds Pr(y = 1|d, x)/Pr(y = 0|d, x) of
having medical expenditures exceeding $4,000 are approximately 26.3% higher
for those with supplementary insurance compared with those without
supplementary insurance. The default is to instead report the odds ratio,
here exp(0.2632) = 1.301.


28.9.3 Partialing-out for IV estimation 


For IV estimation, the most efficient estimator uses all available instruments 
according to standard asymptotic theory. But in practice this asymptotic 
theory can fail in typical sample sizes when there are too many instruments. 
This many-instruments problem can arise when a model is considerably 
overidentified, with many more instruments than endogenous regressors. But 
it can also arise in a just-identified model if there are many controls that lead 
to a low first-stage F statistic because the marginal contribution of the 
instruments becomes slight after inclusion of the many controls. 


Chernozhukov, Hansen, and Spindler (2015) extend the partialing-out
estimator to IV estimation of the linear model with selection of a subset of
control variables from many controls or a subset of instruments from many
instruments, or both.


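The poivregress command provides partialing-out estimates of
$\boldsymbol{\alpha}$ and $\boldsymbol{\delta}$ in the model

$$y = \mathbf{d}'\boldsymbol{\alpha} + \mathbf{w}'\boldsymbol{\delta} + \mathbf{x}'\boldsymbol{\gamma} + v$$

where $\mathbf{d}$ are endogenous variables, $\mathbf{w}$ are exogenous
variables to always be included, and $\mathbf{x}$ are exogenous control
variables that may potentially be included. Additionally, there are
instruments $\mathbf{z}$ with $\dim[\mathbf{z}] \geq \dim[\mathbf{d}]$.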


The poivregress command has syntax 


poivregress depvar [exovars] (endovars = instrumvars) [if] [in],
    controls([(alwaysvars)] othervars) [options]


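For simplicity, consider the case of scalar endogenous regressor $d$ and
$\boldsymbol{\delta} = \mathbf{0}$. The partialing-out algorithm is the
following.

1. Calculate a partialed-out dependent variable as the residual
   $\widehat{u}_{y,i}$ from OLS regression of $y$ on $\mathbf{x}_y$, where
   $\mathbf{x}_y$ denotes the selected variables from a lasso of $y$ on
   $\mathbf{x}$.

2. Calculate a scalar instrument $\breve{u}_{d,i}$ as follows. Perform a lasso
   of $d$ on $\mathbf{x}$ and $\mathbf{z}$, and denote the selected variables
   as, respectively, $\mathbf{x}_d$ and $\mathbf{z}_d$. Then, obtain a
   prediction $\widehat{d}$ from OLS regression of $d$ on $\mathbf{x}_d$ and
   $\mathbf{z}_d$. Then, calculate the residual $\breve{u}_{d,i}$ and the
   coefficients $\widehat{\boldsymbol{\beta}}$ from OLS regression of
   $\widehat{d}$ on $\mathbf{x}_{\widehat{d}}$, where
   $\mathbf{x}_{\widehat{d}}$ denotes the selected variables from a lasso of
   $\widehat{d}$ on $\mathbf{x}$.

3. Calculate a partialed-out endogenous regressor
   $\widetilde{u}_{d,i} = d_i - \mathbf{x}_{\widehat{d},i}'\widehat{\boldsymbol{\beta}}$.

4. Compute $\widehat{\alpha}$ by IV regression of $\widehat{u}_{y,i}$ on
   $\widetilde{u}_{d,i}$ with $\breve{u}_{d,i}$ as the instrument.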
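
In this just-identified scalar case with no constant, the final IV step has a
simple closed form. The expression below is the standard just-identified IV
formula applied to the three constructed variables, written with
$\widehat{u}_{y,i}$ for the partialed-out dependent variable,
$\widetilde{u}_{d,i}$ for the partialed-out endogenous regressor, and
$\breve{u}_{d,i}$ for the instrument; the accents are our own labeling.

```latex
\[
\widehat{\alpha}
  = \frac{\sum_{i=1}^{N} \breve{u}_{d,i}\,\widehat{u}_{y,i}}
         {\sum_{i=1}^{N} \breve{u}_{d,i}\,\widetilde{u}_{d,i}} .
\]
```

In the manual implementation given later in this section, these three
variables correspond to yresid, dresid, and dhatresid, respectively.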


As an example, we consider a variant of the analysis of Belloni 
et al. (2012) based on Acemoglu, Johnson, and Robinson (2001). In this 
example, the model is just identified, but there are only 64 observations and 
24 potential controls that could lead to a weak instrument problem. 


The goal is to use a cross-sectional sample of countries to measure the 
causal effect on per capita income (logpgp95) of protection against 
expropriation risk (avexpr). The mortality rate of early settlers (logem4) is 
used as an instrument for avexpr. The global macro xlist includes 24
possible control variables that include measures of country latitude, 
temperature, humidity, soil types, and natural resources. 


. * Read in Acemoglu-Johnson-Robinson data and define globals
. qui use mus228ajr, clear

. global xlist lat_abst edes1975 avelf temp* humid* steplow deslow
> stepmid desmid drystep drywint goldm iron silv zinc oilres landlock

. describe logpgp95 avexpr logem4

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
logpgp95        float   %9.0g                 Log PPP GDP pc in 1995, World Bank
avexpr          float   %9.0g                 Average protection against
                                                expropriation risk
logem4          float   %9.0g                 Log settler mortality

. summarize logpgp95 avexpr logem4, sep(0)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    logpgp95 |         64    8.062237    1.043359   6.109248   10.21574
      avexpr |         64    6.515625    1.468647        3.5         10
      logem4 |         64    4.657031    1.257984   2.145931   7.986165


We use command poivregress with the plugin value of $\lambda$ for the
homoskedastic case.


. * Partialing-out IV using plugin for lambda
. poivregress logpgp95 (avexpr=logem4), controls($xlist) selection(plugin, hom)

Estimating lasso for logpgp95 using plugin
Estimating lasso for avexpr using plugin
Estimating lasso for pred(avexpr) using plugin

Partialing-out IV linear model      Number of obs                  =         64
                                    Number of controls             =         24
                                    Number of instruments          =          1
                                    Number of selected controls    =          5
                                    Number of selected instruments =          1
                                    Wald chi2(1)                   =       8.74
                                    Prob > chi2                    =     0.0031

------------------------------------------------------------------------------
             |               Robust
    logpgp95 | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      avexpr |   .8798503   .2976286     2.96   0.003      .296509    1.463192
------------------------------------------------------------------------------


Endogenous: avexpr 

Note: Chi-squared test is a Wald test of the coefficients of the variables 
of interest jointly equal to zero. Lassos select controls for model 
estimation. Type lassoinfo to see number of selected variables in each 
lasso. 


Only five controls are selected, along with the single instrument. From
lassocoef (., for(avexpr)), the first lasso selected logem4, edes1975, and
zinc; from lassocoef (., for(logpgp95)), the second lasso selected edes1975
and avelf; and from lassocoef (., for(pred(avexpr))), the third lasso selected
edes1975, avelf, temp2, iron, and zinc.


From output not given, when all 24 controls are used, the regular IV
estimate of avexpr is 0.713 with standard error 0.147. The reason for the
smaller standard error of regular IV is that regular IV used an additional 19
controls that greatly improved model fit. At the same time, reducing the
number of controls in the first-stage estimation led to more precise
estimation of the first-stage coefficient of the instrument logem4.


The following code manually implements the partialing-out IV estimator
for this example. 


. * poivregress estimator in just-identified model obtained manually
. gen y = logpgp95
. gen d = avexpr
. global zlist logem4
. qui lasso linear y $xlist, selection(plugin, hom)          // Lasso of y on x
. qui predict yhat, postselection
. generate yresid = y - yhat                                 // Generate y residual
. qui lasso linear d $xlist $zlist, selection(plugin, hom)   // Lasso d on x,z
. qui predict dhat, postselection                            // Generate dhat
. qui lasso linear dhat $xlist, selection(plugin, hom)       // Lasso dhat on x
. predict dhat_hat, postselection
(option xb assumed; linear prediction with postselection coefficients)
. generate dhatresid = dhat - dhat_hat                       // Generate dhat residual
. generate dresid = d - dhat_hat                             // Generate d "residual"
. ivregress 2sls yresid (dresid = dhatresid), noconstant vce(robust)

Instrumental variables 2SLS regression            Number of obs   =         64
                                                  Wald chi2(1)    =          .
                                                  Prob > chi2     =          .
                                                  R-squared       =          .
                                                  Root MSE        =     .81396

------------------------------------------------------------------------------
             |               Robust
      yresid | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      dresid |   .8798503   .2952943     2.98   0.003     .3010841    1.458617
------------------------------------------------------------------------------
Instrumented: dresid
 Instruments: dhatresid


The estimate equals that from poivregress, while the slight difference in the 
standard error is due to different degrees-of-freedom correction. 


Care is needed in using the poivregress command because it is possible 
that the lasso of d on x and z may lead to too few instruments being 
selected, in which case the model becomes unidentified. Indeed, this 
happened in the current example when the default heteroskedastic variant of 
the plugin value of lambda was used, because then the single instrument 
logem4 in this just-identified example was not selected. 


Such selection of too few instruments is even more likely to occur with
the cross-fitting xpoivregress command because the variable selection then
occurs $K$ times. The remedy is to use a smaller value of the lasso penalty
$\lambda$, because a smaller penalty leads to more variables being selected.


28.9.4 Further discussion 


The methods illustrated have been restricted to the partial linear model and 
generalized partial linear model using lasso under the assumption of sparsity. 
These examples can be extended to the use of other machine learners and to 
application in other models. 


Farrell (2015) applies the lasso to the doubly robust augmented inverse- 
probability weighting estimator of average treatment effect (ATE) for a binary 
treatment presented in section 24.6.5. The telasso command, introduced in 
Stata 17, implements this estimator. The command syntax, similar to that for 
teffects commands, is 


telasso ipwra (ovar omvarlist [, omodel om_options])
    (tvar tmvarlist [, tmodel tm_options]) [if] [in] [weight] [, stat options]


where om_options and tm_options include options for the lasso and 
determination of the lasso penalty parameter in, respectively, the outcome 
model and the treatment model. The outcome model can be linear, logit, 
probit, or Poisson; the binary treatment model can be logit or probit; and the 
command can compute ATE, ATE on the treated, and potential-outcome 
means. For details on the implementation of the lasso, see especially the 
lasso command in section 28.4.1. The telasso command option
vce(cluster clustvar) provides cluster-robust standard errors. Note that in
this case the lasso is one that gives equal weight to each cluster rather
than to each observation.


Farrell, Liang, and Misra (2021) establish theory suitable for use of deep 
nets for causal inference and provide an application using the augmented 
inverse-probability weighting TEs estimator. 


Many of these methods use orthogonalized moment conditions (see 
section 28.8.8) and cross-fitting (see section 28.8.9). Chernozhukov et al. 
(2018) provide an excellent overview, theory, and applications that use a 
variety of machine learners (lasso, regression tree, random forest, boosting, 
and neural network) to estimate ATE and local ATE for a binary treatment with 
heterogeneous effects and for IV estimation in a partial linear model. The
machine learner needs to approximate well the nuisance part of the model. 
Appropriate assumptions to ensure this will vary with the setting and will not 
necessarily require a sparsity assumption. 


This is an exceptionally active area of current econometric research, and 
we anticipate an explosion of new methods that will be implementable in 
Stata using one’s own coding as community-contributed Stata programs and, 
ultimately in some cases, as official Stata programs. 


In a separate important strand of research that is not covered here, Wager 
and Athey (2018) use random forests to estimate the ATE for a binary 
treatment for subgroups of the population and to identify groups with the 
greatest TE. They provide nonstandard asymptotic results that yield 
pointwise confidence intervals. Their method includes the use of “honest 
trees”, qualitatively similar to cross-fitting. Wager and Athey (2019) provide 
an empirical example. 


Machine learning methods for causal inference rely heavily on 
asymptotic theory, and investigation of finite-sample behavior is quite 
recent. For example, Wüthrich and Zhu (Forthcoming) find poor finite-
sample confidence interval coverage for the partial linear model estimated 
using the double-selection lasso. They instead suggest as an alternative 
direct OLS estimation with the many controls included as regressors, and 
subsequent inference based on recently developed methods for inference 
when there are many covariates. 


28.10 Additional resources 


Machine learning methods have only recently been used in 
microeconometrics. References that are written from a statistics perspective 
include a master’s level text by James et al. (2021) and more advanced texts 
by Hastie, Tibshirani, and Friedman (2009) and Efron and Hastie (2016). 
Varian (2014) provides an early summary for economists of machine 
learning approaches and some of the software packages developed to 
handle massive datasets. Mullainathan and Spiess (2017) provide a detailed 
application of machine learning methods for prediction in economics and 
cite many applications. Athey and Imbens (2019) provide a more recent
overview. Hansen (2022, chap. 29) covers both machine learning methods 
and their use for causal inference. 


Initial research on use of machine learning for statistical inference has 
used the lasso; see Belloni, Chernozhukov, and Hansen (2014) for an 
accessible summary and illustration. Drukker (2020) provides a more recent 
account. Stata 16 introduced commands for lasso and elastic net for both 
prediction and statistical inference that are detailed in [LASSO] Stata Lasso 
Reference Manual. 


The lassopack package (Ahrens, Hansen, and Schaffer 2019) overlaps 
considerably with the Stata lasso commands for prediction. The 
accompanying article is well worth reading because it provides details on 
the background theory. The community-contributed pdslasso and ivlasso 
commands (Ahrens, Hansen, and Schaffer 2018) overlap considerably with 
the Stata commands for inference in the partial linear model and include 
some additional features. 


Other machine learning methods may be used, both for prediction and 
for statistical inference. The important innovation of double or debiased 
machine learning with orthogonalized moment conditions and cross fitting 
is presented in Chernozhukov et al. (2018); see also Chernozhukov, Newey, 


forests to estimate heterogeneous treatment effects. 


The community-contributed pylearn package (Droste 2020) provides 
functions that implement popular Python functions for regression trees, 
random forests, neural networks, adaptive boosting, and gradient boosting. 


28.11 Exercises 


1. Suppose we have a sample of 8 observations for which y takes values 1, 2, 3, 
4, 5, 6, 7, and 8. We wish to predict y using the sample mean. Compute MSE 
using all the data. Now, suppose we choose as training dataset the first, third, 
fifth, and seventh observations, and the remaining four observations are the 
test data. Compute the test MSE. Now, suppose we use four-fold CV where the 
first fold has the first and fifth observations, the second fold the second and 
sixth, the third fold the third and seventh, and the fourth fold the fourth and 
eighth. Compute MSE for each fold. Hence, compute $\mathrm{CV}_4$ and the standard 
error of $\mathrm{CV}_4$. 

2. Repeat the analysis of section 28.4.6 with regressors x1, x2, and x3 
augmented by their products and cross products, again using rseed(10101). 
Comment on the differences between the various estimates. 


3. Generate a sample of 10,000 observations using the following code. 
set obs 10000 
set seed 10101 
matrix MU = (0,0,0) 
scalar rho = 0.95 
matrix SIGMA = (1,rho,rho \ rho,1,rho \ rho,rho,1) 
drawnorm x1 x2 x3, means(MU) cov(SIGMA) 
scalar rho = 0.2 
matrix SIGMA = (1,rho,rho \ rho,1,rho \ rho,rho,1) 
drawnorm x4 x5 x6, means(MU) cov(SIGMA) 
generate y = 1 + 2*x1 + 3*x2 + 2*x1*x2 + 2*x4 + 3*x5 + 2*x4*x5 + rnormal(0,10) 


The potential regressors are x1-x6 and their products and cross products. 
Perform adaptive lasso with the option rseed(101010) on the full sample, 
on the first 1,000 observations, and on the first 100 observations. Comment 
on the ability of lasso to detect the true model as the sample size changes. 
(This comparison is easier following the example in section 28.4.6.) 
Similarly, perform OLS of y on all potential regressors. Suppose we select 
only those regressors that are statistically significant at 5%. Comment on the 
ability of OLS to detect the true model as the sample size changes. (This 
comparison is simpler using the star option of estimates table.) 


4. Perform an analysis with training and test samples qualitatively similar to 
those in section 28.7 using the same generated data as in section 28.2. 
Specifically, split the data into a training sample of 30 observations and a 
test sample of 10 observations, using command 


splitsample y, generate(train) split(1 3) values(0 1) rseed(10101) 


Use OLS and lasso with five-fold CV to obtain predictions. In the case of 
lasso, obtain predictions from both penalized coefficients and postselection 
coefficients. Which method predicts best in the training sample? Which 
method predicts best in the holdout sample? 


5. Use the same data as in section 28.7, but use only the continuous regressors, 
so the regressor list is c.($xlist)##c.($xlist). Give the same 
splitsample command as in question 4 above. Using the training dataset, 
perform 10-fold CV lasso, adaptive lasso, ridge, and elasticnet regression 
with the option rseed(101010). Compare the coefficients selected and their 
penalized values, training sample fit, and test sample fit across the estimates. 

6. Repeat the previous question but using an exponential conditional mean 
model rather than a log-linear model. The dependent variable is totexp, the 
level of expenditure; the sample should now include observations with 
totexp = 0; the lasso poisson and elasticnet poisson commands are 
used; and selected variables are compared with Poisson estimates with 
heteroskedastic-robust standard errors that are statistically significant at 5%. 

7. Fit a partial linear model using the same generated sample as that in 
question 3. The single regressor of interest is x1, and the potential controls 
are all other variables and interactions, including interactions with x1. The 
list of potential controls can be set up using commands 


global xlist2 x2 x3 x4 x5 x6 
global x1interact c.x1#c.($xlist2) 
global rlist2 $x1interact c.($xlist2)##c.($xlist2) 


Use command poregress with default options on the full sample, on the first 
1,000 observations, and on the first 100 observations. Compare the estimated 
coefficient of x1 with the DGP value as the sample size changes. Comment on 
the controls chosen for y and x1 as the sample size changes. Fit by OLS the 
model with all potential regressors included, and compare the coefficient of 
x1 (and its heteroskedastic-robust standard error) with the poregress 
estimates. Fit using command xporegress with default options, and compare 
with the poregress estimates. 


8. Use the same data as in section 28.8, and regress ltotexp on suppins and on 
controls that are only the continuous regressors, so the controls regressor list 
is c.($xlist2)##c.($xlist2). Regress ltotexp on suppins and potential 
controls using commands poregress, xporegress, and dsregress with 
default options and by OLS. Compare the fitted coefficients and their 
heteroskedastic-robust standard errors across these methods. 


Chapter 29 
Bayesian methods: Basics 


29.1 Introduction 


Bayesian methods combine prior information on parameters with a 
likelihood function for the data-generating process. The prior information 
expresses the investigator’s uncertainty about the parameters, and the 
likelihood is a parametric expression of the conditional distribution of the 
data. The Bayesian approach provides a method of combining the two and 
is an alternative statistical inference framework to classical statistics and an 
alternative computational method for model fitting. The objective in 
Bayesian analysis is to obtain an estimate of the posterior distribution of the 
model parameters. 


Until recently, Bayesian methods were infrequently used because of 
1) analytical intractability in all but the simplest models; 2) lack of good 
prior information on parameters in many settings such as regression with 
many regressors and hence many parameters; and 3) resistance from 
classical statisticians to the Bayesian approach to inference. 


The development in the 1980s of Markov chain Monte Carlo (MCMC) methods 
for estimating the posterior distribution using modern computing power has 
greatly expanded the range of models that can be fit using Bayesian 
methods, so the first concern is much less of an impediment. 


The second concern, if relevant, can be mitigated by specifying prior 
information on the parameters that is noninformative, meaning that it has 
little impact on the resulting estimates. And even if the specified prior 
information on parameters $\theta$ is very strong, in large samples, the 
likelihood dominates and the prior has little effect. 


The third concern, if relevant, can be overcome by using 
noninformative prior information and interpreting the posterior mode, 
defined in the next section, as the maximum likelihood estimator (MLE). 


The bayes prefix makes Bayesian estimation as straightforward as using 
standard regression commands such as regress and probit. The bayesmh 
command allows estimation of a wider range of models, though it requires 
specification of an appropriate density for the data and a prior distribution 
for the parameters. 


Regardless of which command is used, great care is needed in applying 
Bayesian methods. First, there is no guarantee that the MCMC iterative 
procedure has converged. There are diagnostics but there is no formal test; 
see section 29.4.4. Second, it is easy to specify an “uninformative” prior on 
a parameter that is in fact quite informative and can greatly influence the 
posterior distribution of all parameters; see section 29.6.4. Third, one can 
specify models that are nonidentified or weakly identified yet get seemingly 
sensible results; see section 29.6.5. 


In this chapter, we focus on illustrating the simplest Bayesian methods 
for the linear regression model with normal homoskedastic errors and for 
the probit binary outcome model. The subsequent chapter 30 presents 
Bayesian MCMC methods in further detail and their use as one method for 
multiple imputation of missing data. 


29.2 Bayesian introductory example 


As an introduction, before any explanation of the methods, we estimate a 
linear regression using the simplest Bayesian command, the bayes: 
regress command. The samples used in this chapter are small to speed up 
computations and because then the prior can have greater influence on 
estimates. 


29.2.1 The bayes prefix 


The bayes prefix has syntax 
bayes [, bayesopts]: estimation_command [, estopts] 


where the estimation command covers over 60 fully parametric model 
commands and the many bayesopts include those specifying the priors. 


The models covered include the leading models for linear regression (for 
example, regress, tobit); binary outcomes (logit, probit); multinomial 
outcomes (ologit, mlogit, clogit); counts (poisson, nbreg); generalized 
linear models (glm); duration models (streg); sample-selection models 
(heckman); multilevel models (mixed, melogit, meglm); and random-effects 
(RE) models (xtlogit, xtpoisson, ...). 


We illustrate the bayes prefix using the bayes: regress command. 
Later sections use the more flexible bayesmh command, though most 
examples could have been implemented using the simpler bayes prefix. 


29.2.2 Bayesian estimates using default priors 


The data are based on a random sample of 100 observations for men and 
women aged 25 to 65 years who were full-time workers in 2010, extracted 
from the American Community Survey. 


. * Read in earnings - schooling data 
. qui use mus229acs 

. describe earnings lnearnings age education 

Variable      Storage   Display    Value 
    name         type    format    label      Variable label 
------------------------------------------------------------------------- 
earnings        float   %9.0g                 Annual earnings in $ 
lnearnings      float   %9.0g                 Natural logarighm of earnings 
age             int     %8.0g                 Age in years 
education       float   %9.0g                 Educational attainment: years of 
                                                schooling 

. keep if _n <= 100 
(772 observations deleted) 

. summarize earnings lnearnings age education 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
    earnings |        100       60244    46513.19       4000     318000 
  lnearnings |        100    10.76058    .7273709   8.294049   12.66981 
         age |        100       43.33     10.9342         25         65 
   education |        100       13.69    3.158106          0         20 


We obtain Bayesian estimates using the bayes: regress command with 
default options. This is simply the usual command one would use for linear 
regression, prefixed by bayes. 


Because estimates are obtained using simulation methods, explained 
below, we set the seed to ensure reproducibility of results. We use the 
rseed() option of the bayes prefix or bayesmh command. In many cases, the 
set seed command could equivalently be used, but this is not the case, for 
example, if multiple chains are used because then the same seed would be 
erroneously set for each chain. 


. * Bayesian linear regression with uninformative prior 
. bayes, rseed(10101): regress lnearnings education age 

Burn-in ... 
Simulation ... 

Model summary 
------------------------------------------------------------------------------ 
Likelihood: 
  lnearnings ~ regress(xb_lnearnings,{sigma2}) 

Priors: 
  {lnearnings:education age _cons} ~ normal(0,10000)                       (1) 
                          {sigma2} ~ igamma(.01,.01) 
------------------------------------------------------------------------------ 
(1) Parameters are elements of the linear form xb_lnearnings. 

Bayesian linear regression                       MCMC iterations  =     12,500 
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500 
                                                 MCMC sample size =     10,000 
                                                 Number of obs    =        100 
                                                 Acceptance rate  =      .3071 
                                                 Efficiency:  min =     .07066 
                                                              avg =     .09299 
Log marginal-likelihood = -133.37046                          max =      .1512 

------------------------------------------------------------------------------ 
             |                                                Equal-tailed 
             |      Mean   Std. dev.     MCSE     Median  [95% cred. interval] 
-------------+---------------------------------------------------------------- 
lnearnings   | 
   education |  .0871874   .0217776   .000819   .0868041   .0471493   .1312628 
         age |   .008496   .0062873   .000231   .0089316  -.0037933   .0208249 
       _cons |  9.198406   .4482471   .016292   9.196124   8.319206   10.09851 
-------------+---------------------------------------------------------------- 
      sigma2 |  .4774248   .0711248   .001829   .4702676   .3587335   .6308758 
------------------------------------------------------------------------------ 
Note: Default priors are used for model parameters. 
Note: Adaptation tolerance is not met in at least one of the blocks. 


Bayesian methods combine prior information on model parameters with 
the information on these parameters obtained from the likelihood function of 
the data to obtain the posterior distribution of the parameters. 


The first set of output details the likelihood and priors. The second set of 
output gives details on the computational method used and is analogous to 
an iteration log. The MCMC iterative procedure yields 10,000 draws, called 
posterior draws, of each of the parameters. The final set of output 
summarizes the distribution of these draws. 


We consider each set of output in detail; more detailed explanation is 
given in section 29.4. 


The first set of output states that the likelihood function is the linear 
regression model with independent and identically distributed (i.i.d.) normal 
errors, the priors for the regression intercept and slope parameters are 
$N(0, 100^2)$, and the prior for the error variance is an inverse-gamma 
distribution with shape parameter 0.01 and scale parameter 0.01. The intent 
of these defaults is that the variance of these priors is so large that the priors 
are uninformative, meaning that they do not have much impact on the 
results. This need not necessarily be the case. For example, if the intercept is 
10,000, then a $N(0, 100^2)$ prior may have an impact. Going the other way, 
we may have strong prior beliefs, in which case we should provide priors 
that reflect this information. Options of the bayes prefix enable one to 
change the priors from the defaults. 


The second set of output states that 12,500 draws of the parameters were 
made, the first 2,500 were discarded (the burn-in), and the remaining 10,000 
were kept. These 10,000 draws are not independent of each other. The 
average sampling efficiency statistic of 0.09299 means that on average for 
the 4 parameters, the 10,000 correlated draws contain as much information 
as if 10,000 × 0.09299 ≈ 930 independent draws had been made. 
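
The effective number of draws can also be reported parameter by parameter rather than computed by hand from the average efficiency. The following is a minimal sketch, assuming the mus229acs dataset and the model fit above; it uses the bayesstats ess postestimation command, which lists the effective sample size, correlation time, and efficiency for each parameter.

```stata
* Sketch: per-parameter effective sample sizes after the bayes prefix
* (assumes mus229acs is available in the working directory)
quietly use mus229acs, clear
quietly keep if _n <= 100
quietly bayes, rseed(10101): regress lnearnings education age
bayesstats ess
```

Each reported ESS equals the MCMC sample size multiplied by that parameter's efficiency, so the smallest ESS identifies the parameter whose posterior summaries are least precisely simulated.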


The final set of output summarizes the results. As expected, earnings 
increase with education and age. Consider the results for $\beta_{\text{ed}}$, the 
coefficient of education. The 10,000 posterior draws have mean 0.0872 and 
standard deviation 0.0218, and the interval between the 2.5 and 97.5 
percentiles of the 10,000 draws is [0.0471, 0.1313]. A purely Bayesian 
interpretation of the results is that the posterior distribution of the values that 
the parameter $\beta_{\text{ed}}$ may take has mean 0.0872 and standard deviation 
0.0218, and there is a 95% probability that $\beta_{\text{ed}}$ lies in the interval 
[0.0471, 0.1313]. The MCSE column provides the Monte Carlo standard error, 
defined in section 29.4.4, that should be small relative to the corresponding 
standard deviation. That is the case here. 


The Bayesian and maximum likelihood (ML) approaches are compared in 
some detail in section 29.3.6. In this example, the MLE for the regression 
parameters is just the ordinary least-squares (OLS) estimator. We obtain 


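. * ML linear regression (same as OLS with i.i.d. errors) 
. regress lnearnings education age, noheader 
------------------------------------------------------------------------------ 
  lnearnings | Coefficient  Std. err.      t    P>|t|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
   education |   .0852959   .0221804     3.85   0.000     .0412739    .1293178 
         age |   .0079952   .0064063     1.25   0.215    -.0047195      .02071 
       _cons |   9.246449   .4546021    20.34   0.000      8.34419    10.14871 
------------------------------------------------------------------------------ 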


The ML estimate for education is 0.0853 with standard error 0.0222 and a 
95% confidence interval for $\beta_{\text{ed}}$ of [0.0413, 0.1293]. In this example, 
with an uninformative prior, the ML coefficients, standard errors, and 95% 
confidence intervals are similar to the Bayesian posterior means, posterior 
standard deviations, and 95% credible regions. 


A key feature of Bayesian MCMC methods is that they yield an estimate of 
the distribution of parameters. To highlight this, we use the bayesgraph 
kdensity command to plot the densities of the 10,000 draws for each of the 
4 model parameters. 


. * Plot of the density of the 10,000 draws for each parameter 
. quietly bayes, rseed(10101): regress lnearnings education age 


. bayesgraph kdensity _all, combine 


Figure 29.1 gives the density plots. The benefit of having draws of the 
parameters is that one can then easily do inference on any transformation of 
the parameters. For monotonic transformations $g(\beta)$, such as 
$g(\beta) = \exp(\beta)$ or $g(\beta) = \exp(\beta x)$, a 95% credible interval is the 
interval between the 2.5 percentile and 97.5 percentile of the 10,000 values 
of $g(\beta)$. There is no need to use the delta method. 
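
Inference on a transformation of a parameter can be sketched as follows. This is a minimal illustration, assuming the mus229acs data and the model above; the label expb_ed is an arbitrary name chosen here, and bayesstats summary evaluates the expression at each of the 10,000 posterior draws before summarizing.

```stata
* Sketch: posterior summary of exp(beta_education) computed from the draws
* (assumes mus229acs is available; the label expb_ed is arbitrary)
quietly use mus229acs, clear
quietly keep if _n <= 100
quietly bayes, rseed(10101): regress lnearnings education age
bayesstats summary (expb_ed: exp({lnearnings:education}))
```

The reported equal-tailed credible interval is formed from the percentiles of the transformed draws, so no delta-method approximation is involved.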


Figure 29.1. Posterior densities of the three regression parameters 
and the error variance parameter 


It is important to look at various diagnostics to ensure that the 
computational method has led to sensible results. In particular, the MCMC 
iterative procedure needs to have converged. Various checks and potential 
pitfalls are presented in this chapter. 


29.3 Bayesian methods overview 


In this section, and more generally in this chapter, we provide a very brief 
introduction to Bayesian methods. Bayesian analysis is presented in many 
complete books, including the econometrics text of Koop (2003) and the 
statistics text of Gelman et al. (2013). See also Cameron and Trivedi (2005, 
chap. 13). 


29.3.1 Posterior distribution 


Let the data on N observations be denoted (y, X), where y is data on the 
dependent variables and X is data on exogenous regressors. And let $\theta$ 
denote the $k$ parameters of the conditional density of y given X. 


The key ingredients for Bayesian analysis are a specified likelihood 
function for the data and a specified prior density for the parameters. These 
are combined to give the posterior density as follows: 


1. Likelihood function (for data given parameters), denoted $L(y|\theta, X)$. 
2. Prior density (for parameters), denoted $\pi(\theta)$. 
3. Posterior density (for parameters given data), denoted $p(\theta|y, X)$. 


As an example, we might assume that y, conditional on data X and 
parameters $\beta$, is normally distributed with mean $X\beta$ and variance matrix 
$\sigma^2 I$, where $\sigma^2$ is known. One possible prior for $\beta$ is that it is normally 
distributed with specified mean and variance. For this example, a weaker 
prior for $\beta$ has a larger prior covariance matrix. 


The posterior density is defined as 


$$p(\theta|y, X) = \frac{L(y|\theta, X) \times \pi(\theta)}{m(y|X)} \qquad (29.1)$$


where 


$$m(y|X) = \int L(y|\theta, X) \times \pi(\theta)\, d\theta$$


is called the marginal likelihood. This result is obtained by repeated use of 
Bayes rule. First, note that $\Pr[A|B] = \Pr[A \cap B]/\Pr[B]$ by Bayes rule. 
Second, $\Pr[A \cap B] = \Pr[B|A] \times \Pr[A]$ by the multiplication rule. 
Combining yields $\Pr[A|B] = \Pr[B|A] \times \Pr[A]/\Pr[B]$. Here $\theta$ corresponds 
to A, y corresponds to B, and we have additionally conditioned on X 
throughout. 


The marginal likelihood is a normalizing constant that does not depend 
on $\theta$, and hence the result is more simply expressed as 


$$p(\theta|y, X) \propto L(y|\theta, X) \times \pi(\theta)$$


Posterior $\propto$ Likelihood $\times$ Prior 
where the symbol $\propto$ means “is proportional to”. 
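
As a worked illustration of the proportionality result, consider a scalar special case in which the posterior is available analytically. The symbols $\mu_0$ and $\tau^2$ (prior mean and prior variance) and $\mu_1$ and $\tau_1^2$ (posterior mean and posterior variance) are notation introduced here for illustration; the result is the standard normal-normal conjugate case.

```latex
% Data: y_i | mu ~ N[mu, sigma^2] i.i.d., sigma^2 known; prior: mu ~ N[mu_0, tau^2]
p(\mu \mid y) \;\propto\;
  \exp\!\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i-\mu)^2\right\}
  \times
  \exp\!\left\{-\frac{1}{2\tau^2}(\mu-\mu_0)^2\right\}

% Completing the square in mu gives a normal posterior:
\mu \mid y \sim N[\mu_1, \tau_1^2], \qquad
\tau_1^2 = \left(\frac{1}{\tau^2} + \frac{N}{\sigma^2}\right)^{-1}, \qquad
\mu_1 = \tau_1^2\left(\frac{\mu_0}{\tau^2} + \frac{N\bar{y}}{\sigma^2}\right)
```

The posterior mean is a precision-weighted average of the prior mean and the sample mean, and the normalizing constant never has to be computed because only terms involving $\mu$ matter.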


The challenge in implementing the above is obtaining the normalizing 
constant $m(y|X)$ because it involves a $k$-dimensional integral that has no 
analytical solution except in some very simple models. Initial work used 
numerical methods such as importance sampling to compute this integral. 
More recently, MCMC methods provide a way to make draws from the 
posterior $p(\theta|y, X)$ without having to compute the marginal likelihood 
$m(y|X)$. 


29.3.2 MCMC methods 


The target object of inference is the unknown posterior distribution 
$p(\theta|y, X)$. MCMC methods provide a computational methodology for 
sequentially creating a chain that converges to the target posterior, which is 
called the stationary distribution. The key insight is that even if the 
mathematical expression of the posterior is not known, a large sample of 
draws from the posterior is equivalent to it. It is then straightforward to 
estimate the moments of the posterior, such as the posterior mean, as well as 
functions of estimates and data, such as marginal effects (MEs) in a nonlinear 
regression model. 


The goal then is to make draws of $\theta$ from the posterior. One way to do so 
is to make draws from a proposal or candidate distribution, such as the 
normal, and then use a rule to decide whether to accept such draws as being 
draws from the target distribution, here the posterior distribution. 


The simplest such method, accept-reject sampling, is not possible 
because it uses an acceptance rule based on the functional form for the 
posterior distribution; in most cases, this is unknown because of lack of an 
analytical expression for the marginal likelihood $m(y|X)$. 


Instead, MCMC sampling methods are used. These make posterior draws, 
denoted $\theta_s$ in the $s$th round, based on the draw $\theta_{s-1}$ in the previous 
round. The term “Monte Carlo” arises because of reliance on random draws. 
The draws form a Markov chain because once we condition on $\theta_{s-1}$, the 
preceding draws $\theta_{s-2}, \theta_{s-3}, \ldots$ have no additional impact on $\theta_s$. 


Under appropriate assumptions, Markov chains eventually converge to a 
stationary distribution, in which case subsequent draws $\theta_{s+1}, \theta_{s+2}, \ldots$ 
all have the same marginal distribution. Bayesian MCMC methods are set up 
so that the stationary distribution is the desired posterior $p(\theta|y, X)$. 


The posterior draws are clearly correlated because $\theta_s$ depends on $\theta_{s-1}$. 
There is nonetheless no problem in using the draws from the posterior, 
though if they are highly correlated, then one needs to make more draws to 
ensure sufficient precision. 


MCMC methods potentially enable one to specify very complicated 
models. But it is easy to perform invalid analysis. Poorly identified or even 
unidentified models can yield seemingly reasonable results. 


Furthermore, even in well-specified models, there is no guarantee that 
the Markov chain has converged to its stationary distribution. The Stata 
default is to discard the first 2,500 MCMC draws, called “burn-in” draws, and 
to retain only subsequent draws. But the chain may take many more draws 
than this to converge. Unfortunately, there is no formal test for whether the 
chain has converged; some diagnostics are presented in sections 29.4 
and 29.6. 


29.3.3 Metropolis-Hastings algorithm 


The main MCMC method is the Metropolis-Hastings (MH) algorithm. This 
algorithm uses an acceptance rule based on a ratio of two posterior densities 
$p(\theta|y, X)$ for which the (unknown) marginal likelihood term cancels out. 
Specifically, for draws $s = 1, \ldots, S$, 


1. generate a draw $\theta_s^*$ from the proposal density or candidate density 
$q(\theta_s^*|\theta_{s-1})$. For the commonly used random-walk MH algorithm, the 
proposal density is the normal or multivariate normal with mean $\theta_{s-1}$. 

2. calculate the acceptance probability $\alpha_s$, where 

$$\alpha_s = \min\{r(\theta_s^*|\theta_{s-1}), 1\}$$

$$r(\theta_s^*|\theta_{s-1}) = \frac{p(\theta_s^*|y, X) \times q(\theta_{s-1}|\theta_s^*)}{p(\theta_{s-1}|y, X) \times q(\theta_s^*|\theta_{s-1})}$$

3. draw $u_s$ from the uniform(0, 1) distribution. 
4. accept $\theta_s = \theta_s^*$ if $u_s < \alpha_s$ and otherwise set $\theta_s = \theta_{s-1}$. 


In step 2, the ratio of posteriors $p(\theta_s^*|y, X)/p(\theta_{s-1}|y, X)$ simplifies to 


$$\frac{L(y|\theta_s^*, X) \times \pi(\theta_s^*)}{L(y|\theta_{s-1}, X) \times \pi(\theta_{s-1})}$$


because the marginal likelihood $m(y|X)$ is a common term that does not 
depend on $\theta$ and hence cancels out in the ratio. 


Further simplification is possible if the proposal density is a symmetric 
density, satisfying $q(\theta_{s-1}|\theta_s^*) = q(\theta_s^*|\theta_{s-1})$. An example is the random- 
walk MH algorithm with $\theta_s^*|\theta_{s-1} \sim N[\theta_{s-1}, V]$ for fixed $V$. In that case, 
the MH algorithm is strictly speaking the Metropolis algorithm, but the more 
general terminology “MH algorithm” is still commonly used. 
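
The four steps can be sketched in Mata for a toy problem. This is an illustrative sketch only, not the adaptive algorithm that bayesmh implements: the data-generating process, the N(5,4) prior, the starting value of 0, the fixed proposal standard deviation of 2, and the names logpost, cand, logr, draws, and mh_mean are all assumptions made here. Because the normal proposal is symmetric, the q terms cancel and the acceptance ratio is the ratio of likelihood times prior, evaluated in logs for numerical stability.

```stata
* Sketch: random-walk Metropolis for the mean of N(mu,100) data, N(5,4) prior
clear
set obs 50
set seed 10101
generate y = rnormal(10,10)

mata:
// Log of likelihood x prior, up to an additive constant; m(y|X) is never needed
real scalar logpost(real scalar mu, real colvector y)
{
    return(-sum((y :- mu):^2)/(2*100) - (mu - 5)^2/(2*4))
}

y     = st_data(., "y")
S     = 12500                    // total draws
draws = J(S, 1, .)
mu    = 0                        // arbitrary starting value
for (s = 1; s <= S; s++) {
    cand = mu + 2*rnormal(1, 1, 0, 1)            // step 1: symmetric proposal
    logr = logpost(cand, y) - logpost(mu, y)     // step 2: log of the ratio r
    if (ln(runiform(1, 1)) < logr) mu = cand     // steps 3-4: accept, else keep
    draws[s] = mu
}
kept = draws[|2501 \ S|]         // discard the first 2,500 draws as burn-in
mean(kept), sqrt(variance(kept)) // simulated posterior mean and std. dev.
st_numscalar("mh_mean", mean(kept))
end
```

Comparing ln(u) with the log ratio is equivalent to comparing u with min(r, 1), because u never exceeds 1. For this conjugate problem, the simulated mean can be checked against the analytical posterior mean.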


29.3.4 Sampling efficiency 


MCMC methods lead to draws from the posterior that are correlated. As a 
result, 10,000 MCMC draws are equivalent to many fewer independent draws. 


Let $\rho_j$ denote the $j$th autocorrelation of the posterior draws. The 
sampling efficiency statistic given in the bayesmh output equals 
$1/(1 + 2 \times \sum_{j=1}^{\max} \rho_j)$, where max is the minimum of 500 and the lag at 
which $|\rho_j| < 0.01$. 


For example, if the sampling efficiency is 0.05, then 10,000 MCMC draws 
are equivalent to 0.05 × 10,000 = 500 independent draws. One way that 
sampling efficiency of approximately 0.05 can arise is if $\rho_j = 0.9^j$, because 
then $\sum_j \rho_j \approx 9$ and $1/(1 + 2 \times 9) \approx 0.05$. Sampling efficiency 
this low is not unusual. 


29.3.5 Gibbs sampling algorithm 


The Gibbs sampler is an MCMC method that has smaller autocorrelation of the 
MCMC draws, leading to higher sampling efficiency. This method can be 
applied in special cases where analytical results are available for one or more 
conditional posteriors (as usual, analytical results for the unconditional 
posterior are not available). 


Consider the case where parameters are split into two blocks, so 
$\theta = (\theta^1, \theta^2)$. If we knew both the conditional posterior of $\theta^1|\theta^2$ and 
the conditional posterior of $\theta^2|\theta^1$, then we could directly apply the Gibbs 
sampler, alternating draws from the two conditional posteriors. Section 5.4.6 
provides a Gibbs sampler example. 


In practice, this is rarely possible. But in some cases, we may know one 
of the conditional posteriors, say, that for $\theta^1|\theta^2$. Then, we draw $\theta^1$ from the 
conditional posterior of $\theta^1|\theta^2$ and use the MH algorithm to draw $\theta^2$ given 
$\theta^1$. The manual entry for [BAYES] bayesmh gives combinations of 
likelihood and prior for which such a hybrid MH scheme is possible. An 
example is given in section 29.7.2. 


29.3.6 Bayesian and classical approaches compared 


In classical statistics, given data and the likelihood function for population 
parameters $\theta$, we obtain the MLE $\hat{\theta}$, which is random because of sampling 
error. In all but the simplest models, statistical inference on $\theta$ is based on 
asymptotic approximations. A resulting 95% confidence interval for a scalar 
parameter $\theta$ is interpreted as one that with repeated sampling of the 
population will include $\theta$ 95% of the time. 


The Bayesian approach introduces randomness via prior beliefs on the 
values of the population parameters that are not exact. This prior information 
on $\theta$ is combined with sample information to yield the posterior distribution 
of $\theta$. Given the posterior distribution, statistical inference on $\theta$ is direct and 
does not require asymptotic approximations. And a 95% Bayesian credible 
region is directly interpreted as being an interval that scalar $\theta$ lies in with 
probability 0.95. 


Additionally, there are decision-theoretic underpinnings for the Bayesian 
approach. Suppose we seek an estimate $\hat{\theta}$ of $\theta$ that minimizes a loss function 
$L(\theta, \hat{\theta})$. Then the optimal estimator is the estimator that minimizes expected 
posterior loss, where expectation is with respect to the posterior distribution 
of $\theta$. The posterior mean is optimal given quadratic loss, and the posterior 
median is optimal given absolute error loss. Furthermore, the optimal Bayes 
estimator minimizes expected risk for a specified loss function, where 
expected risk averages expected posterior loss over possible samples (by 
additionally integrating with respect to the likelihood function). 
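
The claim for quadratic loss can be verified in one line for a scalar parameter. The following is a short sketch using the standard decomposition of a mean squared deviation, where expectations and variances are taken with respect to the posterior distribution.

```latex
% Expected posterior quadratic loss of an estimate \hat{\theta}
E\left[(\theta - \hat{\theta})^2 \mid y, X\right]
  = \mathrm{Var}[\theta \mid y, X]
  + \left(E[\theta \mid y, X] - \hat{\theta}\right)^2

% The first term does not depend on \hat{\theta}; the second is zero when
\hat{\theta} = E[\theta \mid y, X]
```

The minimizer is therefore the posterior mean, and the minimized expected loss equals the posterior variance.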


Rather than provide a Bayesian interpretation to results, some applied 
statisticians and econometricians use Bayesian methods with a 
noninformative prior as a computational tool to obtain the MLE in settings 
where alternative computational methods such as maximum simulated 
likelihood or quadrature are not as computationally efficient. In that case, the 
posterior mode, the peak of the posterior density, should be used because 
maximizing the likelihood function corresponds to finding the mode. 


The bayes prefix and bayesmh command do not report the posterior 
mode because it is not of interest to Bayesians and because its computation 
can be problematic if there are multiple peaks of the posterior such as in the 
density plots in figure 29.1. The posterior mean is often used instead for 
convenience and is close to the posterior mode if the posterior density is 
unimodal and symmetric. 


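As the sample size gets larger, the difference between the posterior mean 
and the MLE narrows. For simplicity, assume that the observations are i.i.d. 
Then, taking the logarithm of the posterior defined in (29.1), we have 


$$\ln p(\theta|y, X) = \ln \pi(\theta) + \sum_{i=1}^{N} \ln f(y_i|x_i, \theta) + c$$


where $f(y_i|x_i, \theta)$ is the conditional density of the $i$th observation and 
$c = -\ln m(y|X)$ does not depend on $\theta$. 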


In a large sample, the posterior is dominated by the likelihood contribution 
because the contribution of the prior to the posterior remains fixed, while the 
contribution of the sample to the posterior grows with N. Furthermore, if the 
likelihood satisfies the standard regularity conditions, so that the MLE is 
root-N consistent and asymptotically normal, it can be shown that the posterior 
distribution of $\theta$ has an asymptotic normal distribution with mean the MLE 
and variance the inverse of the information matrix; see, for example, 
Cameron and Trivedi (2005, 432). 


In the presentation below, we contrast the posterior mean, standard 
deviation, and 95% credible region of a parameter with the corresponding 
MLE, standard error, and 95% confidence interval. Differences between the 
two reflect in part the informativeness of the prior and finite sample 
asymmetry in the posterior distribution. 


29.3.7 The bayesmh command 


The bayesmh command is the essential Stata command for Bayesian 
analysis. The command covers a very broad range of models, one broader 
than the simpler bayes prefix, introduced in section 29.2.1. 


The basic syntax for this command for a univariate regression model is 


bayesmh depvar [indepvars] [if] [in] [weight], likelihood(modelspec) prior(priorspec) 
[options] 


The likelihood can be that for leading models, including generalized 
linear models (normal, probit, logit, poisson, exponential), as well as 
lognormal and ordered probit and logit. Additionally, a user-provided log 
likelihood can be specified using the option llevaluator(). 


The prior distribution for a scalar parameter, discussed further below, can 
be normal, Student’s t, Cauchy, lognormal, uniform, gamma, inverse- 
gamma, exponential, beta, Laplace, Pareto, chi-squared, flat, Jeffreys, 
Bernoulli, Poisson, geometric, or a discrete index. In the vector case, priors 
include the multivariate normal, Wishart, inverse-Wishart, Dirichlet, and 
Jeffreys and Zellner’s g-prior. Additionally, a user-provided prior density can 
be specified. For replicability, one should set the seed using the rseed() 
option of the bayesmh command because results will always vary with the 
initial seed. 


Postestimation command bayesstats summary provides summary 
statistics for functions of the parameters; commands bayesgraph, 
bayesstats ess, and bayesstats grubin present diagnostics for the MCMC 
draws; commands bayesstats ic, bayestest interval, and bayestest 
model are used for Bayesian inference and model selection; command 
bayesstats ppvalues is used for posterior model checks; and command 
bayespredict is used for Bayesian prediction. All but the last two of these 
commands are also available after the bayes prefix. 
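
Two of these postestimation commands are sketched below. The sketch assumes the mus229acs data and the regression of section 29.2.2; the choice of parameter and the lower bound of zero are illustrative only.

```stata
* Sketch: diagnostics and interval testing after a Bayesian fit
* (assumes mus229acs is available in the working directory)
quietly use mus229acs, clear
quietly keep if _n <= 100
quietly bayes, rseed(10101): regress lnearnings education age
* Trace, autocorrelation, histogram, and density plots for one parameter
bayesgraph diagnostics {lnearnings:education}
* Posterior probability that the education coefficient exceeds zero
bayestest interval {lnearnings:education}, lower(0)
```

The interval test reports the share of posterior draws satisfying the stated restriction, which is interpreted directly as a posterior probability.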


29.4 An i.i.d. example 


As an example, consider analysis of i.i.d. normally distributed data with 
mean $\mu$ and known variance of 100. We consider Bayesian inference on $\mu$ 
given a normal prior for $\mu$. The analysis is quite detailed and illustrates Stata 
Bayesian commands that extend directly to multiple regression. 


29.4.1 MLE 


Suppose $y_i|\mu \sim N[\mu, 100]$. We generate a sample of 50 observations that 
turns out to have sample mean 10.72 and standard deviation 10.90. 


. * Generate a sample of 50 observations on y 
. clear 
. qui set obs 50 
. set seed 10101 
. gen y = rnormal(10,10) 
. summarize 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
           y |         50    10.71767    10.90006  -21.70636    38.1053 


In this example, the MLE is the sample mean, so $\hat{\mu} = \bar{y} \approx 10.72$, with 
standard error $s/\sqrt{N} = 10.90/\sqrt{50} \approx 1.54$. Command mean yields 


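. * The MLE is the sample mean 
. mean y 

Mean estimation                             Number of obs = 50 

-------------------------------------------------------------- 
             |       Mean   Std. err.     [95% conf. interval] 
-------------+------------------------------------------------ 
           y |   10.71767   1.541502       7.61991    13.81544 
-------------------------------------------------------------- 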


The resulting 95% confidence interval for $\mu$ is [7.62, 13.82]. This interval is 
interpreted as meaning that if we were to repeat this estimation procedure 
many times on many independent samples, then 95% of the time the 
resulting confidence interval will include the unknown constant $\mu$. 
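

The arithmetic behind the reported standard error can be checked directly. The following Python sketch is ours, not part of the Stata session; the variable names are illustrative and the inputs are simply the values printed by summarize and mean above. It computes $s/\sqrt{N}$ and backs out the critical value implied by the reported interval.

```python
import math

# Values copied from the summarize and mean output above.
n_obs = 50
sample_sd = 10.90006
ci_lower, ci_upper = 7.61991, 13.81544

# Standard error of the sample mean: s / sqrt(N).
std_err = sample_sd / math.sqrt(n_obs)

# Half-width of the reported interval divided by the standard error
# recovers the critical value that was used to build the interval.
implied_critical_value = (ci_upper - ci_lower) / 2 / std_err

print(round(std_err, 4), round(implied_critical_value, 3))
```

The implied critical value is about 2.01 rather than 1.96, which is what one would expect if the interval is based on the t distribution with 49 degrees of freedom rather than on the standard normal.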


29.4.2 Bayesian analysis 


We use command bayesmh with option likelihood() used to specify the 
likelihood and option prior() used to specify the prior. 


Here we specify the prior to be $\mu \sim N[5, 4]$. 


. * Bayesian posterior for mu with normal y and N(5,4) prior for mu 
. bayesmh y, likelihood(normal(100)) prior({y:_cons}, normal(5,4)) 
> rseed(10101) saving(mcmcdraws_iid, replace) 


Burn-in ... 
Simulation ... 


Model summary 


Likelihood: 
y ~ normal({y:_cons},100) 


Prior: 
{y:_cons} ~ normal(5,4) 


Bayesian normal regression                       MCMC iterations  =     12,500 
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500 
                                                 MCMC sample size =     10,000 
                                                 Number of obs    =         50 
                                                 Acceptance rate  =      .4332 
                                                 Efficiency       =      .2282 
Log marginal-likelihood = -193.45168 

                                                          Equal-tailed 
       y |      Mean   Std. dev.     MCSE     Median  [95% cred. interval] 
---------+---------------------------------------------------------------- 
   _cons |  8.797346   1.162716   .02434   8.832482   6.502072    11.0129 


file mcmcdraws_iid.dta not found; file saved. 


The results are obtained using random-walk MH sampling. This is the MH 
algorithm with proposal density for $\mu_s^*$ the $N(\mu_{s-1}, \sigma^2)$ distribution, where 
the complicated rule to periodically update the proposal variance $\sigma^2$ is 
detailed in [BAYES] bayesmh. 


The Stata default is to discard the initial 2,500 "burn-in" draws of $\mu$ 
because the MH algorithm takes time to converge to the posterior 
distribution. The Stata default is to retain the subsequent 10,000 draws; these 
are viewed as (correlated) draws from the posterior. 


The acceptance rate is 0.4332, meaning that 4,332 of the 10,000 draws 
accept $\mu_s = \mu_s^*$ and so are different from the immediately preceding draw, 
while for the remaining 5,668 draws $\mu_s = \mu_{s-1}$. If the acceptance rate is 
close to zero, then few draws are accepted, so only a small part of the 
posterior has been searched. If the acceptance rate is close to one, then we 
are drawing too often from the proposal distribution, not the posterior 
distribution. For the random-walk MH algorithm, studies suggest that the 
optimal acceptance rate is a bit below 0.5 for the univariate parameter case 
and a bit below 0.25 in the multiparameter case. So an acceptance rate of 
0.4332 is very good. 


The more important efficiency statistic of 0.2282 means that the 10,000 
correlated draws provide the same information content as 2,282 independent 
draws from the posterior. This loss in efficiency arises both because only 
4,332 draws are accepted and because even these accepted draws are 
correlated. An efficiency of 0.23 is regarded as very good. 
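

The mechanics of random-walk MH sampling for this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Stata's implementation: the proposal standard deviation is held fixed at a value we chose (2.5) rather than adapted periodically as in bayesmh, the function names are ours, and the likelihood is evaluated through the sample mean alone, which is sufficient here because the variance is known. The draws will therefore not reproduce the bayesmh output exactly.

```python
import math
import random


def log_posterior_kernel(mu, ybar=10.71767, n_obs=50, sigma2=100.0,
                         prior_mean=5.0, prior_var=4.0):
    """Log posterior of mu up to an additive constant."""
    # With known variance, the normal likelihood depends on the data
    # only through the sample mean and the sample size.
    log_likelihood = -n_obs * (ybar - mu) ** 2 / (2.0 * sigma2)
    log_prior = -(mu - prior_mean) ** 2 / (2.0 * prior_var)
    return log_likelihood + log_prior


def random_walk_mh(n_draws=10000, burn_in=2500, proposal_sd=2.5,
                   start=0.0, seed=10101):
    """Return the retained draws and the acceptance rate among them."""
    rng = random.Random(seed)
    mu = start
    log_p = log_posterior_kernel(mu)
    draws, accepted = [], 0
    for s in range(burn_in + n_draws):
        # Candidate is drawn from N(mu_{s-1}, proposal_sd^2).
        candidate = rng.gauss(mu, proposal_sd)
        log_p_candidate = log_posterior_kernel(candidate)
        # Accept with probability min(1, posterior ratio). 1 - random()
        # lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < log_p_candidate - log_p:
            mu, log_p = candidate, log_p_candidate
            accepted += s >= burn_in
        # A rejected candidate leaves mu unchanged, producing a repeat.
        if s >= burn_in:
            draws.append(mu)
    return draws, accepted / n_draws


draws, acceptance_rate = random_walk_mh()
posterior_mean = sum(draws) / len(draws)
print(round(posterior_mean, 2), round(acceptance_rate, 2))
```

Working on the log scale avoids numerical underflow in the posterior ratio, and counting acceptances only after burn-in mirrors how the acceptance rate is described above.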


From the results section of the Stata output, the posterior mean is 8.80 
and the posterior standard deviation is 1.16. The posterior mean lies between 
the sample MLE $\hat{\mu} = \bar{y} \approx 10.72$ and the prior mean of $\mu$, which equaled 5. 
The posterior standard deviation of 1.16 is less than the standard error of 
$\hat{\mu} \approx 1.54$. 


This result accords with intuition. The posterior mean of a parameter lies 
between the sample MLE and the prior mean, and prior information reduces 
variability so that the posterior variance of the parameter is less than the 
variance of the MLE of the parameter. In some standard models, this result 
always holds; see section 29.4.8 for analytical results for this example. 


29.4.3 Bayesian inference 


The equal-tailed Bayesian credible region, also given in the preceding Stata 
output, lies between the 2.5 and 97.5 percentiles of the 10,000 MCMC 
posterior draws of $\mu$. It is directly interpreted as saying that $\mu$ lies between 
6.50 and 11.01 with probability 0.95. This interval is narrower than the 95% 
confidence interval for the MLE of [7.62, 13.82] because prior information 
reduces variability. And, as already noted, the interpretation of the two 
intervals is quite different. 


The posterior probability that $\mu$ lies in a certain range is simply the 
fraction of the 10,000 MCMC draws that lie in the given range. 


For example, command bayestest interval finds that the probability 
that $\mu$ exceeds 10 equals 0.1423. 


. * Bayesian hypothesis test: Pr[mu > 10] 
. bayestest interval {y:_cons}, lower(10) 

Interval tests     MCMC sample size =    10,000 

       prob1 : {y:_cons} > 10 

             |      Mean    Std. dev.      MCSE 
-------------+--------------------------------- 
       prob1 |     .1423     0.34938   .0066141 
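

Both calculations, percentiles for the credible region and a simple fraction for the interval probability, are easy to express in code. The Python sketch below is illustrative only: the function names are ours, the percentile convention is a simple one chosen for brevity, and the draws are independent stand-in draws from a normal distribution with mean 8.81 and standard deviation 1.155 (roughly the posterior in this example), not the saved MCMC draws. The resulting numbers will therefore be close to, but not equal to, those reported by Stata.

```python
import random


def equal_tailed_interval(draws, level=0.95):
    """Lower and upper percentiles that leave equal mass in each tail."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    lower_index = int(tail * len(ordered))
    upper_index = int((1.0 - tail) * len(ordered)) - 1
    return ordered[lower_index], ordered[upper_index]


def posterior_probability_above(draws, threshold):
    """Fraction of draws that exceed the threshold."""
    return sum(d > threshold for d in draws) / len(draws)


# Stand-in draws (an assumption): independent N(8.81, 1.155^2) values
# used in place of the 10,000 MCMC draws saved by bayesmh.
rng = random.Random(10101)
stand_in_draws = [rng.gauss(8.81, 1.155) for _ in range(10000)]

prob_above_10 = posterior_probability_above(stand_in_draws, 10.0)
lower, upper = equal_tailed_interval(stand_in_draws)
print(round(prob_above_10, 3), round(lower, 2), round(upper, 2))
```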


Inference on transformations of the parameters is also straightforward. 
For example, the posterior distribution of $\mu^2$ is obtained by simply squaring 
each of the 10,000 MCMC draws of $\mu$. This is done using command 
bayesstats summary. 


. * Bayesian statistics for transformation of parameter mu 
. bayesstats summary ({y:_cons}^2) 

Posterior summary statistics                      MCMC sample size =    10,000 

       expr1 : {y:_cons}^2 

                                                             Equal-tailed 
         |      Mean   Std. dev.      MCSE     Median  [95% cred. interval] 
---------+----------------------------------------------------------------- 
   expr1 |  78.74506   20.48997   .426352   78.01275   42.27698    121.284 


In this example, all draws of $\mu$ are positive. It follows that the qth 
percentile of $\mu^2$ equals the square of the qth percentile of $\mu$. For example, the 
median of $\mu$ is 8.83248, and the median of $\mu^2$ is $8.83248^2 = 78.0128$. 
Similarly, given that the 95% credible region for $\mu$ is equal to 
[6.5021, 11.0129], the 95% credible region for $\mu^2$ is equal to 
$[6.5021^2, 11.0129^2]$ or [42.28, 121.28]. 
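

The percentile argument can be verified numerically. The short Python sketch below (variable names are ours) squares the reported median and credible bounds for the parameter and then uses a small made-up set of mixed-sign values to show why positivity of the draws matters: when some values are negative, squaring is no longer monotone and the median of the squares need not equal the square of the median.

```python
import statistics

# Posterior summaries copied from the bayesmh output for {y:_cons}.
median_mu = 8.832482
ci_mu = (6.502072, 11.0129)

# All draws are positive, so squaring preserves their ordering and
# percentiles of mu^2 are squares of the percentiles of mu.
median_mu_sq = median_mu ** 2
ci_mu_sq = tuple(bound ** 2 for bound in ci_mu)
print(round(median_mu_sq, 4), [round(b, 4) for b in ci_mu_sq])

# Hypothetical mixed-sign values: the shortcut fails.
mixed = [-3.0, -1.0, 2.0]
square_of_median = statistics.median(mixed) ** 2
median_of_squares = statistics.median([v ** 2 for v in mixed])
print(square_of_median, median_of_squares)
```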


Note that if instead we used the MLE, then we would use the delta 
method, which relies on asymptotic theory. Because $\partial\mu^2/\partial\mu = 2\mu$, and 
$\hat{\mu} = \bar{y}$, the standard error of $\hat{\mu}^2 = \bar{y}^2 = 10.718^2 = 114.88$ equals 
$2\bar{y} \times \mathrm{se}(\bar{y}) = 2 \times 10.718 \times 1.541 = 33.04$, and an asymptotic 95% 
confidence interval for $\mu^2$ is $114.88 \pm 1.96 \times 33.04 = [50.12, 179.64]$. 
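

The delta-method arithmetic can be reproduced as follows. This Python sketch is ours and uses the rounded inputs quoted in the text (10.718 and 1.541), so the results agree with the reported figures only up to rounding.

```python
# Rounded inputs from the text: sample mean and its standard error.
ybar = 10.718
se_ybar = 1.541

# Point estimate of mu^2 and its delta-method standard error:
# the derivative of mu^2 with respect to mu is 2*mu, evaluated at ybar.
estimate = ybar ** 2
se_delta = 2 * ybar * se_ybar

# Asymptotic 95% confidence interval using the normal critical value.
half_width = 1.96 * se_delta
ci_delta = (estimate - half_width, estimate + half_width)
print(round(estimate, 2), round(se_delta, 2),
      [round(b, 2) for b in ci_delta])
```

This interval is centered on 114.88, well above the posterior mean of 78.75 for the squared parameter reported by bayesstats summary, because the MLE ignores the prior information that pulls the posterior toward 5.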


29.4.4 MCMC diagnostics 


It is absolutely essential to gauge whether the Markov chain has converged 
and to measure the computational efficiency of the MH algorithm. However, 
there is no formal test that confirms that the algorithm has converged. 


Several useful graphs can be obtained using command bayesgraph 
diagnostics. We obtain 


. * Diagnostic plots for MH posterior draws 
. bayesgraph diagnostics {y:_cons}, scale(1.1) 


Figure 29.2. Diagnostics for 10,000 MH posterior draws of $\mu$ 


The top left panel of figure 29.2 plots the 10,000 sequential MH draws of 
$\mu$. Red flags include a trend or cycle in the draws and little change from 
draw to draw; there is no problem here. 


The top right panel of figure 29.2 gives a histogram of the draws. Care is 
needed if this is multimodal because it may indicate that the chain has not 
converged but is instead getting trapped in subareas of the posterior. 


The bottom left panel of figure 29.2 shows the autocorrelations of the 
draws. These should die out quickly, the case here. The efficiency statistic of 
0.2282 given in the bayesmh output equals $1/(1 + 2\sum_{j=1}^{\max}\hat{\rho}_j)$, where $\hat{\rho}_j$ 
is the jth autocorrelation of the posterior draws and max is the minimum of 
500 and the lag at which $|\hat{\rho}_j| < 0.01$. 


If the posterior draws were independent, then the Monte Carlo 
simulation error, reported in output as MCSE, would equal the posterior 
standard deviation $s_\theta$ divided by $\sqrt{S}$, where $S$ is the number of MCMC draws. 
In fact, the draws are correlated, and the larger measure $s_\theta/\sqrt{S \times \mathrm{eff}}$ is used, 
where eff is the efficiency statistic. 
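

A short Python sketch makes the link between autocorrelation, efficiency, and MCSE concrete. The function names are ours, and the exact treatment of the cutoff lag (whether the first autocorrelation below the tolerance is included in the sum) is an assumption; see [BAYES] bayesstats ess for Stata's precise rule. The autocorrelations used in the second part are hypothetical values that decay geometrically, not those of the chain above.

```python
import math


def efficiency_from_autocorrelations(rho, max_lag=500, tol=0.01):
    """Efficiency 1 / (1 + 2 * sum of autocorrelations).

    rho[k-1] is the lag-k autocorrelation. The sum stops at max_lag or
    at the first lag whose autocorrelation is below tol in absolute
    value (that lag is included here; this convention is an assumption).
    """
    total = 0.0
    for r in rho[:max_lag]:
        total += r
        if abs(r) < tol:
            break
    return 1.0 / (1.0 + 2.0 * total)


def mcse(posterior_sd, n_draws, efficiency):
    """Monte Carlo standard error adjusted for correlated draws."""
    return posterior_sd / math.sqrt(n_draws * efficiency)


# Values from the bayesmh output for the i.i.d. example.
mcse_iid_example = mcse(1.162716, 10000, 0.2282)

# Hypothetical autocorrelations rho_k = 0.6^k for illustration.
ar1_rho = [0.6 ** k for k in range(1, 501)]
ar1_efficiency = efficiency_from_autocorrelations(ar1_rho)
print(round(mcse_iid_example, 5), round(ar1_efficiency, 3))
```

The first calculation recovers the MCSE of 0.02434 reported by bayesmh. The second shows that a lag-one autocorrelation of about 0.6 with geometric decay already reduces efficiency to roughly one quarter.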


The bottom right panel of figure 29.2 shows the kernel density estimate 
for all 10,000 draws, a smoothed version of the histogram in the top right 
panel. If the chain has converged, then the dashed kernel density estimate for 
the first 5,000 draws should be very similar to the dotted estimate for the 
second 5,000 draws. At the same time, however, similar density estimates 
for the two halves do not guarantee convergence. 


All four panels suggest that the chain has converged and is relatively 
efficient. 


29.4.5 MCMC diagnostics using different chains 


A useful diagnostic method for chain convergence is to use multiple chains. 
If the posterior differs substantially across chains, then the chain has not 
converged. The test can be made more stringent by using quite different, and 
potentially quite extreme, starting values for each chain. 


The nchains() option of the bayesmh command or the bayes prefix 
enables estimation of multiple chains. 


Before using the nchains() option, we manually fit 4 distinct chains 
with initial values that are random draws from $N(10, 6^2)$ that should give a 
wide range of starting values because the MLE for $\mu$ was 10.72 with standard 
error 1.54. 


. * Manually obtain posterior means and st. devs for four different chains 
. set seed 10101 

. forvalues i=1/4 { 
  2.     local start = rnormal(10,6^2) 
  3.     quietly bayesmh y, likelihood(normal(100)) 
>         prior({y:_cons}, normal(5,4)) 
>         initial({y:_cons} `start') 
  4.     matrix pstart = (nullmat(pstart) \ `start') 
  5.     matrix pmeans = (nullmat(pmeans) \ e(mean)) 
  6.     matrix psds = (nullmat(psds) \ e(sd)) 
  7. } 


. matrix p_all = pstart,pmeans,psds 
. matrix list p_all, title("Start value, post. means and st. devs. for 4 chains") 


p_all[4,3]: Start value, post. means and st. devs. for 4 chains 
                      y:          y: 
            c1        _cons       _cons 
r1   24.119756    8.8182371   1.1682963 
r2   7.6265723    8.8136099   1.1483331 
r3    11.02609    8.8414756   1.1638707 
r4  -23.910821    8.8207744   1.1702667 


The four starting values are very different but lead to very similar posterior 
means and standard deviations for the parameter. The function nullmat() is 
used because the matrix pstart, for example, is not previously defined at the 
first pass through the loop. 


The following code computes both the between variation (across the 4 
starting values) and the within variation (across the 10,000 MH draws for a 
given starting value) for the posterior mean. 


. * Compute within and between variation across the 4 MCMC runs 
. mata 
------------------------------------------------- mata (type end to exit) ----- 
: pmeans = st_matrix("pmeans") 

: psd = st_matrix("psds") 

: W = (psd'psd)/4 

: meanpmeans = mean(pmeans) 

: B = (pmeans-meanpmeans*J(4,1,1))'(pmeans-meanpmeans*J(4,1,1))/(4-1) 

: // (1) Between, (2) within, (3) Between/total, (4) Total / Within 
: (B, W, B/(B+W), (B+W)/W) 
                 1             2             3             4 
    +---------------------------------------------------------+ 
  1 |  .0001520204   1.351926088   .0001124346   1.000112447  | 
    +---------------------------------------------------------+ 

: end 


The variation between the four MH runs is very small relative to the within 
variation, as desired. Similar to the other diagnostics, failing this test signals 
nonconvergence of the MH algorithm, but passing the test does not guarantee 
convergence of the chain. 
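

The between and within calculations are simple enough to replicate outside Mata. The following Python sketch (variable names are ours) takes the four posterior means and standard deviations listed in matrix p_all and recomputes the four reported quantities.

```python
# Posterior means and standard deviations of the four manually fitted
# chains, copied from the second and third columns of matrix p_all.
chain_means = [8.8182371, 8.8136099, 8.8414756, 8.8207744]
chain_sds = [1.1682963, 1.1483331, 1.1638707, 1.1702667]

m = len(chain_means)
grand_mean = sum(chain_means) / m

# Between variation: sample variance of the chain means.
between = sum((x - grand_mean) ** 2 for x in chain_means) / (m - 1)

# Within variation: average of the chain posterior variances.
within = sum(sd ** 2 for sd in chain_sds) / m

between_share = between / (between + within)
total_to_within = (between + within) / within
print(between, within, between_share, total_to_within)
```

A ratio of total to within variation this close to one indicates that the starting value has essentially no influence on the estimated posterior mean.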


The nchains() option of the bayesmh command or bayes prefix 
automates the preceding process. Using four chains, we obtain 


. * Four different chains using nchains() option 
. bayesmh y, nchains(4) likelihood(normal(100)) rseed(10101) 
> prior({y:_cons}, normal(5,4)) init1({y:_cons} 24) 
> init2({y:_cons} 7) init3({y:_cons} 11) 
> init4({y:_cons} -23) 
> initsummary nomodelsummary 

Chain 1 
  Burn-in ... 
  Simulation ... 
Chain 2 
  Burn-in ... 
  Simulation ... 
Chain 3 
  Burn-in ... 
  Simulation ... 
Chain 4 
  Burn-in ... 
  Simulation ... 

Initial values: 
Chain 1: {y:_cons} 24 
Chain 2: {y:_cons} 7 
Chain 3: {y:_cons} 11 
Chain 4: {y:_cons} -23 

Bayesian normal regression                    Number of chains    =          4 
Random-walk Metropolis-Hastings sampling      Per MCMC chain: 
                                                  Iterations      =     12,500 
                                                  Burn-in         =      2,500 
                                                  Sample size     =     10,000 
                                              Number of obs       =         50 
                                              Avg acceptance rate =      .4383 
                                              Avg efficiency      =      .2314 
Avg log marginal-likelihood = -193.45686      Max Gelman-Rubin Rc =          1 

                                                            Equal-tailed 
       y |      Mean   Std. dev.      MCSE     Median  [95% cred. interval] 
---------+----------------------------------------------------------------- 
   _cons |  8.812606   1.156895   .012025   8.816636   6.540255   11.07231 


Note that the rseed() option used here ensures that a different seed is used 
for each chain. The default uses parameter starting values that do not vary 
greatly across each chain. Here the init1() to init4() options are used to 
specify a wider range of starting values. 


The output includes the Gelman-Rubin Rc statistic. Let $\bar{\theta}_j$ and $s_j^2$ be the 
posterior mean and variance in the jth chain, $\bar{\theta} = (1/m)\sum_{j=1}^m \bar{\theta}_j$, 
$B = \{1/(m-1)\}\sum_{j=1}^m (\bar{\theta}_j - \bar{\theta})^2$ measure the variation between chains, and 
$W = (1/m)\sum_{j=1}^m s_j^2$ denote the average variation within each chain. A crude 
measure of chain convergence is small between variation relative to within 
variation, in which case $(W + B)/W$ is close to one. The Rc statistic is a 
refined version of this measure that adjusts for a finite number of chains. A 
common threshold for chain convergence is that Rc < 1.1. Because Rc = 1, 
upon rounding, we conclude the chain has converged. 


The preceding output includes the maximum value across all parameters 
of the Rc statistic. The bayesstats grubin command provides the Rc 
statistic for each parameter. 


. bayesstats grubin 


Gelman-Rubin convergence diagnostic 


Number of chains = 4 
MCMC size, per chain = 10,000 
Max Gelman-Rubin Rc = 1.000105 


_cons 1.000105 


Convergence rule: Rc < 1.1 


Here there is only one parameter and Rc = 1.000105, which equals 1 when 
rounded. 


29.4.6 Sensitivity analysis 


One should also check the sensitivity of results to the specification of the 
prior, where the priors should be consistent with prior information and 
potentially use different distributions. 


As an example, we consider a $N(4, 8)$ prior and a $\chi^2(4)$ prior. Both 
priors have mean 4 and variance 8 but differ because the chi-squared 
distribution is asymmetric and takes positive values only. We have 


. * Compare posterior for two different priors 
. quietly bayesmh y, likelihood(normal(100)) prior({y:_cons}, normal(4,8)) 
> rseed(10101) 

. matrix pmeans = e(mean) 
. matrix psds = e(sd) 

. qui bayesmh y, likelihood(normal(100)) prior({y:_cons}, chi2(4)) 
> rseed(10101) 

. matrix pmeans = pmeans \ e(mean) 
. matrix psds = psds \ e(sd) 
. matrix p_all = pmeans,psds 
. matrix list p_all, title("Post. means and st. devs. for 2 different priors") 

p_all[2,2]: Post. means and st. devs. for 2 different priors 
              y:          y: 
           _cons       _cons 
Mean   9.3516909   1.2641803 
Mean   9.9058014   1.4079165 


The posterior means are within half a posterior standard deviation of each 
other. 


29.4.7 Further analysis of the draws 


The results of the original use of the bayesmh command in this section were 
saved in file mcmcdraws_iid.dta. This file contains the 4,332 unique draws 
of $\mu_s$, stored in variable eq1_p1 (for equation 1 parameter 1), along with the 
number of consecutive times that each unique draw was obtained, stored in 
variable _frequency. 


. * Summarize the unique retained draws 
. use mcmcdraws_iid, clear 

. summarize 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
      _chain |      4,332           1           0          1          1 
      _index |      4,332    5018.283    2898.168          1       9998 
_loglikeli~d |      4,332   -191.4893     1.30697  -199.2417   -190.185 
_logposter~r |      4,332   -195.1071    .7662066  -201.9883  -194.5214 
      eq1_p1 |      4,332    8.805722    1.249882   4.698787   13.27406 
-------------+--------------------------------------------------------- 
  _frequency |      4,332    2.308403    1.771655          1         14 


The mean and standard deviation of variable eq1_p1 differ from the 
previously reported posterior mean and posterior standard deviation of, 
respectively, 8.797 and 1.163. This is because the posterior mean and 
standard deviation given in the bayesmh output are computed using 
frequency weights for the repeated values. 


All 10,000 MCMC draws, including the repeated draws, can be obtained 
by expanding the dataset and sorting by variable _index. Summarizing the 
complete dataset yields 


. * Expand to get the 10,000 MH draws, including repeated draws 
. expand _frequency 
(5,668 observations created) 

. sort _index 
. gen s = _n 
. summarize eq1_p1 

    Variable |        Obs        Mean    Std. dev.       Min        Max 
-------------+--------------------------------------------------------- 
      eq1_p1 |     10,000    8.797346    1.162716   4.698787   13.27406 


The variable eq1_p1 contains the 10,000 draws of $\mu$, including repeated 
values. This variable has mean 8.797346 and standard deviation 1.162716, 
which exactly equal the posterior mean and standard deviation in the output 
given after command bayesmh. 
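

The equivalence between expanding the dataset and using frequency weights can be seen in a toy example. The Python sketch below uses three made-up unique values with made-up repeat counts (not the saved draws) and a helper function of our own naming; it shows that the frequency-weighted mean and standard deviation match those of the expanded list, while the unweighted mean of the unique values does not.

```python
import statistics


def weighted_mean_and_sd(values, frequencies):
    """Mean and sample standard deviation using frequency weights."""
    n = sum(frequencies)
    mean = sum(v * f for v, f in zip(values, frequencies)) / n
    ss = sum(f * (v - mean) ** 2 for v, f in zip(values, frequencies))
    return mean, (ss / (n - 1)) ** 0.5


# Hypothetical unique draws and the number of times each is repeated.
unique_draws = [1.0, 2.0, 4.0]
frequency = [2, 1, 3]

# Equivalent of Stata's expand: repeat each value frequency times.
expanded = [v for v, f in zip(unique_draws, frequency) for _ in range(f)]

w_mean, w_sd = weighted_mean_and_sd(unique_draws, frequency)
e_mean, e_sd = statistics.mean(expanded), statistics.stdev(expanded)
unweighted_mean = statistics.mean(unique_draws)
print(w_mean, e_mean, unweighted_mean)
```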


Given all 10,000 draws of $\mu$, we can reproduce the various graphs 
obtained earlier using command bayesgraph. For example, the following 
code obtains the trace for just the first 50 MCMC draws. 


. * Graph the first 50 draws of mu 
. quietly tsset s 

. tsline eq1_p1 if s < 50, scale(1.5) ytitle("Parameter mu") 
> xtitle("MCMC posterior draw number") 


The left panel of figure 29.3 shows the first 50 posterior draws of $\mu$. The 
flat sections are cases where the draw $\mu_s^*$ was not accepted and instead 
$\mu_s = \mu_{s-1}$. 

Figure 29.3. Diagnostics for 10,000 MH posterior draws of mu 


The right panel of figure 29.3, produced by Stata code not given here, 
plots the normal likelihood, the normal prior density, and the kernel density 
estimate of the posterior. It is clear that the posterior mean lies between the 
sample mean and the prior mean and that the posterior variance is less than 
either the sample or prior variance. 


29.4.8 Analytical results 


In the current example, it is actually possible to obtain a tractable solution 
for the posterior, in which case the MCMC methods are not necessary. 


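Suppose $y_i|\mu \sim N(\mu, \sigma^2)$, where $\sigma^2$ is known. Then for 
$\mathbf{y} = (y_1, \ldots, y_N)$ and independent data, the likelihood is 
$L(\mathbf{y}|\mu) = \{1/(2\pi\sigma^2)\}^{N/2} \exp[-\{1/(2\sigma^2)\}\sum_{i=1}^N (y_i - \mu)^2]$. And suppose 
the prior for $\mu$ is $\mu \sim N(\underline{\mu}, \underline{s}^2)$, where values for $\underline{\mu}$ and $\underline{s}^2$ are specified and 
an underscore is used to denote the values of parameters in the prior. Then 
the prior density is $\pi(\mu) = (1/\sqrt{2\pi\underline{s}^2}) \exp[-\{1/(2\underline{s}^2)\}(\mu - \underline{\mu})^2]$. 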


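Some very considerable algebra given in Cameron and Trivedi (2005, 
chap. 13.2.2), for example, shows that the posterior distribution for $\mu$ given 
$\mathbf{y}$ is then the normal $N(\overline{\mu}, \overline{s}^2)$ with 

$$\overline{s}^2 = (N/\sigma^2 + 1/\underline{s}^2)^{-1}$$ 
$$\overline{\mu} = \overline{s}^2 (N\bar{y}/\sigma^2 + \underline{\mu}/\underline{s}^2)$$ 

where an overscore is used to denote the values of parameters in the 
posterior. 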


The Bayesian terminology is to call the inverse of the variance the 
precision. So the posterior precision of $\mu$, $\overline{s}^{-2}$ defined above, is the sum of 
the sample precision of $\bar{y}$ and the prior precision of $\mu$. The posterior mean, $\overline{\mu}$ 
defined above, is a weighted average of the sample mean and the prior mean, 
where the respective weights are the sample and prior precisions. 


In our example, $\bar{y} = 10.72$, $\sigma^2 = 100$, $N = 50$, and the prior sets $\underline{\mu} = 5$ 
and $\underline{s}^2 = 4$. This yields $\overline{s}^2 = 4/3 = 1.33$ and $\overline{\mu} = 8.81$, close to the values 
$1.163^2 = 1.35$ and 8.80 obtained from the MH draws from the posterior. 
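

The precision-weighting result is easy to compute directly. The Python sketch below is a minimal illustration with a function name of our own; it implements the statement that the posterior precision is the sum of the sample and prior precisions and that the posterior mean is the corresponding precision-weighted average. It then repeats the calculation for hypothetical larger sample sizes, holding the sample mean fixed at 10.72, to show the posterior mean moving toward the sample mean.

```python
def normal_posterior(ybar, n_obs, sigma2, prior_mean, prior_var):
    """Posterior mean and variance of a normal mean with known variance."""
    sample_precision = n_obs / sigma2      # precision of the sample mean
    prior_precision = 1.0 / prior_var
    post_var = 1.0 / (sample_precision + prior_precision)
    post_mean = post_var * (sample_precision * ybar
                            + prior_precision * prior_mean)
    return post_mean, post_var


# Values used in the text: ybar = 10.72, N = 50, sigma^2 = 100, N(5, 4) prior.
post_mean, post_var = normal_posterior(10.72, 50, 100.0, 5.0, 4.0)
print(round(post_mean, 2), round(post_var, 2))

# Hypothetical larger samples with the same sample mean.
for n_obs in (50, 500, 5000):
    mean_n, var_n = normal_posterior(10.72, n_obs, 100.0, 5.0, 4.0)
    print(n_obs, round(mean_n, 3), round(var_n, 4))
```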


In general, as the sample gets large, the information from the sample 
increases, while the information from the prior is unchanging, so the 
Bayesian posterior mean for $\mu$ will collapse on $\bar{y}$, the MLE for $\mu$. For this 
example, this is clear from rewriting the posterior variance as 
$\overline{s}^2 = (\sigma^2/N)/\{1 + (\sigma^2/N\underline{s}^2)\} \to \sigma^2/N$ as $N \to \infty$ and 
$\overline{\mu} = \{\bar{y} + (\sigma^2\underline{\mu}/N\underline{s}^2)\}/\{1 + (\sigma^2/N\underline{s}^2)\} \to \bar{y}$ as $N \to \infty$. 


By the same algebra, it is clear that as the prior variance gets large, so 
$\underline{s}^2 \to \infty$, we again obtain $\overline{s}^2 \to (\sigma^2/N)$ and $\overline{\mu} \to \bar{y}$. With a weak prior, the 
Bayesian posterior mean goes to the MLE. This example illustrates the more 
general result that for Bayesian analysis asymptotically (as sample size goes 
to infinity), the likelihood dominates, and the prior has little impact on the 
posterior. 


29.5 Linear regression 


We now consider extension to linear regression under normality. A brief 
review is given of results given in many Bayesian texts and in Cameron and 
Trivedi (2005, chap. 13). 


29.5.1 Normal regression with variance known 


The preceding normal-normal result for i.i.d. data extends to the linear 
regression under normality with normal prior for $\beta$ and $\mathrm{Var}(y|X) = \sigma^2 I$ 
and $\sigma^2$ known. Then the likelihood is $y|\beta, X \sim N(X\beta, \sigma^2 I)$, the prior is 
$\beta \sim N(\underline{\beta}, \underline{V})$, and the posterior is $\beta|y, X \sim N(\overline{\beta}, \overline{V})$, where $\overline{\beta}$ is a 
matrix-weighted sum of the MLE (equal to OLS) and the prior mean and 
$\overline{V}$ is the inverse of the sum of the precision of the MLE and the prior 
precision. 


Specifically, the posterior variance and mean of $\beta$ are, respectively, 

$$\overline{V} = \{\sigma^{-2}(X'X) + \underline{V}^{-1}\}^{-1}$$ 
$$\overline{\beta} = \overline{V}\{\sigma^{-2}(X'X)\hat{\beta}_{OLS} + \underline{V}^{-1}\underline{\beta}\}$$ 

This result is of little practical use because in general $\sigma^2$ is unknown. 
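

As a consistency check, the regression result can be specialized to the intercept-only model, where it should collapse to the i.i.d. result of section 29.4.8. The derivation below is our own sketch; it assumes $X = \iota$ (an $N \times 1$ vector of ones), a scalar prior variance $\underline{V} = \underline{s}^2$, and prior mean $\underline{\beta} = \underline{\mu}$.

```latex
\begin{align*}
X'X &= \iota'\iota = N, \qquad
\hat{\beta}_{OLS} = (\iota'\iota)^{-1}\iota' y = \bar{y} \\
\overline{V} &= \{\sigma^{-2}(X'X) + \underline{V}^{-1}\}^{-1}
  = (N/\sigma^2 + 1/\underline{s}^2)^{-1} \\
\overline{\beta} &= \overline{V}\{\sigma^{-2}(X'X)\hat{\beta}_{OLS}
  + \underline{V}^{-1}\underline{\beta}\}
  = \overline{V}\,(N\bar{y}/\sigma^2 + \underline{\mu}/\underline{s}^2)
\end{align*}
```

These are the posterior variance and mean of the normal mean with known variance: the posterior precision is the sum of the sample and prior precisions, and the posterior mean is the precision-weighted average of $\bar{y}$ and $\underline{\mu}$.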


29.5.2 Prior distributions for $\beta$ and $\sigma^2$ 


In the more usual case of unknown error variance $\sigma^2$, and possibly even 
unknown error variance matrix $\Sigma$, there are several possible choices for the 
priors for $\beta$ and $\sigma^2$ that we now summarize. These lead to different 
posterior distributions. 


A noninformative prior (or diffuse prior or weak prior) is one that has 
little impact on the posterior distribution. An informative prior is one that 
does have an impact. 


The simplest choice is to choose independent flat or constant priors for 
$\beta$ and $\sigma^2$. Thus, we assume that $\pi(\beta) = 1$ and $\pi(\sigma^2) = 1$. This seems a 
poor choice because then the prior is an improper density: it does not 
integrate to one given that $\beta$ and $\sigma^2$ are unbounded. Nonetheless, it leads to 
a proper posterior density in many leading examples, due to cancellation in 
the numerator and denominator of the expression for the posterior density 
$p(\theta|y, X)$ defined in (29.1). However, care is then needed in using 
Bayesian model-selection methods, detailed in section 29.9, that are based 
on the marginal likelihood. 


A flat prior is an example of a noninformative prior. A weakness is that 
it is not invariant to parameter transformation. For example, a constant prior 
for $\sigma$ differs from a constant prior for $\sigma^2$. 


Jeffreys prior, by contrast, is a noninformative prior that is invariant to 
parameter transformation. It sets $\pi(\theta)$ proportional to $[\det\{I(\theta)\}]^{1/2}$, 
where $\det\{I(\theta)\}$ is the determinant of $I(\theta) = -E(\partial^2 \ln L/\partial\theta\partial\theta')$. For 
the linear model under normality, the Jeffreys priors are $\pi(\beta) \propto 1$ and 
$\pi(\sigma^2) \propto 1/\sigma^2$. 


Conjugate priors are priors that lead to tractable results, with posterior 
in the same family of distributions as the prior. A natural conjugate prior is 
one for which additionally the likelihood is in the same family as the prior. 
In that case, the prior can be interpreted as providing additional data. 
Natural conjugate priors exist for densities in the exponential family. 


For the normal model with $\sigma^2$ known, the natural conjugate prior for $\beta$ 
is the normal and leads to a normal posterior, a result given in the previous 
subsection. 


For the normal model with $\sigma^2$ unknown, the conjugate prior for the 
precision parameter $1/\sigma^2$ is the gamma; equivalently, the conjugate prior 
for $\sigma^2$ is the inverse-gamma. 


Stata uses the shape-scale parameterization of the inverse-gamma, 
denoted $z \sim IG[a, b]$, where $a$ is the shape parameter and $b$ is the scale 
parameter. Then $z$ has mean $b/(a - 1)$ (for $a > 1$) and variance 
$b^2/\{(a - 1)^2(a - 2)\}$ (for $a > 2$). A 
prior for $\sigma^2$ that is inverse-gamma with shape parameter $\nu_0/2$ and scale 
parameter $\nu_0\sigma_0^2/2$ can be viewed as adding $\nu_0$ extra observations with 
average error variance $\sigma_0^2$; see, for example, Gelman et al. (2013). Then the 
scale parameter is $\sigma_0^2$ times the shape parameter, and smaller values of $\nu_0$ 
correspond to a weaker prior. 
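

The mapping from "prior observations" to inverse-gamma parameters, and the resulting prior moments, can be written out as follows. This Python sketch uses function names of our own and the standard shape-scale formulas for the inverse-gamma mean and variance; the numerical inputs are hypothetical values chosen for illustration.

```python
def inverse_gamma_from_prior_obs(nu0, sigma0_sq):
    """Shape and scale implied by nu0 prior observations with
    average error variance sigma0_sq."""
    shape = nu0 / 2.0
    scale = nu0 * sigma0_sq / 2.0
    return shape, scale


def inverse_gamma_moments(shape, scale):
    """Mean and variance; None where the moment does not exist."""
    mean = scale / (shape - 1.0) if shape > 1.0 else None
    variance = (scale ** 2 / ((shape - 1.0) ** 2 * (shape - 2.0))
                if shape > 2.0 else None)
    return mean, variance


# Hypothetical example: six prior observations, error variance 0.5.
shape_6, scale_6 = inverse_gamma_from_prior_obs(6, 0.5)
mean_6, variance_6 = inverse_gamma_moments(shape_6, scale_6)

# A much weaker prior: two prior observations with the same variance.
shape_2, scale_2 = inverse_gamma_from_prior_obs(2, 0.5)
mean_2, variance_2 = inverse_gamma_moments(shape_2, scale_2)
print((shape_6, scale_6, mean_6, variance_6),
      (shape_2, scale_2, mean_2, variance_2))
```

With only two prior observations the shape parameter equals one, so the prior mean and variance do not exist, which is one way of seeing how weak such a prior is.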


More generally, when $\mathrm{Var}(y) = \Sigma$, rather than $\sigma^2 I$, a Wishart prior for 
$\Sigma^{-1}$ can be used. This is a matrix generalization of using the gamma prior 
for $1/\sigma^2$. 


Conjugate priors are examples of informative priors. They can be made 
vague or diffuse or flat or noninformative by specifying a large variance in 
the prior. Like the flat prior, the resulting posterior is not invariant to 
reparameterization. Hierarchical priors are priors whose parameters in turn 
are random, depending on parameters for which a prior is specified. These 
are presented in section 29.8.2. 


It is clear that a wide range of prior distributions might be used, unless 
analysis is being conducted in a setting where prior beliefs suggest a 
particular functional form for the prior. A range of priors might be used to 
determine the sensitivity of results to the choice of prior. 


29.5.3 Posterior densities for normal regression with variance unknown 


Given modern MCMC methods, there is no need to restrict attention to priors 
that lead to tractable results for the posterior. Thus, we might routinely 
specify normal priors or multivariate Student t priors for parameters in 
$(-\infty, \infty)$, gamma or lognormal priors for parameters in $(0, \infty)$, and beta 
priors for parameters in $(0, 1)$, regardless of the specified likelihood. And 
flat priors, Jeffreys priors, or priors with large variances may be used for 
parameters for which there is little prior knowledge. 


At the same time, for more complicated models, it can be challenging to 
get the MH algorithm to converge, and analytical results, even for a 
subcomponent of the posterior, can greatly improve computational 
efficiency; section 29.7.1 provides an example. In such cases, known 
tractable results for the linear regression model under normality are often 
used. 


For the normal model with independent Jeffreys priors, we have 
$\pi(\beta, \sigma^2) = \pi(\beta) \times \pi(\sigma^2) \propto 1/\sigma^2$. Then $p(\beta|y, X)$, the marginal posterior 
for $\beta$, can be shown to be the multivariate Student t distribution centered at 
$\hat{\beta}_{OLS}$ with $(N - K)$ degrees of freedom and covariance matrix $s^2(X'X)^{-1}$ 
multiplied by $(N - K)/(N - K - 2)$. 


A tractable result with informative priors is obtained by specifying that 
the prior for $\beta$ given $\sigma^2$ is the normal and the prior for $\sigma^2$ is the inverse- 
gamma. Then $\pi(\beta, \sigma^2) = \pi(\beta|\sigma^2) \times \pi(\sigma^2)$. The resulting joint posterior 
for $\beta$ and $\sigma^2$ is of similar normal inverse-gamma form. The conditional 
posterior for $\beta|\sigma^2$ has mean that is a matrix-weighted average of $\hat{\beta}_{OLS}$ and 
the prior mean. The marginal posterior for $\beta$ is a multivariate Student t 
distribution centered at the posterior mean of $\beta$. 


It is simplest to assume that priors for the components of $\beta$ are 
independent, but this is rather ad hoc. Zellner's g-prior introduces 
correlation by supposing that $\beta|\sigma^2$ is normally distributed with mean $\underline{\beta}$ and 
variance matrix $g \times \sigma^2(X'X)^{-1}$, proportional to the variance matrix of the 
MLE for $\beta$. Larger specified values for $g$ lead to a weaker prior. 


29.6 A linear regression example 


We continue the linear regression for the log earnings example introduced in 
section 29.2. We suppose that interest lies in the slope coefficient for 
education and there is prior information on this parameter. For illustrative 
purposes, an informative prior is used for all model parameters. In practice, 
the Jeffreys prior or flat priors or parametric priors with large variances 
might be used for parameters for which there is little prior information. The 
sample is small, with 100 observations. Then prior information, if 
reasonably tight, should not be completely dominated by sample 
information. 


29.6.1 MLE 


We consider linear regression of log-earnings on an intercept, education, and 
age. The MLE for $\beta$ when errors are i.i.d. normal and homoskedastic is 
simply the OLS estimator. We have 


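. * MLE for the regression (same as OLS with i.i.d. errors) 
. qui use mus229acs, clear 

. quietly keep if _n <= 100 

. regress lnearnings education age 

      Source |       SS           df       MS      Number of obs   =       100 
-------------+----------------------------------   F(2, 97)        =      7.52 
       Model |  7.03491807         2  3.51745904   Prob > F        =    0.0009 
    Residual |  45.3428497        97  .467452059   R-squared       =    0.1343 
-------------+----------------------------------   Adj R-squared   =    0.1165 
       Total |  52.3777678        99  .529068361   Root MSE        =     .6837 

------------------------------------------------------------------------------ 
  lnearnings | Coefficient  Std. err.      t    P>|t|     [95% conf. interval] 
-------------+---------------------------------------------------------------- 
   education |   .0852959   .0221804     3.85   0.000     .0412739    .1293178 
         age |   .0079952   .0064063     1.25   0.215    -.0047195      .02071 
       _cons |   9.246449   .4546021    20.34   0.000      8.34419    10.14871 
------------------------------------------------------------------------------ 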


The returns to an additional year of schooling in this example are 8.5% with 
standard error 2.2%. Earnings increase by 0.8% for each additional year of 
age, after controlling for education, though the coefficient is statistically 
insignificant at the 5% level. 


29.6.2 Specifying the prior 


For Bayesian estimation, the likelihood is specified to be the normal with 
independent homoskedastic errors. This is a reasonable distributional 
assumption for log-earnings. 


As already noted, in this example, we try to use informative priors for $\beta$ 
and $\sigma^2$ wherever possible. 


For the coefficient of education, we use a prior that is consistent with 
the belief that with probability 0.95 earnings rise by between 4% and 8% for 
each additional year of education. Then an obvious prior for the education 
coefficient is the normal with mean 0.06 and standard deviation 0.01 
because the probability of being within two standard deviations of the mean 
is 0.95 for a normally distributed random variable. So the prior for $\beta_{ed}$ is 
$N(0.06, 0.01^2)$. 


For age, we might believe that with an additional year of aging, earnings 
rise by between 0% and 4% with probability 0.95, so the prior for $\beta_{age}$ is 
$N(0.02, 0.01^2)$. 


For the intercept, we really have no prior information. The simplest 
approach is to specify a flat prior (here also a Jeffreys prior). An alternative 
is a normal prior with large variance, but great care is needed to ensure the 
variance is large enough to guard against a poor choice of prior mean, yet 
not so large that the MH algorithm performs poorly. For example, a prior for 
$\beta_{intercept}$ of $N(0.00, 1.00)$ is a poor choice given that the MLE of $\beta_{intercept}$ 
was 9.25. Here the prior for $\beta_{intercept}$ is specified to be $N(10, 10^2)$. 


For the error variance $\sigma^2$, we use an inverse-gamma prior, for illustrative 
purposes, though a Jeffreys prior or a flat prior would be simpler. If we 
believe that 95% of individuals have earnings between 10,000 and 200,000 
dollars, then 95% have log-earnings between 9.20 and 12.20. Assuming 
lognormality, this suggests a standard deviation of (12.20 - 9.20)/4 = 0.75 
and hence a variance of approximately 0.5 in an intercept-only model. As 
presented in section 29.5.2, if we view the inverse-gamma prior for $\sigma^2$ as 
adding $\nu_0$ extra observations with average error variance $\sigma_0^2$, then the prior 
for $\sigma^2$ is inverse-gamma with shape parameter $\nu_0/2$ and scale parameter 
$\nu_0\sigma_0^2/2$. Here we use a weak prior and set $\nu_0/2 = 1$ and use the 
inverse-gamma with parameters 1 and 0.5. 
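

The rule of thumb used here, that a 95% belief interval spans roughly four standard deviations of a normal distribution, translates directly into prior parameters. The Python sketch below uses a helper function of our own naming to recover the prior means and variances for the education and age coefficients and the approximate variance of log-earnings.

```python
import math


def normal_prior_from_95_interval(lower, upper):
    """Mean, standard deviation, and variance of a normal prior whose
    central 95% interval is approximately [lower, upper]."""
    mean = (lower + upper) / 2.0
    sd = (upper - lower) / 4.0     # interval spans about 4 std. devs.
    return mean, sd, sd ** 2


educ_mean, educ_sd, educ_var = normal_prior_from_95_interval(0.04, 0.08)
age_mean, age_sd, age_var = normal_prior_from_95_interval(0.00, 0.04)

# Log-earnings bounds implied by earnings of 10,000 and 200,000.
log_lower, log_upper = math.log(10000), math.log(200000)
_, log_sd, log_var = normal_prior_from_95_interval(log_lower, log_upper)

print(round(educ_mean, 4), round(educ_var, 6))
print(round(age_mean, 4), round(age_var, 6))
print(round(log_lower, 2), round(log_upper, 2), round(log_sd, 2))
```

Note that bayesmh's normal() prior takes a variance as its second argument, which is why a prior standard deviation of 0.01 is entered as 0.0001.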


29.6.3 Bayesian analysis 


Command bayesmh with the aforementioned options yields the following 
results: 


. * Bayesian posterior with informative priors: Normal for b, inv gamma for s2 
. bayesmh lnearnings education age, likelihood(normal({var})) 
> prior({lnearnings:education}, normal(0.06,0.0001)) 
> prior({lnearnings:age}, normal(0.02,0.0001)) 
> prior({lnearnings:_cons}, normal(10,100)) 
> prior({var}, igamma(1,0.5)) 
> rseed(10101) saving(mcmcdraws_fullregress, replace) 
Burn-in ... 
Simulation ... 


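Model summary 
------------------------------------------------------------------------------ 
Likelihood: 
  lnearnings ~ normal(xb_lnearnings,{var}) 

Priors: 
  {lnearnings:education} ~ normal(0.06,0.0001)                             (1) 
        {lnearnings:age} ~ normal(0.02,0.0001)                             (1) 
      {lnearnings:_cons} ~ normal(10,100)                                  (1) 
                   {var} ~ igamma(1,0.5) 
------------------------------------------------------------------------------ 
(1) Parameters are elements of the linear form xb_lnearnings. 

Bayesian normal regression                       MCMC iterations  =     12,500 
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500 
                                                 MCMC sample size =     10,000 
                                                 Number of obs    =        100 
                                                 Acceptance rate  =      .1958 
                                                 Efficiency:  min =     .05397 
                                                              avg =     .06435 
Log marginal-likelihood = -111.28922                          max =     .08116 

------------------------------------------------------------------------------ 
             |                                                Equal-tailed 
             |      Mean   Std. dev.     MCSE     Median  [95% cred. interval] 
-------------+---------------------------------------------------------------- 
lnearnings   | 
   education |  .0647151   .0088105   .000379   .0647482   .0475785   .0824178 
         age |  .0108165   .0053098   .000186   .0111341   .0001859   .0210965 
       _cons |  9.404009   .2773838   .010749   9.398989   8.869306   9.980771 
-------------+---------------------------------------------------------------- 
         var |  .4810144   .0691491   .002931   .4731433   .3637152   .6401572 
------------------------------------------------------------------------------ 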


file mcmcdraws_fullregress.dta not found; file saved. 


. estimates store fullregress 


The posterior means for education and age are, respectively, 0.065 and 
0.011 compared with the ML estimates of 0.085 and 0.008. The 
corresponding posterior standard deviations are 0.0088 and 0.0053 compared 
with the ML standard errors of 0.0222 and 0.0064, so the informative prior 
for education in this example greatly improved precision. The posterior 
mean of $\sigma^2$ is 0.481 compared with $s^2 = 0.684^2 = 0.468$ using the reported 
root MSE in the output from command regress. 


The average sampling efficiency across the four parameters is 0.0644. 
Command bayesstats ess provides the efficiency of the MH draws for each 
parameter. 


. * MCMC statistics for all parameters 
. bayesstats ess 


Efficiency summaries    MCMC sample size =    10,000
                        Efficiency:  min =    .05397
                                     avg =    .06435
                                     max =    .08116

                    ESS   Corr. time   Efficiency
lnearnings
   education     539.68        18.53       0.0540
         age     811.58        12.32       0.0812
       _cons     665.97        15.02       0.0666
         var     556.64        17.97       0.0557


The column ESS gives the effective sample size. An ESS of 539.68 for the
education coefficient $\beta_{\text{ed}}$ means that the 10,000 correlated MH
draws of $\beta_{\text{ed}}$ are equivalent to 540 independent draws; see
[BAYES] bayesstats ess for details on computation. The column Efficiency
gives ESS/$S$, where $S$ is the number of MH draws. The column Corr. time is
the reciprocal of Efficiency. It is desirable to have efficiency greater than
0.10, but in practice efficiencies as low as 0.01 may be acceptable. Here the
efficiency rates range from 0.05 to 0.08.
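
As a worked check of these definitions, the education row of the table can be reproduced by hand. The symbol $S$ for the number of retained MH draws is introduced here only for illustration.

```latex
\[
\text{Efficiency} = \frac{\text{ESS}}{S} = \frac{539.68}{10{,}000} \approx 0.0540,
\qquad
\text{Corr. time} = \frac{1}{\text{Efficiency}} = \frac{10{,}000}{539.68} \approx 18.53 .
\]
```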


Command bayesgraph can be used to give the same four diagnostic 
graphs already presented in the i.i.d. case for each of the model parameters.
The following command gives these graphs for the slope coefficient of 
variable education. 


. * Diagnostic plots for MH posterior draws of beta_education 
. bayesgraph diagnostics {lnearnings:education}


Figure 29.4. Diagnostics for posterior draws of education coefficient


The plots in figure 29.4 suggest that the chain has converged. However, 
the bottom left panel indicates considerable correlation in the draws, leading 
to low efficiency for the MH algorithm. 


A specific diagnostic plot can be presented for several or all parameters. 
Here this is illustrated for the trace plot. 


. * Trace plot for all four parameters 
. bayesgraph trace _all, combine 


The traces for all four parameters given in figure 29.5 seem reasonable.


Figure 29.5. Trace for all four coefficients


A very useful visual diagnostic is to compare the first and second half of
all the MCMC draws for all parameters; an example is given in section 29.6.5.


29.6.4 A supposed uninformative prior can be informative 


It is easy to mistakenly specify a desired uninformative prior to be 
informative. A poor choice of prior for any single parameter, including 
nuisance parameters, can impact not only the posterior for that parameter but 
also the posterior for parameters of intrinsic interest. 


Suppose we have no idea about where the intercept should be centered
and specify an $N(0, 1.0^2)$ prior for the intercept, along with priors of
$N(0.06, 0.1^2)$ and $N(0.02, 0.1^2)$ for education and age. The prior for the
intercept is seemingly less tight than that for the slope coefficients because
the standard deviation is 10 times larger.


With this new prior, we obtain 


. * Bayesian posterior with prior for intercept tighter and centered on zero
. qui bayesmh lnearnings education age, likelihood(normal({var}))
> rseed(10101) prior({lnearnings:education}, normal(0.06,0.01))
> prior({lnearnings:age}, normal(0.02,0.01))
> prior({lnearnings:_cons}, normal(0,1))
> prior({var}, igamma(1,0.5))


. bayesstats summary 


Posterior summary statistics                      MCMC sample size =    10,000

                                                        Equal-tailed
                   Mean   Std. dev.      MCSE     Median  [95% cred. interval]
lnearnings
   education   .1495105    .022584      .001   .1481057   .1070548   .1944438
         age    .026712   .0070295   .000329   .0265557   .0143138   .0402314
       _cons   7.510787   .4769094    .02173   7.518329   6.541839   8.308205

         var   .5505413   .0918929   .005451   .5368902   .4038858   .7633164


The posterior mean for the intercept has changed somewhat, from 9.40 to
7.51, and is still a long way from the prior mean of 0. But the returns to
education and to age have more than doubled from 0.065 to 0.149 and from
0.011 to 0.027!


This susceptibility of posterior results to specification of the prior cannot 
be overemphasized. When an informative prior is used, there should be a 
justification for the priors on all parameters, including seemingly innocuous 
parameters such as the intercept. Where there is little prior knowledge, the
prior variance should be very large. And one should check the robustness of 
results to changes in the prior. 
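
A minimal sketch of such a robustness check, assuming the same 100-observation dataset is still in memory: refit the model with a much more diffuse intercept prior and compare the posterior summaries. The prior variance of 10,000 is an illustrative choice rather than a value taken from the text, and no output is shown because this run is not part of the original analysis.

```stata
* Sketch: same slope priors as in the preceding run, but a diffuse
* N(0, 100^2) intercept prior (the variance 10000 is an assumed value)
quietly bayesmh lnearnings education age, likelihood(normal({var})) ///
    rseed(10101) prior({lnearnings:education}, normal(0.06,0.01))   ///
    prior({lnearnings:age}, normal(0.02,0.01))                      ///
    prior({lnearnings:_cons}, normal(0,10000))                      ///
    prior({var}, igamma(1,0.5))

* Compare these posterior means with those under the N(0,1) intercept prior
bayesstats summary
```

If the slope posteriors move back toward the values of section 29.6.3, the tight intercept prior was driving the earlier results.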


29.6.5 A prior can lead to an unidentified model being identified 


Informative priors can aid in identification, though the source of 
identification may then be concealed. 


To illustrate this, we add the regressor educcopy, which is a duplicate of
variable education. The model is then no longer identified. OLS regression
using command regress leads to automatic dropping of one of education or
educcopy.


Now suppose we perform Bayesian analysis as in section 29.6.3, adding
the additional regressor educcopy with an $N(0.10, 0.01^2)$ prior (and dropping
age for brevity). We obtain


. * Bayesian posterior with same regressor appearing twice and informative prior
. generate educcopy = education

. qui bayesmh lnearnings education educcopy,
> rseed(10101) likelihood(normal({var}))
> prior({lnearnings:education}, normal(0.06,0.0001))
> prior({lnearnings:educcopy}, normal(0.10,0.0001))
> prior({lnearnings:_cons}, normal(10,100))
> prior({var}, igamma(1,0.5))

. bayesstats summary 

Posterior summary statistics                      MCMC sample size =    10,000

                                                        Equal-tailed
                   Mean   Std. dev.      MCSE     Median  [95% cred. interval]
lnearnings
   education   .0490347   .0092127   .000324   .0487151   .0315036   .0674347
    educcopy   .0886085   .0093957   .000406   .0884671   .0701305   .1071469
       _cons   8.870992   .1811948   .008026   8.871312   8.528059   9.223388

         var   .5123742   .0778485   .003532   .5052402   .3821599  .63845542


The coefficients of education and educcopy are now both identified with
posterior means of 0.0490 and 0.0886 and posterior standard deviations that
are relatively small.
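
The source of identification can be made explicit with a short derivation. The sketch below assumes only that educcopy is an exact copy of education (written $x_i$) and that the two slope priors are independent with the common variance specified in the command; $\beta_e$ and $\beta_c$ are labels introduced here for the two coefficients.

```latex
\[
x_i\beta_e + x_i\beta_c = x_i(\beta_e+\beta_c),
\]
so the likelihood depends on $(\beta_e,\beta_c)$ only through the sum
$\beta_e+\beta_c$. With independent priors $\beta_e \sim N(0.06,\,0.0001)$ and
$\beta_c \sim N(0.10,\,0.0001)$ that share a common variance, the sum and the
difference are independent a priori, and the data are silent on the difference:
\[
\beta_c-\beta_e \mid y \;\sim\; N(0.04,\;0.0002).
\]
Consistent with this, the reported posterior means differ by
$0.0886 - 0.0490 \approx 0.040$.
```

In other words, the data pin down only the sum of the two coefficients; how that sum is split between them comes entirely from the prior.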


From output not given, the sampling efficiency for the three parameters 
ranged from 0.049 to 0.081, compared with the range 0.054 to 0.081 for the 
model of section 29.6.3. 


As a check of chain convergence, we plot the density of the posterior for 
the first and second half of the draws for each parameter. 


. * Posterior density for first and second half of draws for all parameters
. bayesgraph kdensity _all, show(both) combine


Figure 29.6. Density for each half of draws for all four coefficients


Figure 29.6 suggests that the chain has converged.


As a more rigorous test, we then run multiple chains, rerunning the
previous bayesmh command with the additional option nchains(5). The
bayesstats grubin command then yields


. * Rc statistic from running five chains on preceding bayesmh command
. bayesstats grubin

Gelman-Rubin convergence diagnostic

Number of chains     =            5
MCMC size, per chain =       10,000
Max Gelman-Rubin Rc  =     2.391745

                     Rc
lnearnings
   education   2.391745
    educcopy   2.379293
       _cons   1.002488

         var   1.002649

Convergence rule: Rc < 1.1


The chain has clearly not converged, because the Rc statistics for education 
and educcopy are much greater than 1.1. 


The Stata default is to use 2,500 burn-in draws. Increasing the burn-in to
20,000 draws, by adding option burnin(20000) to the previous bayesmh
command, and running five chains leads to chain convergence because, from
output not given, the Rc statistics for the four parameters then range from
1.0004 to 1.0012.
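
A sketch of the multiple-chain run just described, written in do-file form. It assumes educcopy has already been generated and simply combines the options nchains(5) and burnin(20000) named in the text with the earlier command; output is not reproduced here.

```stata
* Sketch: five chains with a longer burn-in, then the Gelman-Rubin diagnostic
* (assumes educcopy already exists as a copy of education)
quietly bayesmh lnearnings education educcopy, likelihood(normal({var})) ///
    prior({lnearnings:education}, normal(0.06,0.0001))                   ///
    prior({lnearnings:educcopy}, normal(0.10,0.0001))                    ///
    prior({lnearnings:_cons}, normal(10,100))                            ///
    prior({var}, igamma(1,0.5))                                          ///
    nchains(5) burnin(20000) rseed(10101)

* Rc values close to 1 (below 1.1) for every parameter indicate convergence
bayesstats grubin
```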


This example makes it clear that informative priors can lead to model 
identification. In some cases, this may be desired, but from section 29.6.4, it 
is easy to inadvertently specify an informative prior. A robustness test is to 
make the priors less informative. In the current example, increasing the 
variances of the priors for education and educcopy from 0.0001 to 1.0 leads 
to the two variables having posterior standard deviation more than 10 times 
the posterior mean, an indication of extreme multicollinearity. 


29.7 Modifying the MH algorithm 


Stata uses a random-walk adaptive MH algorithm. The proposal density
$q(\theta^*|\theta_{s-1})$ is the normal with mean $\theta_{s-1}$ and variance
matrix $\rho_k^2\Sigma_k$, where $\rho_k$ and $\Sigma_k$ are updated every 100
MH iterations. Various default settings for the adaptive MH algorithm can be
changed using the options adaptation(), scale(), and covariance() of command
bayesmh.
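
As an illustration of how these options enter the command, the sketch below changes the adaptation interval and the initial proposal scale for the model of section 29.6.3. The particular values (adapting every 200 iterations, initial scale of 1) are assumptions chosen only to show the syntax, not recommended settings.

```stata
* Sketch: adapt the proposal every 200 iterations instead of the default 100
* and start from a smaller proposal scale (both values are illustrative)
quietly bayesmh lnearnings education age, likelihood(normal({var})) ///
    prior({lnearnings:education}, normal(0.06,0.0001))              ///
    prior({lnearnings:age}, normal(0.02,0.0001))                    ///
    prior({lnearnings:_cons}, normal(10,100))                       ///
    prior({var}, igamma(1,0.5))                                     ///
    adaptation(every(200)) scale(1) rseed(10101)

* Check whether the changed settings helped or hurt sampling efficiency
bayesstats ess
```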


Here we consider improving the efficiency of the MH algorithm by 
applying the algorithm within blocks and by using the Gibbs sampler. 


29.7.1 Blocking parameters 


In its default form, the adaptive MH algorithm acts on the entire vector of
parameters $\theta$. Computational efficiency can improve if the parameters are
broken into blocks, and the MH algorithm is applied recursively across the 
blocks. Blocks should be formed of parameters that are likely to be 
reasonably correlated with each other, while parameter correlation across 
blocks should be very low. The adaptations of the proposal densities are then 
done separately for each block. 


A minimal blocking separates mean parameters $\beta$ from variance
parameters $\sigma^2$.


. * MH with blocking: var in separate block
. qui bayesmh lnearnings education age, likelihood(normal({var}))
> prior({lnearnings:education}, normal(0.06,0.0001))
> prior({lnearnings:age}, normal(0.02,0.0001))
> prior({lnearnings:_cons}, normal(10,100))
> prior({var}, igamma(1,0.5)) block({var}) rseed(10101)

. bayesstats summary

Posterior summary statistics                      MCMC sample size =    10,000

                                                        Equal-tailed
                   Mean   Std. dev.      MCSE     Median  [95% cred. interval]
lnearnings
   education   .0647583   .0092147   .000311   .0648271    .047091   .0836823
         age    .010574   .0055013    .00021   .0105884  -.0002566   .0213766
       _cons   9.415344   .2905818   .009854   9.398902   8.853972   10.00916

         var   .4785125   .0701401   .001664   .4723414   .3593001   .6328165

. bayesstats ess

Efficiency summaries    MCMC sample size =    10,000
                        Efficiency:  min =    .06851
                                     avg =     .1052
                                     max =     .1776

                    ESS   Corr. time   Efficiency
lnearnings
   education     876.03        11.42       0.0876
         age     685.13        14.60       0.0685
       _cons     869.56        11.50       0.0870

         var    1776.22         5.63       0.1776

. di "Overall acceptance rate = " e(arate)
Overall acceptance rate = .3449655


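The efficiency for $\sigma^2$ has tripled, from 0.0557 to 0.1776. The
efficiencies for the education coefficient and the intercept have increased
somewhat, while that for the age coefficient has fallen slightly (from 0.0812
to 0.0685).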


29.7.2 Gibbs sampler within MH 


The Gibbs sampler is an MCMC method that has the advantage of having a
high acceptance rate (of one); see section 29.3.5. It can be used in Bayesian 
applications in cases where analytical results exist for one or more 
conditional posteriors. 


For the current example with inverse-gamma prior for $\sigma^2$, an analytical
result is available for the conditional posterior for $\sigma^2$ given $\beta$
(it is also inverse-gamma with known formulas for the shape and scale
parameters).
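
For reference, the standard conjugate result can be written out as follows. The sketch assumes the shape-scale parameterization of the inverse-gamma prior, with $a$ and $b$ used here as labels for the shape and scale, and $N$ for the number of observations.

```latex
\[
\sigma^2 \sim \text{IG}(a,\,b), \qquad
y_i \mid x_i, \beta, \sigma^2 \sim N(x_i'\beta,\ \sigma^2), \quad i = 1,\dots,N
\]
\[
\Longrightarrow\quad
\sigma^2 \mid \beta, y \;\sim\;
\text{IG}\!\left(a + \frac{N}{2},\;\; b + \frac{1}{2}\sum_{i=1}^{N}\,(y_i - x_i'\beta)^2\right).
\]
With $a = 1$, $b = 0.5$, and $N = 100$, as in the current example, the shape
parameter of the conditional posterior is $51$.
```

Because this conditional distribution can be sampled directly, every Gibbs draw of $\sigma^2$ is accepted.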


Stata command bayesmh allows use of the Gibbs sampler for drawing $\sigma^2$
given the current value of $\beta$, while the MH algorithm is used to draw
$\beta$ given its previous draw and the current value of $\sigma^2$.
We obtain


. * Hybrid MH with Gibbs sampling subcomponent
. qui bayesmh lnearnings education age, likelihood(normal({var}))
> prior({lnearnings:education}, normal(0.06,0.0001))
> prior({lnearnings:age}, normal(0.02,0.0001))
> prior({lnearnings:_cons}, normal(10,100))
> prior({var}, igamma(1,0.5)) block({var}, gibbs) rseed(10101)

. bayesstats summary

Posterior summary statistics                      MCMC sample size =    10,000

                                                        Equal-tailed
                   Mean   Std. dev.      MCSE     Median  [95% cred. interval]
lnearnings
   education   .0653215   .0090728   .000351    .065111   .0483262   .0836858
         age    .010581   .0053998   .000174   .0106941  -.0001928   .0214931
       _cons   9.411757   .2821473   .009121   9.417786   8.859742   9.956663

         var   .4779449   .0686372   .000747   .4714431   .3615059   .6279229

. bayesstats ess

Efficiency summaries    MCMC sample size =    10,000
                        Efficiency:  min =    .06697
                                     avg =     .2758
                                     max =     .8444

                    ESS   Corr. time   Efficiency
lnearnings
   education     669.72        14.93       0.0670
         age     960.54        10.41       0.0961
       _cons     956.93        10.45       0.0957

         var    8444.33         1.18       0.8444

. di "Overall acceptance rate = " e(arate)
Overall acceptance rate = .66958304


Now the efficiency for $\sigma^2$ is close to perfect, while there is some
further improvement for two of the three components of $\beta$ (age and the
intercept).


29.8 RE model 


The treatment of the regression model to date has assumed i.i.d. errors. In
practice, errors are often grouped or clustered. We present Bayesian analysis
of an RE model, with intercept that varies across groups, based on hierarchical
priors. Richer multilevel models additionally allow slope coefficients to vary 
across groups. 


29.8.1 RE MLE 


We begin with ML estimation of the usual RE model for individual $i$ in group
$j$. Then,


$$y_{ij} = \beta_1 + \beta_2 x_{ij} + v_j + \varepsilon_{ij}$$


where $v_j \sim N[0, \sigma_v^2]$ and $\varepsilon_{ij} \sim N[0, \sigma_\varepsilon^2]$.


The ML estimates of $\beta_1$, $\beta_2$, $\sigma_v^2$, and
$\sigma_\varepsilon^2$ can be obtained using command mixed.


As an example, we adapt the model in section 29.6.3 to have a separate
random intercept at each level of education rather than entering education
as a continuous regressor. Thus, we regress lnearnings ($y_{ij}$) on an
intercept and age ($x_{ij}$) with grouping ($j$) on each discrete value of
education. We obtain


. * Mixed-effects estimation of RE model
. qui use mus229acs, clear

. quietly keep if _n <= 100

. mixed lnearnings age || education: , nolog

Mixed-effects ML regression                     Number of obs    =        100
Group variable: education                       Number of groups =         11
                                                Obs per group:
                                                             min =          1
                                                             avg =        9.1
                                                             max =         33
                                                Wald chi2(1)     =       1.20
Log likelihood = -107.28605                     Prob > chi2      =     0.2730

  lnearnings | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
         age |   .0070023   .0063881     1.10   0.273    -.0055181    .0195228
       _cons |   10.35348   .3266064    31.70   0.000     9.713341    10.99362

  Random-effects parameters |   Estimate   Std. err.     [95% conf. interval]
education: Identity
                 var(_cons) |   .1566638   .1808768      .0163009    1.505659
              var(Residual) |   .4457641   .0708718      .3264172    .6087474

LR test vs. linear model: chibar2(01) = 4.31          Prob >= chibar2 = 0.0189


The estimated intercept and slope are, respectively, 10.35 and 0.0070, and
the estimated error variances are $\hat\sigma_v^2 = 0.157$ and
$\hat\sigma_\varepsilon^2 = 0.446$.
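
A quick way to gauge the importance of the grouping is the share of total error variance attributable to the education-level intercept, which works out to about 0.26 from the estimates above. The sketch below computes it by hand and, as an alternative, with the estat icc postestimation command; it assumes the mixed results are still the active estimates.

```stata
* Sketch: share of total error variance due to the random intercept,
* computed from the two variance estimates reported by mixed
display 0.1566638/(0.1566638 + 0.4457641)

* The same quantity from the postestimation command (run directly after mixed)
estat icc
```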


For comparison with subsequent analysis, we additionally calculate the
best linear unbiased predictors of the group-specific errors $v_j$ and the
corresponding group-specific intercepts $\hat\beta_{1j} = \hat\beta_1 + \hat v_j$
at each level of education.


. * Predict the random intercepts = intercept + RE
. predict u0, reffects

. by education, sort: generate tolist = (_n==1)

. generate randomint = _b[_cons] + u0

. list education u0 randomint if tolist, clean

       educat~n          u0   randomint
  1.          0   -.5186173    9.834861
  3.          6   -.1734726    10.18001
  4.          8   -.1500419    10.20344
  5.          9   -.1516948    10.20178
  6.         10    .4270733    10.78055
  7.         12    .0500333    10.40351
 40.         13   -.0934182    10.26006
 56.         14   -.2972531    10.05622
 62.         16    .3830092    10.73649
 91.         18    .1542939    10.50777
 99.         20    .3700881    10.72357


The predicted intercepts range from 9.83 to 10.78 and are highest for 
completed degrees (at 10, 12, 16, 18, and 20 years of education). 


29.8.2 Bayesian RE models using hierarchical priors 


The preceding RE model can equivalently be written as a model with
intercept that varies with group $j$. The model is


$$y_{ij} = \beta_{1j} + \beta_2 x_{ij} + \varepsilon_{ij}$$


where $\beta_{1j} \sim N(\beta_1, \sigma_v^2)$ and
$\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$.


A hierarchical prior for the intercept first specifies a prior for
$\beta_{1j}$, here the $N(\beta_1, \sigma_v^2)$ distribution. Then, priors are
specified for the parameters $\beta_1$ and $\sigma_v^2$ of the original prior
for $\beta_{1j}$.


We again consider regression of lnearnings on an intercept and
age, where the intercept varies by level of education.


A simple approach is to use default priors and apply the bayes prefix to 
the mixed command used in section 29.8.1. 


. bayes, showreffects rseed(10101): mixed lnearnings age || education:
  (output omitted)


The showreffects option leads to output that includes the random effects.


The bayes: mixed command uses Gibbs sampling for regression 
coefficients and variance components and uses adaptive MH for the random 
effects. The command is restricted to normal priors for random effects and 
does not provide an ability to change this. 


The bayesmh command with the multilevel specification introduced in
Stata 17 provides much more flexibility. The random effects with grouping
on level of education can be specified as V[education]. Then, a variance
component var_V is created, and V[education] has an N(0, var_V) prior.
Other capitalized letters such as U could instead be used. For illustrative
purposes, we also specify the priors, rather than use defaults. The
hierarchical priors are specified to be $N(10, 10^2)$ for _cons and
inverse-gamma(0.0001, 0.0001) for var_V. In this example, the Gibbs sampler
can be used throughout, so we use the block(..., gibbs split) option with all
blocks of parameters (regression coefficients, variance components, and
random effects) included. We obtain


. * Bayesian hierarchical model using bayesmh with multilevel specification
. bayesmh lnearnings age V[education], rseed(10101)
> likelihood(normal({var_0})) showreffects
> prior({lnearnings:_cons}, normal(10,100))
> prior({lnearnings:age}, normal(0,0.01))
> prior({var_0}, igamma(0.0001, 0.0001))
> prior({var_V}, igamma(0.0001, 0.0001))
> block({lnearnings:} {var_0} {var_V} {V}, gibbs split)

Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 10000 .........1000.........2000.........3000.........
> 4000.........5000.........6000.........7000.........
> 8000.........9000.........10000 done


Model summary 


Likelihood: 
lnearnings ~ normal (xb_lnearnings,{var_0}) 
Priors: 
{lnearnings:_cons} ~ normal(10,100) (1) 
{lnearnings:age} ~ normal(0,0.01) (1) 
{V[education]} ~ normal(0,{var_V}) (1) 
{var_0} ~ igamma(0.0001,0.0001) 
Hyperprior: 


{var_V} ~ igamma(0.0001,0.0001) 


(1) Parameters are elements of the linear form xb_lnearnings. 


Bayesian normal regression                       MCMC iterations  =     12,500
Gibbs sampling                                   Burn-in          =      2,500
                                                 MCMC sample size =     10,000
                                                 Number of obs    =        100
                                                 Acceptance rate  =          1
                                                 Efficiency:  min =     .01551
                                                              avg =      .1382
Log marginal-likelihood                                       max =      .4035

                                                        Equal-tailed
                   Mean   Std. dev.      MCSE     Median  [95% cred. interval]
lnearnings
         age   .0064714   .0070284   .000528   .0062262   -.006898   .0204813
       _cons   10.38439   .3732316   .029968   10.39792    9.60542    11.0615

       var_0   .4700626     .07673   .001575   .4615064   .3426978   .6421693
       var_V   .2488358   .2868117   .011141     .16028   .0019257   1.014185

V[education]
           0  -.5035241   .4451703   .015694   -.438133  -1.521537   .1366138
           6   -.186511   .3900403   .007276   -.124512   -1.08712   .5032018
           8  -.1568481   .3800651   .005983  -.1008047   -1.04303   .5179899
           9  -.1586843   .3846119   .006312  -.1023517  -1.039526   .5410226
          10   .4690828    .503758   .020851   .3648499  -.2273046   1.664463
          12   .0466093   .2034901   .007634   .0305266   -.342077   .4904817
          13  -.0795536   .2194784   .006561  -.0698219  -.5219822   .3719066
          14  -.2596706   .2745591   .007767  -.2334558   -.849786   .2167593
          16    .346646   .2346539   .013663   .3302939  -.0315136   .8540507
          18   .1459382    .254649   .008329    .114742  -.3120736   .7104607
          20   .3715104   .3948125   .012248   .3129923  -.2345209   1.267298


The posterior means of the intercept parameter and the age slope parameter
are, respectively, 10.38 and 0.0065, compared with ML estimates of 10.35
and 0.0070 from command mixed. And the posterior means of the model
variances are 0.249 and 0.470 compared with ML estimates of 0.157 and
0.446. The corresponding posterior standard deviations are within 20% of
the ML standard errors, aside from the variance of the random intercept
(var_V).


The average sampling efficiency of 0.1382 is good. Note that RE models 
can introduce many more parameters (here 12), leading to computational 
challenges so that greater care is needed to confirm that the chain is likely to 
have converged. The acceptance rate is 1 because in this special case, all 
conditional distributions were available, so only the Gibbs sampler is used. 


The preceding command can be adapted to specify alternative prior
distributions. For example, the option prior({lnearnings:_cons},
t(10,100,5)) specifies a Student's $t$ prior for the intercept with mean 10,
squared scale 100, and 5 degrees of freedom. In that case, the default
random-walk MH sampling needs to be used because Gibbs sampling is no
longer possible.
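
A sketch of the adapted command, in do-file form. It keeps the other priors from the preceding example and simply omits the block(..., gibbs split) option so that the default sampling described in the text is used; results are not shown because this variant is not run in the text.

```stata
* Sketch: Student's t prior for the intercept. No block(..., gibbs) option
* is given, so bayesmh uses its default sampling for these parameters.
bayesmh lnearnings age V[education], rseed(10101)      ///
    likelihood(normal({var_0})) showreffects           ///
    prior({lnearnings:_cons}, t(10,100,5))             ///
    prior({lnearnings:age}, normal(0,0.01))            ///
    prior({var_0}, igamma(0.0001, 0.0001))             ///
    prior({var_V}, igamma(0.0001, 0.0001))
```

With MH sampling in place of Gibbs, efficiency is likely to be lower, so the convergence checks of section 29.6.5 become more important.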


29.9 Bayesian model selection 


The principal tools for Bayesian model selection are Bayes factors and the 
posterior odds ratio. These can be applied to either nested or nonnested 
models. 


29.9.1 Bayes factors 


Bayes factors are based on the marginal likelihood
$m(y|X) = \int L(y|\theta, X) \times \pi(\theta)\,d\theta$.


Let $m_1(y|X)$ and $m_2(y|X)$ denote the marginal likelihoods of
models 1 and 2. Then the Bayes factor is defined as


$$\text{Bayes factor} = B_{12} = \frac{m_1(y|X)}{m_2(y|X)}$$


Model 1 is preferred to model 2 if $B_{12} > 1$. A commonly used rule of
thumb is that evidence against model 2 is weak if $1 < B_{12} < 3$, substantial
if $3 < B_{12} < 20$, strong if $20 < B_{12} < 150$, and very strong if
$B_{12} > 150$. This rule of thumb does not account for the relative size of
the two models.


However, the marginal likelihood $m(y|X)$ is difficult to compute;
recall that MCMC methods could obtain draws from the posterior without
computing the marginal likelihood. Several methods have been proposed to
numerically estimate the marginal likelihood; see Kass and Raftery (1995)
and references in Cameron and Trivedi (2005, 457-458). Stata command
bayesmh uses the Laplace-Metropolis estimator


$$\hat m = (2\pi)^{q/2}\,|\hat\Sigma|^{1/2}\,L(y|\hat\theta, X)\,\pi(\hat\theta)$$


where $q$ is the dimension of $\theta$, $\hat\Sigma$ is the posterior sample
variance, and $\hat\theta$ is the mode of the posterior density. This quantity
$\hat m$ is included, in logarithms, in the output from command bayesmh.
It can always be computed given MCMC draws from the posterior, even when
the posterior is improper. Considerable care is therefore needed. In
particular, Koop (2003, 42) states that noninformative priors should be used
only for parameters common to both models; otherwise, informative priors
should be used.


As an example, we compare two nested models, one with both 
education and age as regressors and one with just age as a regressor. The 
same priors are used for the parameters common to both models. The 
results for the larger model were already saved earlier. For the smaller 
model, we have 


. * Bayesian posterior for small model without regressor education
. qui bayesmh lnearnings age, likelihood(normal({var}))
> prior({lnearnings:age}, normal(0.02,0.0001))
> prior({lnearnings:_cons}, normal(10,100))
> prior({var}, igamma(1,0.5))
> saving(mcmcdraws_smallregress, replace) rseed(10101)

. estimates store smallregress

. di "Marginal likelihood for smaller model " e(lml_lm)
Marginal likelihood for smaller model -117.77778


The Bayes factor can be computed using command bayesstats ic. 
Here we compare just two models, but several models can be listed. The 
default is for the base model to be the first listed model. Here we use option 
bayesfactor to have the Bayes factor listed; the default is to instead print 
the natural logarithm of the Bayes factor. 


. * Bayes factor with first-listed model the base model
. bayesstats ic fullregress smallregress, bayesfactor

Bayesian information criteria

                      DIC    log(ML)         BF
 fullregress     211.5527  -111.2892          .
smallregress       225.06  -117.7778   .0015207

Note: Marginal likelihood (ML) is computed
      using Laplace-Metropolis approximation.


The Bayes factor is much less than one, very strongly favoring the base
model, which includes both education and age as regressors. Note that
$B_{21} = m_2/m_1 = e^{-117.78}/e^{-111.29} = 0.00152 < 1/150$. The output also
includes the deviance information criterion (DIC), which is based on the
deviance statistic for the likelihood; see [BAYES] bayesstats ic for details.
Models with smaller values of DIC are preferred, so the first model is
preferred. In large samples, it can be shown that one half the difference in
DIC across two models equals the natural logarithm of the Bayes factor. Here
$\Delta\text{DIC}/2 = -6.75$ and $\ln(0.0015207) = -6.49$.
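
These figures can be checked with a few lines of arithmetic on the reported DIC and log marginal-likelihood values; the sketch below uses only numbers printed in the table above. The three results are approximately 0.00152, -6.49, and -6.75.

```stata
* Bayes factor of the small model relative to the full (base) model,
* from the two reported log marginal likelihoods
display exp(-117.7778 - (-111.2892))

* Natural logarithm of that Bayes factor
display -117.7778 - (-111.2892)

* One half the difference in DIC (full model minus small model)
display (211.5527 - 225.06)/2
```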


29.9.2 Posterior odds ratio 


Additionally, one may have prior probabilities $p_1$ and $p_2 = 1 - p_1$ for
models 1 and 2. In that case, the Bayes factor is weighted by the ratio of
these prior probabilities to give the posterior odds ratio


$$\text{Posterior odds} = B_{12} \times \frac{p_1}{p_2}
  = \frac{m_1(y|X)}{m_2(y|X)} \times \frac{p_1}{p_2}$$


Command bayestest model can be used to compute posterior model
probabilities, from which the posterior odds ratio follows. In this example,
we place prior probabilities of 0.8 on the larger model and 0.2 on the smaller
model. The larger model is clearly favored because it has $P(M|y)$ much
greater than 0.5.


. * Posterior odds Bayes factor with first-listed model the base model
. bayestest model fullregress smallregress, prior(0.8 0.2)

Bayesian model tests

                 log(ML)     P(M)   P(M|y)
 fullregress   -111.2892   0.8000   0.9996
smallregress   -117.7778   0.2000   0.0004

Note: Marginal likelihood (ML) is computed using
      Laplace-Metropolis approximation.
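
The reported posterior model probabilities can be recovered from the Bayes factor of section 29.9.1 and the prior probabilities. The worked calculation below treats the full model as model 1, so that $B_{12}$ is the reciprocal of the reported Bayes factor 0.0015207.

```latex
\[
B_{12} = \frac{1}{0.0015207} \approx 657.6, \qquad
\text{Posterior odds} = B_{12}\times\frac{p_1}{p_2}
  = 657.6 \times \frac{0.8}{0.2} \approx 2630,
\]
\[
P(M_1 \mid y) = \frac{\text{Posterior odds}}{1+\text{Posterior odds}}
  = \frac{2630}{2631} \approx 0.9996 .
\]
```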


29.10 Bayesian prediction


Most prediction methods give a single prediction for an observation.
Bayesian methods can be used to obtain a posterior predictive distribution
that provides a distribution, not just a single prediction.


29.10.1 Posterior predictive distribution 


Bayesian analysis treats both $y$ and knowledge of $\theta$ as random. Given a
specified likelihood function $L(y|\theta)$ and prior $\pi(\theta)$, the
marginal density of $y$ or marginal likelihood is then
$p(y) = \int p(y, \theta)\,d\theta = \int L(y|\theta)\pi(\theta)\,d\theta$.


Suppose we wish to predict new observations $y^{new}$ conditioning on
existing observations $y^{obs}$. The posterior predictive distribution of
$y^{new}$ given $y^{obs}$ is obtained as


$$\begin{aligned}
p(y^{new}|y^{obs}) &= \int p(y^{new}, \theta|y^{obs})\,d\theta \\
 &= \int p(y^{new}|\theta, y^{obs})\,p(\theta|y^{obs})\,d\theta \\
 &= \int L(y^{new}|\theta)\,p(\theta|y^{obs})\,d\theta
\end{aligned}$$


where the last equality assumes independence between $y^{new}$ and $y^{obs}$
conditional on $\theta$.


Here $L(y^{new}|\theta)$ is the likelihood for the new observations, and
$p(\theta|y^{obs})$ is the posterior. By making $B$ MCMC draws from the
posterior $p(\theta|y^{obs})$ and drawing $y^{new}$ from $L(y^{new}|\theta)$ at
each of these draws, we obtain $B$ draws from the posterior predictive
distribution of $y^{new}$ given $y^{obs}$.


For regression, the posterior predictive distribution is


$$p(y^{new}|y^{obs}, X^{new}, X^{obs})
  = \int L(y^{new}|\theta, X^{new})\,p(\theta|y^{obs}, X^{obs})\,d\theta$$


and MCMC draws are obtained similarly.


29.10.2 The bayespredict command 


The Bayesian prediction commands bayespredict and bayesstats 
ppvalues are available only after bayesmh; they are not available after the 
simpler bayes prefix. 


The bayespredict postestimation command obtains many MCMC draws
for each observation and stores them in a file,


bayespredict ysimspec [ysimspec ...] [if] [in], saving(filespec) [simopts]


where in the simplest case ysimspec is {_ysim}. More generally, one can
specify a function to be predicted, in which case funcspec replaces ysimspec.
The default is to obtain 10,000 draws.


Often, one is interested in only a summary of the draws, such as the
posterior mean for each observation. This can be obtained using the
command


bayespredict [type] newvarspec [if] [in], mean


To obtain the posterior median, standard deviation, and credible interval,
replace mean with, respectively, median, std, and cri.


29.10.3 Bayesian prediction example 


Bayesian prediction has many purposes, including multiple imputation (see 
section 30.5) and model checking. 


Here we consider a simple example where predictions of log earnings are
made for 2 individuals, 1 with 12 years of education and aged 40 years and 1
with 16 years of education and aged 40 years.


We fit the same model as in section 29.6.3 based on 100 observations. 


. * Bayesian posterior with informative priors: Normal for b, inv gamma for s2
. qui bayesmh lnearnings education age, likelihood(normal({var}))
> rseed(10101) prior({lnearnings:education}, normal(0.06,0.0001))
> prior({lnearnings:age}, normal(0.02,0.0001))
> prior({lnearnings:_cons}, normal(10,100))
> prior({var}, igamma(1,0.5)) saving(mcmcdraws_fullregress, replace)


The explanatory variables for the 2 new observations are generated as 
follows, using knowledge that the original sample had 100 observations. 


. * Create 2 new observations (numbers 101 and 102) for prediction
. set obs 102
Number of observations (_N) was 100, now 102.

. quietly replace education = 12 if _n == 101

. quietly replace age = 40 if _n == 101

. quietly replace education = 16 if _n == 102

. quietly replace age = 40 if _n == 102


The following command yields 10,000 MCMC draws from the posterior
predictive density for each of the 2 individuals.


. * Obtain 10,000 MCMC predictions for each of the 2 individuals
. bayespredict {_ysim1} if _n > 100, saving(mcmcpredict, replace) rseed(10101)

Computing predictions ...

file mcmcpredict.dta not found; file saved.
file mcmcpredict.ster not found; file saved.


For the first individual, the 10,000 predictions of the actual value of log
earnings have mean 10.608, while the 10,000 predictions of the mean
($x'\beta$) have average 10.613.


. * Summarize the MCMC predictions dataset
. preserve

. use mcmcpredict, clear

. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
      _chain |     10,000           1           0          1          1
      _index |     10,000      5000.5    2886.896          1      10000
  _ysim1_101 |     10,000    10.60773    .7054899    7.29318   13.20271
  _ysim1_102 |     10,000    10.87181    .6986186   7.716644   13.52072
    _mu1_101 |     10,000    10.61325    .0728873   10.33691   10.86419
    _mu1_102 |     10,000    10.87211    .0726094   10.61295   11.13556
  _frequency |     10,000           1           0          1          1

. restore

The bayesstats summary command can be used to provide summaries of 
the posterior predictive distribution for the two observations. 


. * Summarize the MCMC predictions
. bayesstats summary {_ysim} using mcmcpredict

Posterior summary statistics                      MCMC sample size =    10,000

                                                        Equal-tailed
                   Mean   Std. dev.      MCSE     Median  [95% cred. interval]
  _ysim1_101   10.60773   .7054899   .007055   10.60946   9.208361   11.99379
  _ysim1_102   10.87181   .6986186   .006986   10.87383   9.501282    12.2324

Note that the 95% credible regions are very broad because they are for
forecasts of the actual value of log earnings rather than for prediction of the
conditional mean of log earnings; see section 4.2.5 for discussion of this
important distinction.
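
The size of the gap can be seen from the law of total variance. The rough check below, for observation 101, approximates the posterior mean of $\sigma^2$ by the value 0.481 reported in section 29.6.3 and uses the standard deviation of the predicted conditional mean (0.0729) from the summarize output above; it is an approximation that ignores simulation error.

```latex
\[
\operatorname{Var}(y^{new}\mid y^{obs})
  = \mathrm{E}\!\left[\sigma^2 \mid y^{obs}\right]
  + \operatorname{Var}\!\left(x'\beta \mid y^{obs}\right)
  \approx 0.481 + 0.0729^2 \approx 0.486,
\]
\[
\sqrt{0.486} \approx 0.70 .
\]
```

This is in line with the simulated posterior predictive standard deviation of 0.705, and it shows that almost all of the forecast uncertainty comes from the error variance rather than from uncertainty about $x'\beta$.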


There is no need to first generate and save a file with the 10,000 MCMC 
draws if interest lies only in summary statistics such as the mean of the 
posterior predictive distribution. For example, 


. * Directly obtain the mean of the posterior predictive distribution.
. bayespredict plnearnings if _n > 100, mean rseed(10101)

Computing predictions ...

. list plnearnings if _n > 100, clean

       plnear~s
101.   10.60773
102.   10.87181
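
The other summaries are obtained in the same way. The sketch below assumes the fitted bayesmh model and the two added observations are still in memory; the new variable names pmedian, psd, plower, and pupper are illustrative choices.

```stata
* Sketch: other summaries of the posterior predictive distribution
* (variable names pmedian, psd, plower, pupper are illustrative)
bayespredict pmedian if _n > 100, median rseed(10101)
bayespredict psd if _n > 100, std rseed(10101)

* cri creates two variables: lower and upper bounds of the credible interval
bayespredict plower pupper if _n > 100, cri rseed(10101)

list pmedian psd plower pupper if _n > 100, clean
```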


29.11 Probit example 


The same MH algorithm can be applied to nonlinear regression models such 
as the probit model. 


29.11.1 MLE 


We continue with the same dataset but consider probit regression where the 
dependent variable equals 1 if annual earnings exceed $75,000 and equals 0 
otherwise. 


. * Create dependent variable for probit example
. qui use mus229acs, clear

. quietly keep if _n <= 100

. generate dhighearns = earnings > 75000

. quietly save musrearnings, replace

. summarize dhighearns age education

    Variable |        Obs        Mean    Std. dev.       Min        Max
  dhighearns |        100          .2    .4020151          0          1
         age |        100       43.33     10.9342         25         65
   education |        100       13.69    3.158106          0         20


Here 20% of the sample has high earnings. 


We regress dhighearns on an intercept, education and age by ML using 
command probit. 


. * MLE for the probit regression
. probit dhighearns education age, nolog

Probit regression                               Number of obs    =        100
                                                LR chi2(2)       =       6.66
                                                Prob > chi2      =     0.0358
Log likelihood = -51.778422                     Pseudo R2        =     0.0604

  dhighearns | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
   education |   .1184497    .055289     2.14   0.032     .0100852    .2268142
         age |   .0199199   .0138726     1.44   0.151    -.0072699    .0471097
       _cons |  -3.243121   1.071253    -3.03   0.002    -5.342738   -1.143505


The slope coefficients are 0.118 and 0.020. From output given at the end of 
section 29.11.3, the corresponding average marginal effects (AMEs), obtained 
using command margins, are 0.035 and 0.006. So, for example, one more 
year of education is associated with a 0.035 increase in the probability of 
earnings exceeding $75,000. 


29.11.2 Bayesian analysis 


We next perform Bayesian analysis of the same model. 


The simplest approach is to use the bayes prefix with default normal 
uninformative priors, here bayes: probit dhighearns education age. 


Instead, we use the bayesmh command with flat priors. This gives 
essentially the same results. We obtain 


. * Bayesian posterior for probit regression with flat priors for beta
. bayesmh dhighearns education age, rseed(10101) likelihood(probit)
>     prior({dhighearns:}, flat) saving(mcmcdraws_probit, replace)

Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood:
  dhighearns ~ probit(xb_dhighearns)

Prior:
  {dhighearns:education age _cons} ~ 1 (flat)                              (1)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_dhighearns.

Bayesian probit regression                       MCMC iterations  =     12,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =     10,000
                                                 Number of obs    =        100
                                                 Acceptance rate  =      .2726
                                                 Efficiency:  min =     .08428
                                                              avg =     .08626
Log marginal-likelihood = -58.189797                          max =     .08879

------------------------------------------------------------------------------
             |                                                Equal-tailed
  dhighearns |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
-------------+----------------------------------------------------------------
   education |   .128553   .0549554   .001844   .1268172   .0218047   .2425331
         age |  .0212355    .013818   .000476   .0212071  -.0049327   .0497453
       _cons |  -3.45482   1.088155   .037168   -3.43342  -5.695359  -1.481793
------------------------------------------------------------------------------

file mcmcdraws_probit.dta not found; file saved.


The posterior means for education and age are, respectively, 0.1286 and 
0.0212, similar to the ML estimates of 0.1185 and 0.0199. The posterior 
standard deviations for education and age are, respectively, 0.0550 and 
0.0138, very close to the ML standard errors of 0.0553 and 0.0139. 


This result is expected. Because a noninformative prior is used, the 
Bayesian posterior means and standard deviations will be very close to the 
ML parameter estimates and standard errors. The Bayesian estimates, 
however, can be given a Bayesian interpretation. 
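One practical consequence of the Bayesian interpretation is that direct probability statements about a parameter are available. The following is a minimal sketch, assuming the bayesmh results above are still the active estimation results; the thresholds 0 and 0.10 are illustrative choices and do not come from the text.

```stata
* Posterior probability that the education coefficient is positive
bayestest interval {dhighearns:education}, lower(0)

* Posterior probability that the education coefficient exceeds 0.10
* (the threshold 0.10 is an arbitrary illustrative value)
bayestest interval {dhighearns:education}, lower(0.10)
```

Each call reports the share of the retained MCMC draws that fall in the stated interval, which is the simulation estimate of the corresponding posterior probability.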


29.11.3 MEs 


From section 10.4.6, the probit coefficients can be directly interpreted up to
a ratio because the probit model is a single index model. For example, the
ME of one more year of education is roughly equivalent to that from 6 years
of aging because the posterior mean of education is 6.05 (= 0.1286/0.0212)
times that of age. More precisely, we should compute the posterior mean of
$\beta_{\text{education}}/\beta_{\text{age}}$ using command bayesstats summary.
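As a sketch of that computation, bayesstats summary accepts a labeled expression in the model parameters and evaluates it draw by draw. The label ratio is an arbitrary name chosen for illustration, and the command assumes the bayesmh results above are still in memory.

```stata
* Posterior summary of the ratio of the education and age coefficients,
* computed at each of the retained MCMC draws
bayesstats summary (ratio: {dhighearns:education}/{dhighearns:age})
```

Because the 95% credible interval for the age coefficient includes zero, the draws of this ratio can be very dispersed, so the posterior median may be a more stable summary than the posterior mean.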


But this still leaves the question of the extent to which the probability of 
having high earnings changes. To obtain MEs requires additional coding after 
command bayesmh. 


The AME of a change in the $j$th regressor evaluated at the $s$th posterior
draw $\beta_s$ is

$$ \text{AME}_{js} = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i'\beta_s) \times \beta_{js} $$

where $\phi(\cdot)$ is the standard normal density function.


The AMEs can be calculated by applying the bayespredict command to 
a user-defined program that computes the AMEs. 


We first define a program we label ameprog. 


. * Program to compute AME = phi(x'b)*b_j to be used by bayespredict
. program ameprog
  1.   version 17.0
  2.   args sum mu
  3.   // Define locals with shorter names for convenience
.   local touse $BAYESPR_touse
  4.   local theta $BAYESPR_theta
  5.   local j $BAYESPR_passthruopts
  6.   // Obtain the current MCMC value of the jth coefficient
.   tempname betaj
  7.   scalar `betaj' = `theta'[1,`j']
  8.   // Compute phi(xb) and store it in temporary variable tmpv
.   tempvar tmpv
  9.   generate double `tmpv' = normalden(invnormal(`mu')) if `touse'
 10.   summarize `tmpv' if `touse', meanonly
 11.   // Store the final result, AME, in temporary scalar sum
.   scalar `sum' = r(mean)*`betaj'
 12. end


The program has two arguments: sum and mu. sum contains the name of a
temporary scalar that stores the final result, here the AME. mu is a temporary
variable that contains the linear prediction and is automatically filled in by
bayespredict and passed to the program. Special global macros,
$BAYESPR*, contain additional information that bayespredict passes to the
program. In our example, we need the "touse" variable ($BAYESPR_touse),
which marks the observations to be used in the computation; the MCMC
coefficient values ($BAYESPR_theta), which are provided by bayespredict
in a temporary vector (one MCMC value at a time); and the additional options
passed to the command and stored in global $BAYESPR_passthruopts. In our
example, the additional option is the position of the coefficient for which we
want to compute the AME.


Next, we call bayespredict to compute AMEs for the three coefficients
by using our own ameprog program; {_mu1} refers to the linear predictor.


. * Program that computes AMEs in each of 10,000 (the default) MCMC draws
. bayespredict (ame1:@ameprog {_mu1}, passthruopts(1))
>     (ame2:@ameprog {_mu1}, passthruopts(2))
>     (ame3:@ameprog {_mu1}, passthruopts(3)), saving(amepred, replace)


Computing predictions 


file amepred.dta not found; file saved. 
file amepred.ster not found; file saved. 


Finally, we use bayesstats summary to compute posterior summaries of
the 10,000 AMEs for the three regressors education, age, and the intercept.


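. * Summary statistics for the AMEs from 10,000 MCMC draws
. bayesstats summary {ame1} {ame2} {ame3} using amepred

Posterior summary statistics                      MCMC sample size =    10,000

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
-------------+----------------------------------------------------------------
        ame1 |  .0361042   .0146516   .000503    .035981   .0062222   .0646957
        ame2 |  .0059395   .0037616   .000131   .0059721  -.0013581    .013301
        ame3 | -.9684725   .2683455   .009798  -.9728186  -1.471454  -.4514302
------------------------------------------------------------------------------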


The AME for education has posterior mean 0.0361, posterior standard 
deviation 0.0147 and 95% credible region [0.0062, 0.0647]. 


The Bayesian AMEs are very similar to those obtained using classical
methods after ML estimation, as expected given the use of flat priors in this 
example. 


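. * Compare Bayesian AME to AME from ML estimation
. qui probit dhighearns education age

. margins, dydx(*)

Average marginal effects                                   Number of obs = 100
Model VCE: OIM

Expression: Pr(dhighearns), predict()
dy/dx wrt:  education age

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   education |   .0345357   .0152421     2.27   0.023     .0046617    .0644097
         age |   .0058079   .0039359     1.48   0.140    -.0019063    .0135222
------------------------------------------------------------------------------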


29.12 Additional resources 


The [BAYES] Stata Bayesian Analysis Reference Manual provides a very
detailed exposition of Bayesian analysis and MCMC computation. The key
command is bayesmh. Bayesian analysis is presented in many books,
including Koop (2003), Gelman et al. (2013), and Cameron and
Trivedi (2005, chap. 13).


29.13 Exercises 


1. Consider the log-earnings example introduced in section 29.2 and
analyzed further in later sections. Give command keep if _n > 600 &
gender==1 to restrict analysis to a subset of women. Perform analysis
with default priors using bayes, rseed(10101): regress lnearnings
education age hours. Note that hours is also a regressor. Do the
statistics given in the output suggest any convergence problems? Find
the probability that the coefficient of education exceeds 0.10. Compare
the Bayesian estimates with ML estimates and comment.

2. Continuing with the previous example, give command bayesgraph
diagnostics separately for each of the four regressors and for the error
variance (sigma2). And give command bayesgraph ac _all,
combine. Finally, repeat the bayes command of question 1, adding the
option nchains(5), and then give command bayesstats grubin. Do
you see any problems? Explain.

3. Continue with the same example as in questions 1 and 2, but use the 
bayesmh command, and specify the same informative prior as at the 
start of section 29.6.3, adding an N(0.2, 0.17) prior for the additional 
regressor hours. Compare estimates with the priors and with the 
estimates in question 1. 

4. Compare, on the basis of the Bayes factor, the model of question 1 with 
a smaller model that omits age and hours. Which model do you prefer? 

5. Suppose $y$ is exponentially distributed with density $f(y) = \theta e^{-\theta y}$ and
mean $E(y) = 1/\theta$. Let $y = (y_1, \ldots, y_N)$ be data from a random
sample of size $N$. Show that the likelihood is $L(y|\theta) = \theta^N e^{-\theta N \bar{y}}$,
where $\bar{y}$ is the sample mean. Show that the MLE is $\hat{\theta} = 1/\bar{y}$. Suppose the
prior for $\theta$ is exponential with density $\pi(\theta) = 2e^{-2\theta}$, so the prior has
mean $E(\theta) = 1/2$. Show that the resulting posterior density
$p(\theta|y) \propto \theta^N e^{-\theta(N\bar{y}+2)}$. One way to write a gamma density with
parameters $a$ and $b$ is $f(x) = \{1/(\Gamma(a) b^a)\} x^{a-1} e^{-x/b}$, where $\Gamma(\cdot)$ is the
gamma function, the mean $E(x) = ab$, and variance $\text{Var}(x) = ab^2$.
Given this information, obtain the mean and variance of the posterior
density for $\theta$ as a function of $N$ and $\bar{y}$.

6. Consider Bayesian analysis of the model $y = X\beta + u$; $u \sim N(0, \sigma^2 I)$
with prior $\beta \sim N(0, c \times I)$, where the constant $c$ is varied below and
for simplicity $\sigma^2 = 1$ is known. Then the posterior mean
$\bar{\beta} = V\{(X'X)^{-1}\}^{-1} \hat{\beta}_{\text{OLS}} + V (cI)^{-1} \times 0$, where the posterior
variance $V$ is the inverse of the sum of the sample and prior precision
matrices. Suppose we have a sample with $(y, x)$ taking values such that
in a model with intercept


4 8 


1f 9 —4 16 
Iy ; 1 -1 __ . fos = 
x'x=| 5 A Ga ale a Ryle 


It will be easiest to do computations using matrix software in Mata or
Stata. Find the MLE of $\beta$ and its variance matrix. Find the posterior
mean and variance of $\beta$ if the prior for $\beta$ is $N(0, 0.01 \times I)$. Find the
posterior mean and variance of $\beta$ if the prior for $\beta$ is $N(0, 10 \times I)$.
Explain what we learn about the role of the strength of the prior on $\beta$.
Find the posterior mean and variance of $\beta$ if the prior for $\beta$ is $N(0,
0.01 \times I)$, but we now have 100 times as many observations, so $X'X$
and $X'y$ are 100 times larger. Explain what we learn about the role of
the sample size. Now, suppose the prior for $\beta$ is instead uniform (flat).
Show that the posterior for $\beta$ is the normal with mean $(X'X)^{-1}X'y$
and variance $(X'X)^{-1}$. (Hint: Use the result that
$(y - X\beta)'(y - X\beta) = \hat{u}'\hat{u} + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta})$, where
$\hat{\beta} = (X'X)^{-1}X'y$ and $\hat{u} = y - X\hat{\beta}$.)


Chapter 30 
Bayesian methods: Markov chain Monte Carlo 
algorithms 


30.1 Introduction 


Bayesian methods were introduced in chapter 29, where focus was on the 
linear regression model and estimation used the bayes prefix and the 
bayesmh command. 


In this chapter, we provide greater detail on Bayesian Markov chain
Monte Carlo (MCMC) methods, with application to the probit model. We
begin by using the bayesmh command with a user-defined likelihood. We
then obtain MCMC draws from the posterior distribution for the probit model
with flat prior using basic Mata code, rather than using the bayesmh
command. Two MCMC methods are illustrated: the random-walk
Metropolis-Hastings (MH) algorithm and the Gibbs sampler with data
augmentation. The latter method is possible for models such as probit and
tobit that are based on underlying latent variables.


The chapter concludes with a stand-alone treatment of multiple 
imputation of missing data. Multiple imputation is not necessarily Bayesian, 
and the initial discussion of missingness concepts and estimation and 
inference with multiply imputed datasets is relevant regardless of the 
imputation method used. A data example then presents one common 
method for imputation that uses MCMC draws based on data augmentation. 


30.2 User-provided log likelihood 


This section illustrates use of the llevaluator() option, which allows use
of the bayesmh command when the log-likelihood function is specified by
the user. The same generated data are used throughout the chapter.


30.2.1 Data and probit maximum-likelihood estimator 


Let the data on $N$ observations be denoted $(y, X)$, where $y$ is data on the
dependent variable and $X$ is data on exogenous regressors. And let $\theta$
denote the parameters of the conditional density of $y$ given $X$.


For the probit model, the dependent variable $y_i$, given $K$ regressors $x_i$
that include an intercept, takes value 0 or 1 with

$$ \begin{aligned} \Pr(y_i = 1|x_i, \beta) &= \Phi(x_i'\beta) \\ \Pr(y_i = 0|x_i, \beta) &= 1 - \Phi(x_i'\beta) \end{aligned} \qquad (30.1) $$

where $\beta$ is a $K \times 1$ parameter vector and $\Phi(\cdot)$ is the standard normal
cumulative distribution function.


We generate data using the latent variable formulation for the probit
model

$$ y_i^* = x_i'\beta + u_i, \qquad y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases} \qquad (30.2) $$


The following data-generating process (DGP) for a sample of size 100 
sets the intercept to 0.5 and the slope to 1.0, and the single regressor is 
standard normally distributed. 


. * Generate data N = 100 Pr[y=1|x] = PHI(0.5 + 1.0*x) and x ~ N(0,1)
. set obs 100
Number of observations (_N) was 0, now 100.

. set seed 1234567

. generate x = rnormal(0,1)

. generate ystar = 0.5 + 1*x + rnormal(0,1)

. generate y = (ystar > 0)

. generate cons = 1 // Mata code below requires a regressor for the intercept

. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           x |        100   -.1477064    1.003931  -2.583632   2.350792
       ystar |        100    .2901163     1.46373  -3.372719   3.316435
           y |        100         .59    .4943111          0          1
        cons |        100           1           0          1          1

. save mus230bayesgenerated, replace
(file mus230bayesgenerated.dta not found)
file mus230bayesgenerated.dta saved


The maximum likelihood estimator (MLE) is obtained using the probit 
command. 


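. * Fit probit model by MLE
. probit y x, nolog

Probit regression                                       Number of obs =    100
                                                        LR chi2(1)    =  42.67
                                                        Prob > chi2   = 0.0000
Log likelihood = -46.350193                             Pseudo R2     = 0.3152

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |   1.137895   .2236915     5.09   0.000     .6994677    1.576322
       _cons |   .4810185   .1591173     3.02   0.003     .1691543    .7928827
------------------------------------------------------------------------------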


The ML estimates of the intercept and slope are, respectively, 0.481 and 1.138 
compared with the DGP values of 0.5 and 1.0. 


30.2.2 The bayesmh command 


We specify a flat prior for the regression parameters, so all values of $\beta$ are
equally likely,

$$ \pi(\beta) \propto 1 $$

Even though this prior density is improper, the consequent posterior density
is proper.


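This model can be directly fit using command bayesmh, introduced in
section 29.3.7. We obtain


. * Fit probit model by command bayesmh
. bayesmh y x, likelihood(probit) prior({y:_cons x}, flat) rseed(10101)

Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood:
  y ~ probit(xb_y)

Prior:
  {y:_cons x} ~ 1 (flat)                                                   (1)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_y.

Bayesian probit regression                       MCMC iterations  =     12,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =     10,000
                                                 Number of obs    =        100
                                                 Acceptance rate  =      .2081
                                                 Efficiency:  min =     .09261
                                                              avg =       .104
Log marginal-likelihood = -47.855029                          max =      .1154

------------------------------------------------------------------------------
             |                                                Equal-tailed
           y |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
-------------+----------------------------------------------------------------
           x |  1.172479   .2315767   .006817   1.155511   .7693384   1.644086
       _cons |  .4912771   .1649868   .005421   .4913284   .1694699   .8135934
------------------------------------------------------------------------------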


The posterior means of the intercept and slope parameters are 0.491 and
1.172, close to the ML estimates of 0.481 and 1.138. The corresponding
posterior standard deviations are 0.165 and 0.232, compared with the ML
standard errors of 0.159 and 0.224. We expect some difference both because
of randomness in the MH algorithm and because, even with a flat prior, the
Bayesian results will differ from the ML results in finite samples.


30.2.3 The bayesmh command with user-provided evaluator 


The llevaluator() option of the bayesmh command enables the user to
provide the likelihood function in cases where it is not already available
as an option.


The evaluator function is defined in a program that has as arguments
lnf, the log-likelihood function, and any parameters of the model. Single
indexes such as $x'\beta$ are treated as a single argument xb, similar to the
programs for the ml command.


The log-likelihood function, given independence over $i$, is the sum of the
log densities for each observation. For the probit model defined in (30.1), the
log density for the $i$th observation is

$$ \ln f(y_i|\beta, x_i) = \begin{cases} \ln\{\Phi(x_i'\beta)\} & \text{if } y_i = 1 \\ \ln\{1 - \Phi(x_i'\beta)\} & \text{if } y_i = 0 \end{cases} $$

Program probitll defines the corresponding log-likelihood function.


. * Define evaluator function for probit model for input to command bayesmh
. program probitll
  1.   version 17
  2.   args lnf xb
  3.   tempvar lnfj
  4.   qui generate double `lnfj' = ln(normal(`xb')) if $MH_y == 1
  5.   qui replace `lnfj' = ln(normal(-`xb')) if $MH_y == 0
  6.   qui summarize `lnfj', meanonly
  7.   scalar `lnf' = r(sum)
  8. end


The official option likelihood(probit) for the bayesmh command is
then replaced by option llevaluator(probitll), where program probitll
defined the log-likelihood function. The prior, here a flat prior for $\beta$, is
defined in the usual way for the bayesmh command. This yields


. * Fit probit model by command bayesmh with user-provided evaluator
. bayesmh y x, llevaluator(probitll) prior({y:_cons x}, flat) rseed(10101)

Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood:
  y ~ probitll(xb_y)

Prior:
  {y:_cons x} ~ 1 (flat)                                                   (1)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_y.

Bayesian regression                              MCMC iterations  =     12,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =     10,000
                                                 Number of obs    =        100
                                                 Acceptance rate  =      .1993
                                                 Efficiency:  min =      .1045
                                                              avg =      .1145
Log marginal-likelihood = -47.877531                          max =      .1245

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
-------------+----------------------------------------------------------------
           x |  1.174587   .2294219   .006502   1.170978   .7509662   1.654543
       _cons |  .4919817   .1611742   .004985   .4946297   .1892769   .8195851
------------------------------------------------------------------------------


The efficiency and the acceptance rate are very similar to those for the
example in section 30.2.2, which used option likelihood(probit). The
posterior means of the intercept and slope parameters are 0.492 and 1.175,
compared with 0.491 and 1.172. The corresponding posterior standard
deviations are 0.161 and 0.229, compared with 0.165 and 0.232.


These small differences arise because option llevaluator() by default
sets the initial values of the parameters to zero, whereas option
likelihood(probit) uses different starting values. Alternative starting
values can be specified using, for example, the option initial({y:_cons}
0.5 {y:x} 1.2).


The manual entry [BAYES] bayesmh evaluators provides many
examples. These include models with additional parameters, as well as
additional coding in the program defining the evaluator to handle missing
observations, using temporary marker variable $MH_touse.


The evaluator() option of the bayesmh command enables the user to
provide both the likelihood function and the prior density for the parameters
by defining the sum of the log likelihood and the log of the prior density. Because
command bayesmh provides a wide range of priors, including flat priors, 
there is generally less need to use this option; an exception would be 
providing the Jeffreys prior because this prior is determined by the 
likelihood function. 


30.3 MH algorithm in Mata 


In this section, we repeat the preceding example but directly code the 
random-walk MH algorithm in Mata rather than use the bayesmh command. 


30.3.1 The log posterior 


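The likelihood function for the probit model can be written as

$$ L(y|\beta, X) = \prod_{i=1}^{N} \Phi(x_i'\beta)^{y_i} \{1 - \Phi(x_i'\beta)\}^{1-y_i} $$

where $\Phi(\cdot)$ is the standard normal distribution function. We use a flat
uninformative prior, so

$$ \pi(\beta) \propto 1 $$

Combining the likelihood and the prior, we see the posterior is

$$ \begin{aligned} p(\beta|y, X) &\propto L(y|\beta, X) \times \pi(\beta) \\ &\propto \prod_{i=1}^{N} \Phi(x_i'\beta)^{y_i} \{1 - \Phi(x_i'\beta)\}^{1-y_i} \times 1 \\ &\propto \prod_{i=1}^{N} \Phi(x_i'\beta)^{y_i} \{1 - \Phi(x_i'\beta)\}^{1-y_i} \end{aligned} $$

Note that the posterior $p(\beta|y, X)$ is defined only up to a scale factor.


The log posterior is then, up to an additive constant,

$$ \ln p(\beta|y, X) = \sum_{i=1}^{N} \left[ y_i \times \ln\{\Phi(x_i'\beta)\} + (1 - y_i) \times \ln\{1 - \Phi(x_i'\beta)\} \right] $$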


30.3.2 Random-walk Metropolis algorithm 


The random-walk MH algorithm is presented in section 29.3.3. Here we apply 
the algorithm to the probit model and implement the algorithm in Mata. 


At the $s$th round, the proposal distribution is the $N(\beta_{s-1}, c^2 I)$
distribution for specified $c$. So $\beta^* = \beta_{s-1} + v$, where $v$ is a draw from the
$N(0, c^2 I)$ distribution.


The acceptance rule is determined based in part by the ratio of posteriors,

$$ r_s = p(\beta^*)/p(\beta_{s-1}) $$

We then set $\beta_s = \beta^*$ if $u_s < r_s$ or $\beta_s = \beta_{s-1}$ if $u_s > r_s$, where $u_s$ is a draw
from the uniform(0, 1) distribution. Taking logs, the acceptance rule is
equivalent to setting $\beta_s = \beta^*$ if $\ln u_s < \ln r_s$.


For the probit posterior with flat prior defined above, we have

$$ \begin{aligned} \ln r_s = {} & \sum_{i=1}^{N} \left[ y_i \ln \Phi(x_i'\beta^*) + (1 - y_i) \ln\{1 - \Phi(x_i'\beta^*)\} \right] \\ & - \sum_{i=1}^{N} \left[ y_i \ln \Phi(x_i'\beta_{s-1}) + (1 - y_i) \ln\{1 - \Phi(x_i'\beta_{s-1})\} \right] \end{aligned} $$

Then $\beta_s = \beta^*$ if $\ln u_s < \ln r_s$; otherwise, $\beta_s = \beta_{s-1}$.

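The acceptance rule can be condensed into two small Mata functions. This is a minimal sketch under stated assumptions rather than the implementation used for the results in this chapter: the function names lpost_probit() and mh_step() are hypothetical, y is assumed to be an N x 1 vector of zeros and ones, X an N x K regressor matrix that includes a column of ones, and c the proposal scale.

```stata
mata:
// Log posterior (flat prior) of the probit model at coefficient vector b,
// defined up to an additive constant
real scalar lpost_probit(real colvector y, real matrix X, real colvector b)
{
    real colvector p
    p = normal(X*b)                            // Pr(y=1|x) for each observation
    return(sum(y:*ln(p) + (1:-y):*ln(1:-p)))
}

// One random-walk MH step: returns the candidate if it is accepted,
// and the previous draw otherwise
real colvector mh_step(real colvector y, real matrix X,
                       real colvector bprev, real scalar c)
{
    real colvector bstar
    real scalar    lnr
    bstar = bprev + c*rnormal(rows(bprev), 1, 0, 1)   // candidate draw
    lnr   = lpost_probit(y, X, bstar) - lpost_probit(y, X, bprev)
    if (ln(runiform(1,1)) < lnr) return(bstar)        // accept candidate
    return(bprev)                                     // reject: keep old draw
}
end
```

Calling mh_step() repeatedly, feeding each returned vector back in as bprev, generates the chain. Recomputing the log posterior at bprev on every step is wasteful; storing the value from the last accepted draw avoids the repeated evaluation.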
30.3.3 Numerical example in Mata 


We first define some global macros. The first 10,000 MH draws will be
discarded (burn-in), and the next 10,000 draws kept. For draws from
$N(0, c^2 I)$, we set $c = 0.25$.


. * Define globals for number of reps and the key tuning parameter
. global s1 10000 // Number of retained reps
. global s0 10000 // Number of burnin reps
. global sdscale 0.25 // May need to change c in proposal b + $sdscale*N(0,I)


The key tuning parameter is the constant $c$, which in the current example is
set to $c = 0.25$. A rationale for this value is that ideally the proposal
distribution is close to the posterior distribution. With a flat prior, the
posterior is proportional to the likelihood, and the ML estimates
of the two model parameters had standard errors of 0.22 and 0.16.


The bayesmh command initially sets c = 2.38/K, following a suggestion 
made in several articles, where K is the number of model parameters in an 
MH block. In the current example, with c = 2.38/2 = 1.19, the chain 
diverged. The bayesmh command includes a rule for changing the value of c 
every 100 draws (the default), so it is more likely to overcome a poor initial 
value of c. And option scale() allows specification of a different initial
value for c. 
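As an illustration of these tuning options, the call below starts the adaptive proposal from a different initial scale and adapts it more frequently. It is a sketch only: the values 1 and 50 are arbitrary choices for illustration, not recommendations, and the results will differ from those reported above.

```stata
* Same probit model, with a user-chosen initial scale and with the
* proposal adapted every 50 draws instead of the default 100
* (both values are illustrative assumptions)
bayesmh y x, likelihood(probit) prior({y:_cons x}, flat) rseed(10101) ///
    scale(1) adaptation(every(50))
```

Comparing the acceptance rate and efficiency in the header of the output across such runs shows how sensitive the sampler is to the initial tuning.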


The following lengthy Mata code, based on the MATLAB code of 
Koop (2003), implements the MH algorithm. 


. * Mata to obtain the posterior draws of b for probit MH algorithm
. set seed 10101

. mata
------------------------------------------------- mata (type end to exit) -----
: // (1) Create y vector and X matrix from Stata dataset using st_view()
: st_view(y=., ., "y")                   // Dependent
: st_view(X=., ., ("cons", "x"))         // Regressors
: Xnames = ("cons", "x")                 // Used to label output
: // Calculate a few quantities outside the loop for later use
: n = rows(X)
: k = cols(X)
: ones = J(n,1,1)
: // Specify the number of replications
: s0 = $s0                               // Number of burnin reps
: s1 = $s1                               // Number of retained reps
: s = s0+s1                              // Total reps
: // Store all draws and MH acceptance rate in the following matrices
: b_all = J(k,s1,0)
: accept_all = J(1,s1,0)
: // Initialization
: bdraw = J(k,1,0)                       // Starting b value is vector of zeros
: lpostdraw = -1*10^10                   // Starting value of ln(posterior) is small
:                                        // So accept initial MH draw
: // (2) Now do MH loop and make the posterior draws
: // Begin MH loop
: for (irep=1; irep<=s; irep++) {
>     // Draw new candidate value of b from MH random-walk chain
>     bcandidate = bdraw + $sdscale*rnormal(k,1,0,1)
>     // Note: For different data, you may need to change the global sdscale
>     // Best is bcandidate = bdraw + z, where z ~ N(0, post. variance of b)
>     // Compute the log-posterior at the candidate value of b
>     // The assumed prior for b is uninformative
>     // so the posterior is proportional to the usual probit likelihood
>     probitprob = normal(X*bcandidate)
>     lpostcandidate = ones'(y:*ln(probitprob) + (ones-y):*ln(ones-probitprob))
>     // Accept the candidate draw on basis of posterior probability ratio
>     //   if uniform < (posterior(bcandidate) / posterior(bdraw))
>     //   where bcandidate is current b and bdraw is previous b
>     // Taking logs the rule is the same as
>     //   if ln(uniform) < (lpostcandidate - lpostdraw)
>     laccprobability = lpostcandidate - lpostdraw
>     accept = 0
>     if ( ln(runiform(1,1)) < laccprobability ) {
>         lpostdraw = lpostcandidate
>         bdraw = bcandidate
>         accept = 1
>     }
>     // Store the draws after burn-in of b and whether accept draw
>     if (irep>s0) {
>         // After discarding burn-in, store all draws
>         j = irep-s0
>         b_all[.,j] = bdraw             // These are the posterior draws
>         accept_all[.,j] = accept       // These are one if new draw accepted
>     }
> }
: // End MH loop
: // (3) Pass results back to Stata
: // The next command is needed for conformability. It requires $s1 > N
: stata("set obs $s1")
Number of observations (_N) was 100, now 10,000.
: accept = accept_all'
: st_addvar("byte", "accept")
  5
: st_store(., "accept", accept)
: beta = b_all'
: // Loop sends each column of beta to Stata as separate variable beta_i
: for (i=1; i<=k; i++) {
>     v = beta[.,i]
>     vname = "beta" + strofreal(i)
>     st_addvar("double", vname)
>     st_store(., vname, v)
> }
  6
  7
: end


The Mata code involves three segments. The first segment, denoted (1), 
reads in the data and initializes key constants and matrices. 


The second segment, denoted (2), implements the MH algorithm. Matrix
bdraw contains the previous draw $\beta_{s-1}$, matrix bcandidate contains the
proposal draw $\beta^*$, and if the proposal draw is accepted, then matrix bdraw is
updated.


The third segment, denoted (3), passes the matrices beta and accept
back to Stata as variables because some of the subsequent analysis is easier
to do in Stata than in Mata. The code uses Mata functions st_addvar() and
st_store(). Mata function strofreal() is used to convert the real number
index i to a string variable "i" so that, for example, a variable named beta2
is created when i=2.


It would be much simpler to directly pass the matrices to Stata using
Mata command st_matrix("beta", beta), and then in Stata to give
command svmat beta to convert a matrix to variables. The reason for not
taking this simpler approach is that while matrices in Mata can be very large, 
in Stata their size is restricted. For Stata/BE, the limit is 800 x 800, and even 
Stata/SE and Stata/MP restrict matrix size in Stata 17 to be no more than, 
respectively, 11000 x 11000 and 65534 x 65534. 
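For completeness, the simpler route would look roughly as follows. This is a sketch that assumes the Mata matrix beta from the preceding code still exists and that its number of rows is within the matrix-size limit of the Stata edition in use; it would be run instead of the st_addvar() loop because it creates variables with the same names.

```stata
* Copy the Mata matrix beta to a Stata matrix of the same name
mata: st_matrix("beta", beta)

* Convert the columns of Stata matrix beta to variables beta1, beta2, ...
svmat double beta, names(beta)
```

With 10,000 retained draws, this fits within the Stata/SE and Stata/MP limits quoted above but not within the Stata/BE limit.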


Command summarize provides a summary of the posterior draws and 
acceptance rate, and command centile gives the 95% Bayesian credible 
region. 


. * Analyze the posterior draws from probit MH algorithm
. summarize beta* accept

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       beta1 |     10,000    .4868512    .1573034  -.0620936   1.173329
       beta2 |     10,000    1.167617    .2263273   .4205724   2.051684
      accept |     10,000       .4302    .4951287          0          1

. centile beta2, centile(2.5, 97.5)

                                                          Binom. interp.
    Variable |       Obs  Percentile    Centile        [95% conf. interval]
-------------+-------------------------------------------------------------
       beta2 |    10,000        2.5    .7456415         .726276    .7578553
             |                 97.5    1.631214        1.621459     1.63932


Compared with the results from the bayesmh command given in
section 30.2.2, the posterior means of the intercept and slope parameters are
0.487 and 1.168, rather than 0.491 and 1.172. The corresponding posterior
standard deviations are 0.157 and 0.226, compared with 0.165 and 0.232.
The 95% Bayesian credible region for the slope parameter is [0.746, 1.631]
compared with [0.769, 1.644].


Because the MH posterior draws are correlated, the 10,000 retained draws
convey less precision than 10,000 independent draws. This loss is measured
by the efficiency, computed here from the squared autocorrelation
coefficients $\rho_j$ of the draws up to lag $J$: $\text{eff} = 1/\{1 + 2 \times \sum_{j=1}^{J} \rho_j^2\}$.


To compute the autocorrelations to J = 100 and hence the efficiency, we 
use the time-series commands in Stata as follows. 


. * Compute the efficiency of the MH algorithm
. generate s = _n

. tsset s

Time variable: s, 1 to 10000
        Delta: 1 unit

. qui ac beta2, lags(100) gen(ac)

. qui generate ac_sq = ac^2

. qui summarize ac_sq

. di "Efficiency = " 1/(1+2*r(sum))
Efficiency = .18277753


The efficiency of 0.183 for $\beta_2$ is good.
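One way to read this number, offered here as a supplementary illustration using the efficiency as computed above, is as an approximate effective sample size: the efficiency multiplied by the number of retained draws $S = 10{,}000$.

```latex
\text{ESS} \approx \text{eff} \times S = 0.1828 \times 10{,}000 \approx 1{,}828
```

On this reading, the 10,000 correlated MH draws of $\beta_2$ carry roughly as much information about the posterior mean as about 1,800 independent draws would.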


We next construct various diagnostics graphs for $\beta_2$, similar to those
produced by the bayesgraph diagnostics command.


. * Plot various diagnostics for the posterior draws of b2
. qui ac beta2, title("Autocorrelations") lags(100)
>     note(" ", ring(0) pos(3))

. qui line beta2 s if s < 100, title("Trace: First 100 draws")

. qui line beta2 s, title("Trace")

. qui graph twoway (kdensity beta2) (kdensity beta2 if s<=5000)
>     (kdensity beta2 if s>5000), title("Density: All, 1st half, 2nd half")
>     legend(off) note(" ", ring(0) pos(3))


From figure 30.1, the diagnostic graphs all look good. The 
autocorrelations die out fast, many of the first 100 draws after the burn-in 
draws are accepted, the trace for all 10,000 retained draws shows no pattern, 
and the densities for the first and second halves of the retained draws are 
similar. 

Figure 30.1. Diagnostics for posterior draws of education
coefficient 


30.4 Data augmentation and the Gibbs sampler in Mata 


For Bayesian MCMC methods, it is usually advantageous to use the Gibbs 
sampler, presented in sections 29.3.5 and 29.7.2, wherever possible. 
However, to do so requires analytical results for some of or all the 
conditional posteriors. 


Analytical results are most readily available for the linear regression 
model under normality; some of these results were given in section 29.5. 
Several leading nonlinear regression models, notably, tobit, binary probit, 
and multinomial probit, are models based on latent variables that follow a 
linear normal model. As explained next, data augmentation methods enable 
application of the Gibbs sampler to these latent variable models. 


30.4.1 Data augmentation 


Latent variable models, such as probit and tobit models, can be modeled as
observing $y_1, \ldots, y_N$ based on latent variables $y_1^*, \ldots, y_N^*$. Furthermore, for
many of these models, $y_1^*, \ldots, y_N^*$ completely determines $y_1, \ldots, y_N$. For
example, for the probit model, $y_i = 1(y_i^* > 0)$.


For simplicity, we initially suppress conditioning on regressors $X$ and
consider the general case with parameters $\theta$. The desired posterior is $p(\theta|y)$.
Data augmentation expands this posterior to include the latent variables
$y^* = (y_1^*, \ldots, y_N^*)'$. The goal then is to make draws from the augmented
posterior $p(\theta, y^*|y)$. Given draws of $\theta$ and $y^*$ from the joint posterior, the
draws of $y^*$ are ignored, while the draws of $\theta$ are draws from the desired
posterior $p(\theta|y)$ because $p(\theta|y)$ is the marginal posterior of $\theta$ with respect
to the joint posterior $p(\theta, y^*|y)$.


The draws from the augmented posterior $p(\theta, y^*|y)$ can be obtained
using the Gibbs sampler, with alternating draws from the conditional
posterior $p(\theta|y^*, y)$ and the conditional posterior $p(y^*|\theta, y)$, assuming that
these conditional posteriors are known.


30.4.2 Gibbs sampler data-augmentation algorithm for probit 


Here we focus on the probit model. From (30.2), the latent variable is
$y_i^* = x_i'\beta + u_i$ and we observe $y_i = 1$ or 0 according to whether $y_i^* > 0$.


If the latent variable $y_i^*$ were observed and a flat prior were used for $\beta$, then
Bayesian analysis would be straightforward. Because $u_i \sim N(0, 1)$, the log
likelihood is $\ln L(y^*|\beta, X) = -(N/2)\ln(2\pi) - (1/2)\sum_{i=1}^{N}(y_i^* - x_i'\beta)^2$.
Given prior density $\pi(\beta) \propto 1$, the log-posterior density is actually known
and is

$$ \ln p(\beta|y^*, X) = -\frac{N}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^{N}(y_i^* - x_i'\beta)^2 $$

Of course, the latent variable is not observed. The data-augmentation method
treats the latent variables as $N$ additional parameters $y_1^*, \ldots, y_N^*$ and makes
random draws of $y_1^*, \ldots, y_N^*$.


Specifically, given knowledge of $\beta$ and the data, the distribution of these
additional parameters is known because if $y_i = 1$, then $y_i^* = x_i'\beta + u_i > 0$.
So $y_i^*$ is a draw from the $N(x_i'\beta, 1)$ distribution truncated from below at 0.
Similarly, if $y_i = 0$, then $y_i^*$ is a draw from the $N(x_i'\beta, 1)$ distribution
truncated from above at 0.
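One standard way to make such truncated normal draws is the inverse-CDF method, sketched below in Mata. This is an illustration only and is not necessarily the method used in the code for this chapter: the function name draw_ystar() is hypothetical, y is assumed to be an N x 1 vector of zeros and ones, and mu is the N x 1 vector of linear predictors.

```stata
mata:
// Draw ystar_i ~ N(mu_i, 1), truncated to ystar_i > 0 if y_i = 1
// and to ystar_i <= 0 if y_i = 0, by inverting the normal c.d.f.
real colvector draw_ystar(real colvector y, real colvector mu)
{
    real colvector u, plow, p
    u    = runiform(rows(mu), 1)       // uniform(0,1) draws
    plow = normal(-mu)                 // Pr(ystar_i <= 0) given mu_i
    // y_i = 1: p uniform on (plow, 1);  y_i = 0: p uniform on (0, plow)
    p    = y:*(plow + u:*(1:-plow)) + (1:-y):*(u:*plow)
    return(mu + invnormal(p))          // shift standard normal draw by mu_i
}
end
```

Given a current coefficient draw b, the call draw_ystar(y, X*b) would return one draw of the full vector of latent variables. For values of mu far from zero, the inverse-CDF method can lose numerical accuracy in the tails, so more robust truncated normal samplers are sometimes preferred.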


The conditional posterior for the parameters $\beta$ given the latent variables is
$p(\beta|\mathbf{y}, \mathbf{y}^*, \mathbf{X})$. This simplifies to $p(\beta|\mathbf{y}^*, \mathbf{X})$ because conditioning on $\mathbf{y}$ in
addition to $\mathbf{y}^*$ is irrelevant: knowledge of $\mathbf{y}^*$ subsumes knowledge of $\mathbf{y}$.


The sth round of the Gibbs sampler draws $\beta_s$ from
$p(\beta_s|y_{1,s-1}^*, \ldots, y_{N,s-1}^*, \mathbf{X})$ and then draws $y_{1,s}^*, \ldots, y_{N,s}^*$ from
$p(y_1^*, \ldots, y_N^*|\beta_s, \mathbf{y}, \mathbf{X})$.


30.4.3 Numerical example 


We continue with the same probit example as estimated previously using the 
random-walk MH algorithm. 


We first define some global macros. 


. * Define globals for number of reps and the key tuning parameter 
. use mus230bayesgenerated, clear 


. global s1 10000 // Number of retained reps
. global s0 10000 // Number of burnin reps


The following Mata code, based on the MATLAB code of Koop (2003), 
implements the Gibbs sampler with data augmentation. 


. set seed 10101

. mata
------------------------------------------------- mata (type end to exit) -----
: // (1) Create y vector and X matrix from Stata dataset using st_view
: st_view(y=., ., "y")                    // Dependent
: st_view(X=., ., ("cons", "x"))          // Regressors
: Xnames = ("cons", "x")                  // Used to label output

: // Calculate a few quantities outside the loop for later use
: n = rows(X)
: k = cols(X)
: Xsquare = cross(X,X)
: Xtxinv = invsym(Xsquare)
: Xtxinvchol = cholesky(Xtxinv)

: // Specify the number of replications
: s0 = $s0                                // Number of burnin reps
: s1 = $s1                                // Number of retained reps
: s = s0+s1                               // Total reps

: // Store all draws in the following matrices
: y1 = J(s0+s1,1,0)
: b_all = J(k,s1,0)

: // Prior for beta is noninformative
: // Choose a starting value for latent data
: ystar = y

: // (2) Now, do Gibbs sampler loop and make the posterior draws
: for (irep=1; irep<=s; irep++) {
>     // Posterior step: draw from beta | y* ~ N[bols, (X'X)^-1]
>     // This is using noninformative prior for beta
>     bols = Xtxinv*cross(X,ystar)
>     b1 = bols
>     bdraw = b1 + Xtxinvchol*rnormal(k,1,0,1)  // invnormal(uniform(k,1))
>     // Imputation step: make one draw of vector ystar
>     // where for ith observation ystar_i | y,b is truncated normal
>     // Right: If y = 1, we need draw from truncated N[0,1] with ystar > -mu
>     // Left: If y = 0, we need draw from truncated N[0,1] with ystar < -mu
>     for (i=1; i<=n; i++) {
>         mu = X[i,.]*bdraw
>         if (y[i,1]==0) {
>             uright = normal(-mu)*uniform(1,1)
>             ystar[i,1] = mu + invnormal(uright)
>         }
>         else {
>             uleft = normal(-mu) + (1-normal(-mu))*uniform(1,1)
>             ystar[i,1] = mu + invnormal(uleft)
>         }
>     }
>     // Store the draws of b after burn-in plus a diagnostic used in Koop
>     if (irep>s0) {
>         // After discarding burn-in, store all draws
>         j = irep-s0
>         b_all[.,j] = bdraw              // These are the posterior draws
>     }
> }

: // End Gibbs loop

: // (3) Pass results back to Stata
: // The next command is needed for conformability. It requires $s1 > N
: stata("set obs $s1")
Number of observations (_N) was 100, now 10,000.

: beta = b_all'

: // Loop sends each column of beta to Stata as separate variable beta_i
: for (i=1; i<=k; i++) {
>     v = beta[.,i]
>     vname = "beta" + strofreal(i)
>     st_addvar("double", vname)
>     st_store(., vname, v)
> }

: end


The first and third segments are essentially the same as for the MH algorithm 
example in section 30.3.3. The second segment, labeled (2), implements the 
Gibbs sampler. 


We obtain the following results. 


. * Analyze the posterior draws from probit Gibbs sampler algorithm
. summarize beta*

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       beta1 |     10,000    .4842779    .1582509  -.2082249   1.158187
       beta2 |     10,000    1.173542    .2259341   .4758451   2.346783

. centile beta2, centile(2.5, 97.5)

                                                          Binom. interp.
    Variable |       Obs  Percentile    Centile        [95% conf. interval]
-------------+-------------------------------------------------------------
       beta2 |    10,000        2.5    .7512463        .7392763    .7636364
             |                 97.5    1.646951        1.629367    1.660316

Compared with the results from the bayesmh command given in
section 30.2.2, the posterior means of the intercept and slope parameters are
0.484 and 1.174, rather than 0.491 and 1.172. The corresponding posterior 
standard deviations are 0.158 and 0.226, compared with 0.165 and 0.232. 
The acceptance rate is automatically one for this method that uses only the 
Gibbs sampler throughout. The 95% Bayesian credible region for the slope 
parameter is [0.751, 1.647] compared with [0.769, 1.644]. 


We again compute the efficiency of the method. 


. * Compute the efficiency of the Gibbs sampler algorithm 
. generate s= _n 


. tsset s 


Time variable: s, 1 to 10000 
Delta: 1 unit 


. qui ac beta2, lags(100) gen(ac) 
. qui generate ac_sq = ac^2
. qui summarize ac_sq


. di "Efficiency = " 1/(1+2*r(sum))
Efficiency = .20486695


The efficiency of 0.205 is quite similar to the efficiency obtained using the 
MH algorithm. 


We again construct various diagnostic graphs for $\beta_2$ similar to those
produced by the bayesgraph diagnostics command.


. * Plot various diagnostics for the posterior draws of b2
. qui ac beta2, title("Autocorrelations") lags(100)
>     note(" ", ring(0) pos(3))

. qui line beta2 s if s < 100, title("Trace: First 100 draws")
. qui line beta2 s, title("Trace")
. qui graph twoway (kdensity beta2) (kdensity beta2 if s<=5000)
>     (kdensity beta2 if s>5000), title("Density: All, 1st half, 2nd half")
>     legend(off) note(" ", ring(0) pos(3))

Figure 30.2. Diagnostics for posterior draws of education
coefficient 


The diagnostic plots all look good. The trace for the first 100 retained draws 
illustrates that a different posterior draw of the parameters is obtained at 
each round, so the acceptance rate is one. 


30.4.4 Further examples 


Data augmentation can be applied to a range of models, especially those 
based on latent normal variable models. A good reference is Chib (2001). 


The preceding example with a flat prior can be modified to allow a
normal prior for $\beta$, in which case the conditional posterior $p(\beta|\mathbf{y}^*, \mathbf{X})$ is
also normal.
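

The form of this normal conditional posterior follows from the standard conjugate result for a normal linear model with unit error variance. The block below is a sketch of that result; the prior mean $\mathbf{b}_0$ and prior variance $\mathbf{V}_0$ are notation assumed here for illustration and do not appear in the original text.

```latex
\begin{aligned}
\text{Prior:}\quad & \beta \sim N(\mathbf{b}_0, \mathbf{V}_0) \\
\text{Latent model:}\quad & \mathbf{y}^*\,|\,\beta, \mathbf{X} \sim N(\mathbf{X}\beta, \mathbf{I}_N) \\
\text{Conditional posterior:}\quad & \beta\,|\,\mathbf{y}^*, \mathbf{X} \sim N(\mathbf{b}_1, \mathbf{V}_1), \\
& \mathbf{V}_1 = \left(\mathbf{V}_0^{-1} + \mathbf{X}'\mathbf{X}\right)^{-1}, \qquad
  \mathbf{b}_1 = \mathbf{V}_1\left(\mathbf{V}_0^{-1}\mathbf{b}_0 + \mathbf{X}'\mathbf{y}^*\right)
\end{aligned}
```

As the prior becomes diffuse, so that $\mathbf{V}_0^{-1}$ approaches zero, $\mathbf{b}_1$ and $\mathbf{V}_1$ reduce to the OLS estimate $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}^*$ and $(\mathbf{X}'\mathbf{X})^{-1}$, which is the flat-prior case used in the numerical example.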


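A more advanced example is the multinomial probit model. The random
utility for the jth of m alternatives is $U_{ij} = \mathbf{x}_{ij}'\beta + \varepsilon_{ij}$, where the $m \times 1$
error vector $\varepsilon_i \sim N(\mathbf{0}, \Sigma_\varepsilon)$. The jth alternative is chosen, so $y_{ij} = 1$, if
$U_{ij} > U_{ik}$, all $k \neq j$. A Bayesian analysis might specify a normal prior for $\beta$
and a Wishart prior for $\Sigma_\varepsilon^{-1}$. Data augmentation introduces the latent
utilities $\mathbf{U}_i = (U_{i1}, \ldots, U_{im})$ as additional parameters. Let
$\mathbf{U} = (\mathbf{U}_1, \ldots, \mathbf{U}_N)$ and $\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_N)$. The Gibbs sampler for the joint
posterior $p(\beta, \mathbf{U}, \Sigma_\varepsilon|\mathbf{y}, \mathbf{X})$ then cycles between (1) the conditional posterior
for $\beta|\mathbf{U}, \Sigma_\varepsilon, \mathbf{y}, \mathbf{X}$; (2) the conditional posterior for $\Sigma_\varepsilon|\beta, \mathbf{U}, \mathbf{y}, \mathbf{X}$; and
(3) the conditional posterior for $\mathbf{U}|\beta, \Sigma_\varepsilon, \mathbf{y}, \mathbf{X}$. See McCulloch and
Rossi (1994) for details and application.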


30.5 Multiple imputation 


Imputation methods were briefly mentioned in sections 2.4.6 and 19.10.5. 
Here we consider multiple imputation, where missing observations are 
imputed several times. 


30.5.1 Missingness mechanisms 


Let $\mathbf{W}$ denote an $N \times p$ data matrix, let $\mathbf{W}_{obs}$ denote the observed part of
$\mathbf{W}$, and let $\mathbf{W}_{mis}$ denote the missing part of $\mathbf{W}$. The goal of multiple
imputation is to impute values of $\mathbf{W}_{mis}$ from $\mathbf{W}_{obs}$. This requires
specification of a stochastic model for $\mathbf{W}$ or for components of $\mathbf{W}$. The
strength of assumptions made varies with the way in which data are
missing.


Data are missing at random (MAR) if the cause of missing data is
unrelated to the missing data, though it may be related to observed data. Let $\mathbf{S}$ denote
an $N \times p$ matrix of indicators with elements of zero or one depending on
whether corresponding values in $\mathbf{W}$ are missing. Then data are MAR if
$\Pr(\mathbf{S}|\mathbf{W}_{obs}, \mathbf{W}_{mis}) = \Pr(\mathbf{S}|\mathbf{W}_{obs})$. This situation arises in a regression
context where analysis is conditional on regressors and the only missing
data are those on exogenous regressors.


If instead the cause of missing data is unrelated to either observed data
or missing data, then data are said to be missing completely at random
(MCAR). Formally, $\Pr(\mathbf{S}|\mathbf{W}_{obs}, \mathbf{W}_{mis}) = \Pr(\mathbf{S})$. While consistent estimation
and valid inference are then possible using only nonmissing data, imputing
MCAR data can improve efficiency.


A more challenging situation is where the cause of the missing data is
related to missing data. Then $\Pr(\mathbf{S}|\mathbf{W}_{obs}, \mathbf{W}_{mis}) \neq \Pr(\mathbf{S}|\mathbf{W}_{obs})$, and the
data are said to be missing not at random (MNAR). This assumption applies
in a regression context where data are missing on endogenous regressors or
on the dependent variable.
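

The three mechanisms can be illustrated with a small simulation. The following is a sketch under assumed values: the variable names, the 30% MCAR rate, and the 0.5 cutoffs are illustrative choices and are not taken from the original text.

```stata
* Sketch: generate MCAR, MAR, and MNAR missingness for a variable x
clear
set seed 10101
set obs 10000
generate z = rnormal()                          // always observed
generate x = 0.5*z + sqrt(0.75)*rnormal()       // Var(x) = 1, Cor(x,z) = 0.5

generate x_mcar = cond(runiform() < 0.3, ., x)  // unrelated to x and z
generate x_mar  = cond(z > 0.5, ., x)           // depends only on observed z
generate x_mnar = cond(x > 0.5, ., x)           // depends on the value of x itself

* Compare the observed values of x under each mechanism
summarize x x_mcar x_mar x_mnar

* Regression of x on z using the observed cases only
regress x_mar z
regress x_mnar z
```

Under MAR the observed mean of x differs from the full-data mean, but the regression of x on z is still estimated consistently because missingness depends only on z. Under MNAR the sample is selected on x itself, so even the regression on z is distorted.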


Simple imputation methods such as replacing missing values with 
nonmissing mean values can lead to inconsistent estimation and invalid 
statistical inference; the chapter exercise 5 provides an example. Multiple 
imputation can avoid these problems, provided the model used to impute 
missing data is a good model. Multiple imputation is best used in situations 
where data are MAR or MCAR. In principle, one can also impute when data 
are MNAR, but then the necessary stochastic assumptions are much stronger. 


30.5.2 Estimation and inference for multiple imputation 


A single imputation creates data $\mathbf{W}_{mis,imp}$ that are imputed values of the
missing data $\mathbf{W}_{mis}$. Then the completed imputed dataset is
$\mathbf{W}_{imp} = (\mathbf{W}_{obs}, \mathbf{W}_{mis,imp})$. Given $\mathbf{W}_{imp}$, we can obtain an estimate $\hat{\theta}$ of
$\theta$ in the usual way. But regular statistical inference based on $\hat{\theta}$ will be
invalid because it fails to account for the additional randomness caused by
imputing $\mathbf{W}_{mis,imp}$.


Multiple imputation imputes the missing data m times, leading to m
independent imputations $\mathbf{W}_{mis,imp,r}$, $r = 1, \ldots, m$, and hence m completed
imputed datasets $\mathbf{W}_{imp,r}$, $r = 1, \ldots, m$.


Each dataset is used to fit a model with parameter $\theta$, leading to m
estimates $\hat{\theta}_1, \ldots, \hat{\theta}_m$. Each estimate $\hat{\theta}_r$ has two sources of randomness:
the usual sampling variability measured by the usual estimated variances
$\hat{V}_r$ plus additional randomness due to imputation.


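The parameter $\theta$ is estimated by $\bar{\theta} = (1/m)\sum_{r=1}^{m}\hat{\theta}_r$, the average of the
m estimates $\hat{\theta}_1, \ldots, \hat{\theta}_m$. The variance of $\bar{\theta}$ is estimated by the sum of the
average of $\hat{V}_1, \ldots, \hat{V}_m$ plus the variance in $\hat{\theta}_1, \ldots, \hat{\theta}_m$ across the m
imputed samples. The latter quantity accounts for the additional variation
due to imputation. Then the estimate $\bar{\theta}$ has estimated variance


$$\hat{V}(\bar{\theta}) = \frac{1}{m}\sum_{r=1}^{m}\hat{V}_r + \frac{1 + (1/m)}{m-1}\sum_{r=1}^{m}(\hat{\theta}_r - \bar{\theta})(\hat{\theta}_r - \bar{\theta})'$$


where use of $\{1 + (1/m)\}/(m - 1)$ rather than the more obvious
$1/(m - 1)$ is a finite-imputation correction.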


The need to impute leads to an efficiency loss compared with having all 
data available. This efficiency loss is greater the smaller is m and the more 
data that are missing. Additionally, this efficiency loss varies with the 
model used and the particular data at hand. So there is no clear best value 
for m. A conservative choice for researchers doing their own imputations 
might be m = 20. 


Many major individual-level datasets impute missing data for at least 
key variables such as income. Typically, there is only a single imputation. 
Some surveys instead publish several imputed datasets to enable users to 
control for the additional uncertainty due to imputation. Then low values 
such as m = 5 are typically used. 


30.5.3 Regression-based imputation 


One method for data imputation is regression-based imputation for 
observations that are missing on a single variable. 


We consider a simple example where data are missing for some
observations of the variable x and are completely available for all
observations of variables z. Thus, a completed dataset would be
$\mathbf{W} = (\mathbf{x}, \mathbf{Z})$. Partition missing and nonmissing observations on variable x
as $\mathbf{x} = (\mathbf{x}_o, \mathbf{x}_m)$ and correspondingly partition observations on variables z
as $\mathbf{Z} = (\mathbf{Z}_o, \mathbf{Z}_m)$.


Suppose x is a count that we believe can be well modeled
parametrically as being Poisson distributed with mean $\exp(\mathbf{z}'\beta)$. Then
regression-based imputation of $\mathbf{x}_m$ is obtained as follows.


First, use nonmissing observations to perform Poisson regression of $\mathbf{x}_o$
on $\mathbf{Z}_o$, giving estimate $\hat{\beta}$ with robust variance estimate $\hat{V}$. Second, make a
draw $\beta^*$ from the $N(\hat{\beta}, \hat{V})$ distribution, and then for each observation i of
the $N_{mis}$ observations missing data on x, make a draw from the Poisson
distribution with mean $\exp(\mathbf{z}_{mi}'\beta^*)$, where $\mathbf{z}_{mi}$ denotes the ith entry in $\mathbf{Z}_m$.


This yields one set of $N_{mis}$ imputed values on $\mathbf{x}_m$. To obtain additional
sets of imputed values, repeat the second step.
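

The two steps above can be sketched by hand as follows. This is an illustration under assumptions: the simulated data, the variable names (x, z1, z2, xmiss, x_imp), and the coefficient values are hypothetical and are not part of the original example.

```stata
* Sketch: one regression-based imputation of a Poisson count x
clear
set seed 10101
set obs 500
generate z1 = rnormal()
generate z2 = rnormal()
generate x  = rpoisson(exp(0.5 + 0.3*z1 - 0.2*z2))
replace  x  = . if z1 > 1                // MAR: depends only on observed z1
generate byte xmiss = missing(x)

* Step 1: Poisson regression on the nonmissing observations
quietly poisson x z1 z2, vce(robust)

* Step 2a: draw beta* from N(bhat, Vhat) using the Cholesky factor of Vhat
mata:
b     = st_matrix("e(b)")                // 1 x 3 row vector: z1, z2, _cons
V     = st_matrix("e(V)")
bstar = b + (cholesky(V)*rnormal(cols(b), 1, 0, 1))'
st_matrix("bstar", bstar)
end

* Step 2b: draw imputed counts from Poisson with mean exp(z'beta*)
generate double mustar = exp(bstar[1,1]*z1 + bstar[1,2]*z2 + bstar[1,3])
generate x_imp = x
replace  x_imp = rpoisson(mustar) if xmiss
summarize x x_imp
```

Repeating steps 2a and 2b with fresh draws gives further imputed datasets. In practice, the poisson method of mi impute described in section 30.5.5 automates this kind of procedure.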


This imputation method can be given a Bayesian interpretation. Then
$\beta^*$ is a draw from $N(\hat{\beta}, \hat{V})$, which is the asymptotic approximation to the
posterior distribution of $\beta$ given the noninformative prior $\pi(\beta) \propto$ constant.


This method can be adapted to impute missing values for a single 
variable by using other parametric models, such as logit for binary data or 
normal with independent and identically distributed errors for continuous 
data. To make parametric assumptions more reasonable, the imputation 
might be applied to a transformation of x. For example, if x is right skewed, 
then first impute $\ln x$, and then set $x = \exp(\ln x)$.


Note that if the variable x with missing data is to be ultimately included 
in a regression with completely observed dependent variable y, then the 
variables z used in imputing x should include y. Variables x and y are 
related, and we have made the assumption of MAR. 


30.5.4 Imputation by data augmentation 


We now consider simultaneously imputing several variables with missing 
values, under the assumption of joint normality of all variables. The 
imputation method is a Bayesian method that uses data augmentation to 
treat the missing values as parameters and then imputes these missing 
values as MCMC draws from the posterior. 


Partition $\mathbf{w}$ into $(\mathbf{x}, \mathbf{z})$, where $\mathbf{x}$ denotes one or more incompletely
observed variables and $\mathbf{z}$ denotes completely observed variables to be used
in imputation. Suppose that $\mathbf{x}_i \sim N(\Pi\mathbf{z}_i, \Sigma)$, where $\Pi$ is a matrix of
unknown coefficients and $\Sigma$ is an unknown variance-covariance matrix.
Because normality is assumed, it may be necessary to first transform some
variables, such as taking the natural logarithm for right-skewed positive
data.


For the incompletely observed variables $\mathbf{x}$, consider the partition
$\mathbf{X} = (\mathbf{X}_o, \mathbf{X}_m)$, where $\mathbf{X}_o$ contains observed values and $\mathbf{X}_m$ contains
missing values. The goal is to obtain draws of the missing values $\mathbf{X}_m$ given
the observed data $\mathbf{X}_o$ and $\mathbf{Z}$. Note that here all observations on $\mathbf{z}$ are used,
whereas the preceding regression-based imputation used only $\mathbf{Z}_o$.


We apply the data-augmentation method of section 30.4.1, treating $\mathbf{X}_m$
as additional parameters. The draws from the model's posterior density
$p(\Pi, \Sigma, \mathbf{X}_m|\mathbf{Z}, \mathbf{X}_o)$ can be obtained using the Gibbs sampler, with
alternating draws from the conditional posterior $p(\Pi, \Sigma|\mathbf{X}_m, \mathbf{Z}, \mathbf{X}_o)$, called
the P step (posterior step), and the conditional posterior
$p(\mathbf{X}_m|\Pi, \Sigma, \mathbf{Z}, \mathbf{X}_o)$, called the I step (imputation step). The conditional
posteriors depend on the distribution of the data and the prior distributions
of the parameters.


As already noted, we assume that $\mathbf{x}_i \sim N(\Pi\mathbf{z}_i, \Sigma)$. A uniform prior is
assumed for $\Pi$, and an inverted Wishart prior is assumed for $\Sigma$. The
Wishart prior depends on two parameters, and the Stata default settings for
these parameters lead to prior $\pi(\Pi, \Sigma) \propto |\Sigma|^{-(p+1)/2}$, where $p = \dim(\mathbf{x})$.


The data-augmentation method in the preceding section’s probit 
example was used to obtain posterior draws of the parameters, so the 
posterior draws of the augmented latent variables were discarded. For 
multiple imputation, we instead discard the posterior draws of the 
parameters and keep the posterior draws of the augmented variables $\mathbf{X}_m$.
To ensure that the imputed datasets are independent, we follow the default, 
which is to use every hundredth posterior draw. 


30.5.5 The mi import, mi impute, and mi estimate commands 


In some cases, multiply imputed datasets are already available. If these data 
are available as Stata mi data, then the data can be read in using the use 
command. If these multiply imputed data are in other formats, then some 
formats can be read in using the mi import command. 


To instead impute data oneself, one uses the key command mi impute, 
which has format 


mi impute method ... [, options]


The method for regression-based imputation of a single variable includes 
regress for continuous data, intreg for interval recorded continuous data,
truncreg for truncated normal data, pmm for predictive mean matching of a 
continuous variable, poisson and nbreg for count data, logit for binary 
data, mlogit for unordered multinomial data, and ologit for ordered 
multinomial data. method mvn implements data augmentation imputation of 
one or more variables based on a multivariate normal model, method 
chained implements multivariate imputation of different types of variables 
by using a sequence of univariate full-conditional specifications, and 
usermethod allows you to add your own imputation methods. 
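

As an illustration of the chained method, the following sketch imputes one continuous and one binary variable in the same call. The simulated data and the variable names (x2, d, x3, y) are hypothetical and chosen only for this illustration; the declaration commands mi set and mi register that precede mi impute are explained in the example of section 30.6.

```stata
* Sketch: chained imputation of a continuous variable x2 and a binary variable d
clear
set seed 10101
set obs 1000
generate x3 = rnormal()
generate x2 = 0.5*x3 + rnormal()
generate byte d = (0.5*x3 + rnormal() > 0)
generate y  = x2 + d + 2*x3 + rnormal()
replace x2 = . if runiform() < 0.2       // some x2 missing
replace d  = . if runiform() < 0.2       // some d missing

mi set mlong
mi register imputed x2 d
mi register regular y x3

* Linear regression imputes x2; logit imputes d; y and x3 are complete predictors
mi impute chained (regress) x2 (logit) d = y x3, add(5) rseed(10101)

mi estimate: regress y x2 d x3
```

Each incomplete variable is imputed from its own univariate model conditional on the other variables, so a joint normality assumption is not needed for the binary variable.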


Given multiple-imputation datasets, obtained either by use of the mi 
impute command or from an external source, the actual econometric model 
of interest is fit using the mi estimate command, which has the format 


mi estimate [, options]: estimation_command ...


where options specific to mi estimate include nimputations(#), the
number of imputations. 


Other mi declaration commands need to be used ahead of use of the mi 
impute and mi estimate commands. These are illustrated in the example 
below. 


30.6 Multiple-imputation example 


We present in detail a simple example where data on a single regressor are 
MAR. 


We create a dataset of 10,000 observations with $y = 0 + x_2 + 2x_3 + \varepsilon$,
where $x_2 \sim N(0,1)$, $x_3 \sim N(0,1)$, $\mathrm{Cor}(x_2, x_3) = 0.5$, and $\varepsilon \sim N(0,1)$.


. * Create complete data with regressors x2 and x3 correlated 
. clear 


. set seed 10101 
. matrix C = (1, .5\ .5, 1) 


. drawnorm x2 x3, n(10000) corr(C) 
(obs 10,000) 


. generate y = 0 + 1*x2 + 2*x3 + rnormal(0,1)


. summarize 
Variable Obs Mean Std. dev. Min Max 
x2 10,000 .0012519 1.00765 -4.119125 3.596714 
x3 10,000 -.0073642 1.009584 -3.410766 3.781023 
y 10,000 -.0254453 2.86018 -10.65987 11.10632 


Ordinary least-squares (OLS) estimation on all 10,000 observations yields 


. * OLS on complete data 
. regress y x2 x3 


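      Source |       SS           df       MS      Number of obs   =    10,000
-------------+----------------------------------   F(2, 9997)      =  36513.04
       Model |   71948.605         2  35974.3025   Prob > F        =    0.0000
    Residual |  9849.49785     9,997  .985245359   R-squared       =    0.8796
-------------+----------------------------------   Adj R-squared   =    0.8796
       Total |  81798.1028     9,999  8.18062835   Root MSE        =     .9926

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x2 |   1.023646   .0114128    89.69   0.000     1.001275    1.046018
          x3 |   1.990495   .0113909   174.74   0.000     1.968167    2.012824
       _cons |  -.0120684   .0099264    -1.22   0.224    -.0315261    .0073893
------------------------------------------------------------------------------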


The 95% confidence intervals for the intercept and x3 include their DGP
values. The 95% confidence interval for $x_2$ does not quite include the DGP
value of 1.0, but this is not of concern because this will happen in 5% of
simulations.


Now suppose some observations on $x_2$ are missing, and these are
missing when $x_3$ takes high values, specifically, if $x_3 > u$, where
$u \sim N(0.5, 0.5^2)$.


. * Create incomplete data by dropping some x2 (MAR)
. set seed 12345

. replace x2 = . if x3 > rnormal(0.5, 0.5)
(3,282 real changes made, 3,282 to missing)

. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          x2 |      6,718   -.2407734    .9585193  -4.119125   3.355527
          x3 |     10,000   -.0073642    1.009584  -3.410766   3.781023
           y |     10,000   -.0254453     2.86018  -10.65987   11.10632

. qui save incomplete, replace


We lose approximately 33% of the observations on $x_2$.


Because missingness of $x_2$ depends only on $x_3$ and not on $y$, which
depends in part on the unobservable $\varepsilon$, for analysis of $(y, x_2, x_3)$, the data are
MAR. So OLS regression of y on x2 and x3 with incomplete observations
dropped will still provide consistent estimates. We obtain


. * OLS on incomplete data 
. regress y x2 x3 


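      Source |       SS           df       MS      Number of obs   =     6,718
-------------+----------------------------------   F(2, 6715)      =  15243.03
       Model |  30330.9962         2  15165.4981   Prob > F        =    0.0000
    Residual |  6680.84331     6,715  .994913374   R-squared       =    0.8195
-------------+----------------------------------   Adj R-squared   =    0.8194
       Total |  37011.8395     6,717  5.51017411   Root MSE        =    .99745

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x2 |   1.034929    .013885    74.54   0.000      1.00771    1.062148
          x3 |   1.991878   .0174389   114.22   0.000     1.957692    2.026064
       _cons |  -.0060802   .0144895    -0.42   0.675    -.0344842    .0223238
------------------------------------------------------------------------------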


The fitted coefficients are again essentially equal to their DGP values; the
95% confidence interval for the coefficient of x2 does not quite contain 1.0,
as was the case for the complete data. Compared with the complete data, the
standard error of x2 has increased from 0.0114 to 0.0139, and the standard
error of x3 has increased from 0.0114 to 0.0174.


If $x_2$ was MCAR, there would be no benefit in imputing the missing
observations. However, the missingness of $x_2$ was related to values of $x_3$,
and $x_2$ and $x_3$ are correlated. So there is the possibility of imputing the
missing values of $x_2$, using the observed data on the other variables, and
improving efficiency of subsequent estimation.


30.6.1 Describing missing-value patterns 


For regular datasets, the misstable commands can be used to summarize the 
missing patterns in the data. For the current data, we obtain 


. * Summarize missingness 
. misstable summarize 


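                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
            x2 |     3,282               6,718  |   >500  -4.119125    3.355527
  -----------------------------------------------------------------------------

. misstable patterns

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Percent   |  1
  ------------+-------------
       67%    |  1
       33     |  0
  ------------+-------------
      100%    |

  Variables are  (1) x2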


Data are missing for 3,282 values of x2 and are available for 6,718 (a total of 
10,000). Thus, 33% are missing and 67% are available. 


Note that if the dataset is already in Stata mi dataset form, then we need 
to instead use the mi misstable commands. The default for these commands 
is to give the missing-data patterns for only the original (nonimputed) data. 


30.6.2 Data imputation 


There are choices to make in imputing the data. First, we choose which
variables will be used to impute $x_2$. We impute using data on both $x_3$ and $y$.


Second, we choose which imputation method is used. Because only a
single variable is missing, we could use the mi impute regress x2 y x3
command. This assumes that the conditional distribution of $x_2$ given $y$ and
$x_3$ is normal. We instead use the more general mi impute mvn x2 = y x3
command, which can also be used if more than one variable is missing. This
command assumes that the data $(x_2, y, x_3)$ are joint normally distributed, a
reasonable assumption here, but in other applications, it may be necessary to
first transform some of the variables.


Before running the multiple imputation, we first use the mi set mlong 
command to say that the imputed dataset will be stored in the mlong data 
style and use the mi register command to declare which variables have 
missing observations and which have complete observations. 


We then impute only three times to reduce the length of output in this 
illustrative example. We manually set the burn-in at 100 draws, which is 
hopefully enough for the chain to have converged, and keep every 100th 
draw, which hopefully is a big enough separation to ensure that the imputed 
values are independent. (These are the Stata defaults, so their explicit use 
here is for expository purposes.) We obtain 


. * Declare dataset type (long), register imputed variables, perform 3 imputations 
. mi set mlong 


. mi register imputed x2 
(3282 m=0 obs now marked as incomplete) 


. mi register regular y x3 
. mi impute mvn x2 = y x3, add(3) rseed(10101) burnin(100) burnbetween(100) 


Performing EM optimization:
note: 3282 observations omitted from EM estimation because of all imputation
      variables missing.
  observed log likelihood = -448.04782 at iteration 1

Performing MCMC data augmentation ...

Multivariate imputation                     Imputations =        3
Multivariate normal regression                    added =        3
Imputed: m=1 through m=3                        updated =        0

Prior: uniform                               Iterations =      300
                                                burn-in =      100
                                                between =      100

------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
                x2 |       6718         3282      3282 |     10000
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
 of the number of filled-in observations.)


The default uniform prior is used. 


The following data and variables are created. 


. * Multiple imputation creates the following data and variables
. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          x2 |     16,564    .1851142    .9932464  -4.119125   3.747372
          x3 |     19,846    .4858939    .9898968  -3.410766   3.781023
           y |     19,846    1.205346    2.808125  -10.65987   11.10632
       _mi_m |     19,846    .9922402    1.153583          0          3
      _mi_id |     19,846    4980.141    2888.991          1      10000
-------------+---------------------------------------------------------
    _mi_miss |     10,000       .3282    .4695815          0          1


Note that rather than save three complete datasets with 30,000 observations in total, we
save storage space by storing nonmissing data only once. This leads to the
creation of the indicator variables _mi_m, _mi_id, and _mi_miss that appear
in every mi dataset.


The indicator variable _mi_m equals 0 for the original 10,000
observations, 1 for the first set of 3,282 imputed observations, 2 for the
second set of 3,282 imputed observations, and 3 for the third set of 3,282
imputed observations. In all, there are 10000 + 3 x 3282 = 19846
observations. The variable x2 was missing for 3,282 observations, so
summary statistics report 16,564 (= 19846 - 3282) observed values for x2.


Variable _mi_id takes value 1 to 10,000 for the original 10,000
observations and thereafter for imputed data equals the _mi_id in the
original data, so _mi_id gives the identifier for the imputed observations.


The indicator variable _mi_miss takes value 1 if any data are missing in
the first 10,000 observations and value 0 if data are not missing in the first
10,000 observations and takes the missing value (.) for the subsequent
imputed observations.


As for any MCMC method, there is no guarantee that the chain has
converged. Convergence is more likely the longer the chain runs, so we might increase
the burnin(). The Stata manual entry [MI] mi impute mvn demonstrates use
of diagnostics for convergence of the parameters using results stored in the
option savewlf().


An obvious check of both the imputation model and chain convergence 
is to see whether the imputed values of missing observations are sensible. 
The mi xeq prefix can be used to apply Stata data-descriptive commands
such as summarize or tabulate or kdensity to one or more of the original
and imputed datasets.


We summarize the data for the original data as well as the first and third 
imputed datasets. 


. * Check reasonableness of imputed x2 
. mi xeq 0 1 3: summarize y x2 x3


m=0 data:
-> summarize y x2 x3 
Variable Obs Mean Std. dev. Min Max 
y 10,000 -.0254453 2.86018 -10.65987 11.10632 
x2 6,718 -.2407734 .9585193 -4.119125 3.355527 
x3 10,000 -.0073642 1.009584 -3.410766 3.781023 
m=1 data: 
-> summarize y x2 x3 
Variable Obs Mean Std. dev. Min Max 
y 10,000 -.0254453 2.86018 -10.65987 11.10632 
x2 10,000 -.0050421 1.00424 -4.119125 3.586417 
x3 10,000 -.0073642 1.009584 -3.410766 3.781023 
m=3 data: 
-> summarize y x2 x3 
Variable Obs Mean Std. dev. Min Max 
y 10,000 -.0254453 2.86018 -10.65987 11.10632 
x2 10,000 -.0045282 1.00216 -4.119125 3.747372 
x3 10,000 -.0073642 1.009584 -3.410766 3.781023 


As expected, the completely observed variables y and x3 take the same 
values in all datasets. The incompletely observed variable x2 takes a similar 
range of values in the first and third imputed datasets, with mean close to 0 
and standard deviation close to 1. 


The mean of x2 in the completed datasets of 10,000 observations is about
0.25 standard deviations higher than the mean of x2 in the observed dataset
of 6,718 observations. In this example, we know the DGP and so expect this
difference: x2 is missing if x3 takes higher values, and x2 and x3 are positively
correlated, so the missing values of x2 will generally be higher values.
Without this knowledge, further investigation might be
warranted.


The following commands graph the relationship between x2 and x3 for 
the original 6,718 completely observed observations and for the first set of 
3,282 imputed values. 


. * Plot x2 against x3 for nonmissing data and first imputation
. qui graph twoway (scatter x2 x3 if _mi_m==0, msize(tiny))
>     (lfit x2 x3 if _mi_m==0, lwidth(thick)), ytitle("x2")
>     title("Nonmissing data: x2 versus x3") legend(off) saving(graph1, replace)

. qui graph twoway (scatter x2 x3 if _mi_m==1, msize(tiny))
>     (lfit x2 x3 if _mi_m==1, lwidth(thick)), ytitle("x2")
>     title("Imputed data: x2 versus x3") legend(off) saving(graph2, replace)

. graph combine graph1.gph graph2.gph, ycommon xcommon iscale(1.2)
>     ysize(2.5) xsize(6) rows(1)


Figure 30.3 presents the results. The relationship between x2 and x3 is
similar across the two datasets, with the imputed dataset values shifted to the
right with higher values of x2 and x3.


[Figure: two scatterplots with fitted lines. Left panel: Nonmissing data: x2 versus x3. Right panel: Imputed data: x2 versus x3.]


Figure 30.3. A basic scatterplot of log earnings on hours


30.6.3 Estimation using imputed data


Before fitting the model using imputed data, we describe the imputed data. 


. * Describe the imputed data
. mi describe 


Style: mlong 


Observations: 
Complete 6,718 
Incomplete 3,282 (M = 3 imputations) 
Total 10,000 

Variables: 


Imputed: 1; x2(3282) 

Passive: 0 

Regular: 2; y x3 

System: 3; _mi_m _mi_id _mi_miss 


(there are no unregistered variables) 


The original data had 10,000 observations of which 6,718 were complete.
Here 3 sets of imputed values for the 3,282 incomplete observations are
obtained. We use the mi estimate command to perform OLS estimation in
each of the 3 completed datasets, each with 6,718 complete observations and
3,282 imputed observations (with imputation here only of x2).


. * Fit the model 
. mi estimate, dots: regress y x2 x3 


Imputations (3):
  ... done

Multiple-imputation estimates                   Imputations       =          3
Linear regression                               Number of obs     =     10,000
                                                Average RVI       =     0.3207
                                                Largest FMI       =     0.5336
                                                Complete DF       =       9997
DF adjustment:   Small sample                   DF:     min       =       9.93
                                                        avg       =   1,369.10
                                                        max       =   2,426.18
Model F test:       Equal FMI                   F(   2,   29.6)   =   24673.11
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x2 |    1.02051   .0154657    65.99   0.000     .9860178    1.055001
          x3 |   2.003095   .0115281   173.76   0.000     1.980489    2.025701
       _cons |  -.0049432    .010129    -0.49   0.626    -.0248102    .0149237
------------------------------------------------------------------------------


The estimated coefficients are close to their DGP values of, respectively, 1, 2,
and 0. Compared with regression using the incomplete data, the standard
error of x2 has increased from 0.0139 to 0.0155, and the standard error of x3
has decreased from 0.0174 to 0.0115. The increase in the standard error of
x2 reflects the extra variability that comes from using only a few imputations. Below, we increase
the number of imputations to 10 and find that all the standard errors are
then lower than those from the incomplete-data regression.


Inference is based on t statistics and F statistics with degrees of freedom
that depend in a complicated way on the number of observations, the 
consequences of missing data, and the number of imputations. 


The mi convert command can be used to create separate datasets on the 
original observations and for each of the imputations. These could be 
exported and used by other analysts. 


For the current example, we have 


. * Create three complete datasets with imputed
. mi convert flongsep midata, replace clear 
(files midata.dta _1_midata.dta _2_midata.dta _3_midata.dta created) 


Each of these imputed datasets is composed of nonmissing data on all
variables plus 3,282 imputed observations for x2.


The following code demonstrates that running three separate regressions
on these three imputed datasets and averaging the resulting three estimated
coefficients leads to the coefficient estimate reported in the preceding mi
estimate command. We do so for the coefficient of x2.


. * Perform OLS estimation on each of these three imputed datasets
. use _1_midata, clear

. qui regress y x2 x3, noheader
. scalar b1x2 = _b[x2]

. use _2_midata, clear

. qui regress y x2 x3, noheader
. qui scalar b2x2 = _b[x2]

. use _3_midata, clear

. qui regress y x2 x3, noheader
. scalar b3x2 = _b[x2]


. display "Ave OLS coeff of x2 over 3 imputed samples = " (b1x2 + b2x2 + b3x2)/3
Ave OLS coeff of x2 over 3 imputed samples = 1.0205095


The average of the 3 coefficients of x2 is 1.02051, as expected. 
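The same check can be written more compactly as a loop, and extended to the standard error. The following is a minimal sketch, assuming the files _1_midata.dta, _2_midata.dta, and _3_midata.dta created by mi convert are in the working directory. The scalar names (b1, v1, bbar, W, B, T) are illustrative and do not appear in the text, and the variance calculation applies the standard combining rule T = W + (1 + 1/M)B with M = 3.

```stata
* Sketch: combine OLS results for x2 across the three imputed datasets
forvalues m = 1/3 {
    use _`m'_midata, clear
    quietly regress y x2 x3
    scalar b`m' = _b[x2]          // coefficient of x2 in imputed dataset m
    scalar v`m' = _se[x2]^2       // its estimated variance
}
scalar bbar = (b1 + b2 + b3)/3                                    // combined estimate
scalar W = (v1 + v2 + v3)/3                                       // within-imputation variance
scalar B = ((b1-bbar)^2 + (b2-bbar)^2 + (b3-bbar)^2)/(3-1)        // between-imputation variance
scalar T = W + (1 + 1/3)*B                                        // total variance
display "Combined coeff of x2 = " bbar "   combined std. err. = " sqrt(T)
```

The combined coefficient and standard error can then be compared with the x2 row of the earlier mi estimate output.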


We conclude by repeating the analysis using 10 imputations. 


. * Impute and estimate with 10 imputations 
. use incomplete, clear 

. mi set mlong 

. mi register imputed x2 
(3282 m=0 obs now marked as incomplete) 

. mi register regular y x3 
. qui mi impute mvn x2 = y x3, add(10) rseed(10101) burnin(100) burnbetween(100) 
. mi estimate, dots: regress y x2 x3 


Imputations (10):
  .........10 done

Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =     10,000
                                                Average RVI       =     0.7354
                                                Largest FMI       =     0.5129
                                                Complete DF       =       9997
DF adjustment:   Small sample                   DF:     min       =      37.55
                                                        avg       =      69.79
                                                        max       =     117.71
Model F test:       Equal FMI                   F(   2,   68.6)   =   20513.58
Within VCE type:          OLS                   Prob > F          =     0.0000

           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          x2 |   1.018753   .0134219    75.90   0.000     .9921737    1.045333
          x3 |   1.993356   .0159557   124.93   0.000     1.961043    2.025669
       _cons |  -.0127999    .012918    -0.99   0.326    -.0386979    .0130981


Compared with regression using the incomplete data, the standard error of 
x2 has decreased from 0.0139 to 0.0134, and the standard error of x3 has 
decreased from 0.0174 to 0.0160. 


30.7 Additional resources 


The [BAYES] Stata Bayesian Analysis Reference Manual provides a very 
detailed exposition of Bayesian analysis and MCMC computation. The key 
commands are the bayes prefix and the bayesmh command. Bayesian 
analysis is presented in many complete books, including Koop (2003) and 
Gelman et al. (2013), and in Cameron and Trivedi (2005, chap. 13). 


The [MI] Stata Multiple-Imputation Reference Manual covers the 
multiple-imputation commands; see also Cameron and Trivedi (2005, 
chap. 27). 


30.8 Exercises 


1. This question adapts the program in section 30.3 from a probit model 
to a logit model. Use the same DGP and seed as in section 30.2.1, 
except replace the line generate ystar = 0.5*x+rnormal(0,1) with 
the three lines gen u=runiform() and gen ulogistic=ln(u)-ln(1-u) 
and gen ystar=0.5*x+ulogistic. Fit by the usual logit MLE, and 
report the results. Fit the logit model with flat priors for the intercept 
and slope parameters using the Stata command bayesmh and default 
settings. For the slope coefficient β2, give the posterior mean, standard 
deviation, and a 95% Bayesian posterior credible interval. Compare 
the posterior mean and standard deviation of β2 with the logit MLE of 
β2 and its standard error. Comment on any differences between the 
two. Is this what you expect? 

2. Continue the example of question 1. Adapt the Mata code to run MCMC 
under MH for the logit model. Run this code. Note that the Mata 
command invlogit(A) computes Λ(a_ij) for each element a_ij of the 
matrix A. Compare your results with those obtained using the bayesmh 
command. Does your MH method appear to work well? Is there much 
serial correlation between draws? Is the acceptance rate high? Explain 
your answer. 

3. This question adapts the program in section 30.4 from a probit model 
to a tobit model. For simplicity, σ² is known with σ² = 1, so we need to 
obtain only the posterior for β. Use the same DGP and seed as in 
section 30.2.1, except replace the line generate y=(ystar>0) with gen 
y=ystar and replace y=0 if ystar<0. Fit the tobit model by the usual 
tobit MLE, and report the results. Adapt the Mata code for the Gibbs 
sampler and data augmentation for the probit model to the tobit model 
(with known σ² = 1). Run this code. Compare your results with the ML 
estimates. Does your MCMC method appear to work well? Is there much 
serial correlation between draws? Is the acceptance rate high? Explain 
your answer. 

4. This question adapts the multiple-imputation example. Use the 
command mi impute regress with seed 10101, 10 imputations, and x2 
imputed using data on y and x3. Summarize the imputed datasets, and 
regress y on x2 and x3 using the imputed data. Compare your results 
with those in section 30.6 based on the mi impute mvn command. 

5. Generate data using the following commands: clear and set seed 
10101 and matrix C=(1,.5\.5,1) and drawnorm x2 x3, n(10000) 
corr(C) and, finally, gen y=0+1*x2+2*x3+rnormal(0,1). Summarize 
the DGP. Estimate by OLS. Now, give commands set seed 12345 and 
replace x2=. if x3>0.5. Is the missing-data mechanism MCAR, MAR, 
or MNAR? For each of the following methods, obtain the OLS coefficient 
estimates and state whether you believe slope coefficients are 
consistently estimated and standard errors are consistently 
estimated: i) case deletion; ii) mean imputation for the missing 
variable; iii) mean imputation for the missing variable plus a dummy 
variable for missing; iv) imputation by regressing the regressor with 
missing values on the other completely observed regressors and using the 
fitted values from this regression to impute the missing regressor; 
v) multiple imputation using the mi impute regress command with 
seed 10101, 10 imputations, and x2 imputed using data on y and x3. 
Which methods give valid results? 


Glossary of abbreviations 


• 2SLS – two-stage least squares 
• 3SLS – three-stage least squares 
• AFT – accelerated failure time 
• AIC – Akaike information criterion 
• AICC – Akaike information corrected criterion 
• AIPW – augmented inverse-probability weighting 
• AME – average marginal effect 
• AR – Anderson–Rubin 
• AR – autoregressive 
• ARMA – autoregressive moving average 
• ARUM – additive random-utility model 
• ATE – average treatment effect 
• ATET – average treatment effect on the treated 
• AUC – area under the curve 
• BC – bias-corrected 
• BCa – bias-corrected accelerated 
• BIC – Bayesian information criterion 
• BLP – Berry–Levinsohn–Pakes 
• CCE – common correlated estimator 
• c.d.f. – cumulative distribution function 
• CIF – cumulative incidence function 
• CL – conditional logit 
• CLR – conditional likelihood ratio 
• CQR – conditional quantile regression 
• CRE – correlated random effects 
• CV – cross-validation 
• DGP – data-generating process 
• DIC – deviance information criterion 
• DID – difference in differences 
• DV – dummy variable 
• DWH – Durbin–Wu–Hausman 
• ERM – extended regression model 
• ET – endogenous treatment 
• FAQ – frequently asked questions 
• FD – first difference 
• FDP – false discovery proportion 
• FDR – false discovery rate 
• FE – fixed effects 
• FGLS – feasible generalized least squares 
• FMM – finite-mixture model 
• FPC – finite-population correction 
• FRD – fuzzy regression discontinuity 
• FWER – familywise error rate 
• GAM – generalized additive models 
• GLM – generalized linear models 
• GLS – generalized least squares 
• GMM – generalized method of moments 
• GS2SLS – generalized spatial two-stage least squares 
• GSEM – generalized structural equation model 
• GUI – graphical user interface 
• HAC – heteroskedasticity- and autocorrelation-consistent 
• HRS – Health and Retirement Study 
• IIA – independence of irrelevant alternatives 
• i.i.d. – independent and identically distributed 
• IM – information matrix 
• IPW – inverse-probability weighting 
• IPW-RA – inverse probability with regression adjustment 
• ITT – intention to treat 
• IV – instrumental variables 
• JIVE – jackknife instrumental-variables estimator 
• LATE – local average treatment effect 
• LEF – linear exponential family 
• LIML – limited-information maximum likelihood 
• LM – Lagrange multiplier 
• LOOCV – leave-one-out cross-validation 
• LPM – linear probability model 
• LR – likelihood ratio 
• LS – least squares 
• LSDV – least-squares dummy variable 
• MA – moving average 
• MAR – missing at random 
• MCAR – missing completely at random 
• MCMC – Markov chain Monte Carlo 
• MD – minimum distance 
• ME – marginal effect 
• MEM – marginal effect at mean 
• MER – marginal effect at representative value 
• MG – mean group 
• MH – Metropolis–Hastings 
• ML – maximum likelihood 
• MLE – maximum likelihood estimator 
• MLT – multilevel treatment 
• MM – method of moments 
• MNAR – missing not at random 
• MNL – multinomial logit 
• MNP – multinomial probit 
• MSE – mean squared error 
• MSL – maximum simulated likelihood 
• MSS – model sum of squares 
• MTE – marginal treatment effect 
• NB – negative binomial 
• NB1 – negative binomial variance linear in mean 
• NB2 – negative binomial variance quadratic in mean 
• NL – nested logit 
• NLS – nonlinear least squares 
• NNM – nearest-neighbor matching 
• NR – Newton–Raphson 
• NSW – National Supported Work 
• OHIE – Oregon Health Insurance Experiment 
• OHP – Oregon Health Program 
• OLS – ordinary least squares 
• PA – population averaged 
• PFGLS – pooled feasible generalized least squares 
• PH – proportional hazards 
• PM – predictive mean 
• POM – potential-outcome mean 
• PSID – Panel Study of Income Dynamics 
• PSM – propensity-score matching 
• PSU – primary sampling unit 
• QCR – quantile count regression 
• QR – quantile regression 
• QTE – quantile treatment effect 
• RA – regression adjustment 
• RCT – randomized control trials 
• RD – regression discontinuity 
• RE – random effects 
• RIF – recentered influence function 
• RMSE – root mean squared error 
• ROC – receiver operator characteristics 
• RPL – random-parameters logit 
• RSS – residual sum of squares 
• SAR – spatial autoregressive 
• SARAR – autoregressive spatial autoregressive 
• SEM – structural equation model 
• SJ – Stata Journal 
• SRD – sharp regression discontinuity 
• STB – Stata Technical Bulletin 
• SUR – seemingly unrelated regressions 
• TE – treatment effect 
• TSS – total sum of squares 
• VCE – variance–covariance matrix of the estimator 
• WLS – weighted least squares 
• ZINB – zero-inflated negative binomial 
• ZIP – zero-inflated Poisson 
• ZTNB – zero-truncated negative binomial 
• ZTP – zero-truncated Poisson 

NB2 model, 20.2.2 , 20.3.3 , 20.3.4 
negative binomial distribution, 20.2.2 
negative binomial model, 20.3.3 , 20.3.4 


nonlinear least squares, 20.3.6 , 20.3.6 
overdispersion, 20.2.2 
overfitting, 20.5.11 
panel-data estimators, 22.6 , 22.6.9 
point-mass distribution, 20.6.5 , 23.3.6 , 23.3.6 
Poisson distribution, 20.2 
Poisson for continuous nonnegative dependent variable, 20.2.4 
Poisson mixed effects, 23.4.2 , 23.4.3 
Poisson model, 20.3.2 , 20.3.4 
quantile count regression, 20.9 , 20.9.5 
robust standard errors, 20.3.2 , 20.3.2 
shrinkage estimation, 28.4.7 , 28.4.7 
test of endogeneity, 20.7.2 
test of overdispersion, 20.3.2 
treatment effects, 25.4.3 , 25.4.5 
truncated Poisson and NB2, 20.3.7 
two-part model, 20.4.1 , 20.4.3 
unobserved heterogeneity, 20.2.2 
weak instruments, 20.7.4 
zero-inflated models, 20.6 , 20.6.6 
zero-truncated models, 20.3.7 
counterfactual command, 25.9.3 
countfit command, 20.3.4 
Cox proportional hazards model, see duration models 
cpoisson command, 20.3.7 
cquad commands, 22.4.13 
cross-fit partialing-out estimator, 28.8.9 , 28.8.9 
crossfold command, 28.2.6 
cross-validation, 28.2.5 , 28.2.8 
K-fold, 28.2.6 , 28.2.6 
leave-one-out, 27.2.4 , 28.2.7 
single split, 28.2.5 
test sample, 28.2.5 
training sample, 28.2.5 
cumulative hazard function, 21.3.3 
Nelson—Aalen estimate, 21.3.3 
cvplot command, 28.4.3 


D 
data augmentation, 30.4 , 30.4.4 , 30.5.4 , 30.6.3 
for multiple imputation, 30.5.4 , 30.6.3 
data summary examples 
for duration data, 21.2.1 , 21.3.3 
for multinomial data, 18.3 , 18.3.3 
dataset description 
Acemoglu—Johnson—Robinson data for lasso example, 28.9.3 
American Community Survey earnings Bayesian example, 29.2.2 
cigarette sales synthetic control example, 25.6.2 
fishing-mode multinomial data, 18.3.1 
homicide spatial data example, 26.3 
HRS private health insurance binary data, 17.4.1 
Medical Expenditure Panel Survey 
ambulatory for tobit, 19.3.1 
doctor-visits count data, 20.3.1 
emergency room visits, 20.6.1 
for count QR example, 20.9.3 
for GSEM example, 23.6.7 
Medicare multivalued treatment-effects example, 24.10.2 
Medicine in Australia: 
Balancing Employment and Life panel attrition example, 19.11.1 
OHIE treatment-effects example, 24.8.2 
Rand HIE nonlinear panel example, 22.3 
Senate elections for regression discontinuity design example, 25.7.3 
unemployment duration example, 21.2.1 
defier, 25.5.1 
design-based inference, 24.4.7 
deviance residual, 28.4.7 
didregress command, 25.6.1 
difference in differences, 25.6 , 25.6.1 
parallel trends assumption, 25.6.1 
dimension reduction principal components, 28.5.1 , 28.5.1 
discrete-time hazards model, 21.7 , 21.9 
discrim command, 28.6.7 
discriminant analysis, 28.6.7 
double selection estimator, 28.8.10 , 28.9 


doubly robust ATE estimator, 24.6.5 
doubly robust methods, see treatment effects 
dslogit command, 28.9.2 
dspoisson command, 28.9.1 
dsregress command, 28.8.10 
duration data, see duration models 
duration models, 21 , 21.11 
accelerated failure-time models, 21.5.5 , 21.5.7 
censoring, 21.1 
clustered data, 21.9 
competing risks, 21.4.7 , 21.4.7 
complete spells, 21.1 
Cox proportional hazards model, 21.4.2 , 21.4.6 , 21.7.1 
cumulative hazard function, 21.3.3 
data summary, 21.2 , 21.2.3 
diagnostics, 21.4.6 , 21.4.6 
discrete-time hazards model, 21.7 , 21.9 
exponential model, 21.1 , 21.5.4 
frailty in models, 21.5.8 , 21.5.8 
fully parametric models, 21.5 , 21.5.8 
generalized gamma model, 21.5.4 
Gompertz model, 21.5.4 
hazard function, 21.3.3 
Kaplan—Meier estimate, 21.3.2 
log-integrated hazard curve, 21.4.6 
log—log survivor curve, 21.4.6 
loglogistic model, 21.5.5 
lognormal model, 21.5.5 
marginal effects, 21.5.3 , 21.5.8 , 21.7.1 
multiple-records data, 21.6 , 21.8 
Nelson—Aalen estimate, 21.3.3 
prediction, 21.4.4 , 21.5.3 
proportional hazards model, 21.4.2 , 21.5.4, 21.5.4, 21.5.5 
Schoenfeld residual, 21.4.6 
survivor function, 21.3.2 
time-varying regressors, 21.8 
Weibull FMM application, 23.3.7 , 23.3.7 


Weibull model, 21.5.2 


E 


eintreg command, 19.9.2 , 19.12 , 24.7.2 , 25.3.1 
elastic net, 28.3.3 , 28.3.3 , 28.4.5 , 28.4.5 
elasticnet command, 28.4.1 , 28.4.5 
endogenous regressors 
endogenous-switching regression model, 19.9.3 
ERM commands, 23.7 , 23.7.3 , 25.3 , 25.3.3 
ET commands, 25.4 , 25.4.6 
in binary outcome models, 17.9 , 17.9.5 




in count-data models, 20.7 , 20.7.4 

in tobit model, 19.9 , 19.9.3 

nonlinear instrumental variables, 16.8.1 

panel data, 22.8 , 22.8 

quantile treatment effects, 25.8 , 25.8.4 

structural equation model, 23.6.5 , 23.6.6 

treatment effects, 25 , 25.11 
endogenous-switching regression model, 19.9.3 
eoprobit command, 23.7.1 , 24.7.2 , 25.3.1 
eprobit command, 17.9.3 , 23.7.1 , 24.7.2 , 25.3.1 


ERM commands, 23.7 , 23.7.1 , 23.7.3 
for endogenous treatment, 25.3 , 25.3.3 
for exogenous treatment, 24.7.2 , 24.7.2 
for panel data, 22.8 


estat 
alternatives command, 18.5.7 
classification command, 17.5.3 
correlation command, 18.7.5 
covariance command, 18.7.5 
gof command, 17.5.2 
hettest command, 19.5.3 
impact command, 26.7.3 
lcmean command, 23.3.1 
lcprob command, 23.3.1 
mfx command, 18.5.8 
moran command, 
phtest command, 21.4.6 
teffects command, 
estimation commands 

multinomial summary, 18.2.6 

nonlinear panel summary, 22.2.6 
ET commands for endogenous treatment, 25.4 , 25.4.6 
eteffects command, 25.4.6 
etpoisson command, 25.4.3 
etregress command, 25.4 
event study models, 25.6.1 
exponential model, 21.5.4 
extended regression models, see ERM commands 
extreat () option, 24.7.2 


F 

failure data, see duration models 

femlogit command, 18.10 

finite mixture models, 23.2 , 23.3.8 
computational method, 23.2.2 
count-data regression, 20.5 , 20.5.11 
definition, 23.2.1 
gamma regression, 23.3.1 , 23.3.1 
logit regression, 23.3.2 , 23.3.2 
marginal effects, 23.3.1 
multinomial logit, 23.3.3 , 23.3.3 
multiple response example, 23.6.7 , 23.6.7 
point-mass distribution, 20.6.5 , 23.3.6 , 23.3.6 
Poisson regression, 23.3.5 , 23.3.5 
predicted class posterior probabilities, 23.3.1 
predicted class probabilities, 23.3.1 
predicted component means, 23.3.1 
tobit regression, 23.3.4 , 23.3.4 
varying mixture probabilities, 23.2.1 
Weibull regression, 23.3.7 , 23.3.7 

fmm prefix, 20.5.3 , 23.2.2 

foreach command, 19.4.6 


fracreg logit command, 17.10.2 
fracreg probit command, L712 
fractional response data regression, 17.3.5 , 17.10.2 


G 
gam command, 27.8 
gamma regression model, 23.3.1 , 23.3.1 
FMM application, 23.3.1 , 23.3.1 
generalized additive model, 27.8 , 27.8 
generalized estimating equations, 22.2.3 , 22.4.4 
generalized gamma model, 21.5.4 
generalized linear models 
logit, 17.4.8 
mixed effects, 23.4.1 , 23.4.3 
probit, 17.4.8 
generalized method of moments 
count-data example, 20.7.3 , 20.7.4 
nonlinear example, 16.8 , 16.8.2 
generalized SEM, 23.6 , 23.6.7 
geospatial data, 26.3 , 26.3.4 
Gibbs sampling algorithm, see Markov chain Monte Carlo methods 
glm command, 17.4.8 
gmm command, 20.7.3 
gnbreg command, 20.3.5 
gologit2 command, 18.9.6 
Gompertz model, 21.5.4 
gradient methods, see iterative methods 
graphs 
Bayesian diagnostics, 29.4.4 
binary outcome plot, 17.5.4 
nonparametric regression, 27.4.5 
grmap command, 26.3.3 
grouped data binary outcome models, 17.10 , 17.10.2 
gsem command, 23.6.2 
hurdle model, 20.4.3 
gvselect command, 28.2.8 


H 
hazard function, 21.3.3 
heckman command, 19.6.2 
eregress command equivalent, 19.6.2 , 23.7.1 
heckmancopula command, 19.7.2 
heckoprobit command, 19.6.2 
heckpoisson command, 19.6.2 
heckprobit command, 19.6.2 
hetoprobit command, 18.9.6 
hetprobit command, 17.8.1 


hierarchical models, 29.8 , 29.8.2 

hnblogit command, 20.4.3 

hplogit command, 20.4.2 

hurdle model, see count-data models 
GSEM estimation, 20.4.3 

hypothesis tests 


permutation tests, 24.3.1 , 25.6.2 


I 
identification and nonconvergence, 16.3.5 
ignorability, see unconfoundedness 
imputation 
inverse-probability weighting, 19.11.3 , 19.11.3 
multiple imputation, 30.5 , 30.6.3 
regression-based, 30.5.3 
sample selection, 19.11.4 , 19.11.4 
summary of methods, 19.10.5 
information criteria, 20.5.10 , 23.2.1 , 28.2.3 , 28.3.4 
instrumental variables 
control function estimator, 17.9.3 , 20.7.2 , 20.7.2 , 25.4.6 
in binary outcome models, 17.9.4 , 17.9.5 
in count-data models, 20.7 , 20.7.4 
in spatial regression models, 26.7.2 , 26.8 , 26.8 
lasso, 28.9.3 , 28.9.3 
nonlinear IV, 16.8.1 , 16.8.2 , 17.9.5 
quantile regression, 25.8 , 25.8.4 
structural model approach, 17.9.2 , 17.9.3 , 20.7 , 20.7.4 


weak instruments, 17.9.3 , 20.7.4 
intention-to-treat effect, 24.8.1 , 25.5.4 
interval regression, see tobit models 


intreg command, 19.3.6 


24.6.2 
IPW regression adjustment, see treatment effects 
iterative methods, 16 , 16.10 
checking analytical derivatives, 16.6.2 
checking data, 16.6.3 
checking parameter estimates, 16.6.5 
checking program, 16.6 , 16.6.6 
checking standard errors, 16.6.6 
constraints on parameters, 16.3.8 
derivative methods, 16.4.1 
evaluator types, 16.4.1 , 16.7.1 
GMM example in Mata, 16.8.2 , 16.8.2 
gradient methods, 16.3 , 16.3.8 
linear-form methods, 16.4.1 
Mata example, 16.2.3 , 16.2.3 
maximization options, 16.3.1 
messages during iterations, 16.3.3 
ml command methods, 16.5.2 , 16.7 , 16.7.4 
multicollinearity, 16.6.4 
multiple optimums, 16.3.6 
Newton-Raphson method, 16.2 , 16.2.3 
NLS example, 16.5.5 
nonconvergence, 16.3.5 
nonlinear IV example in Mata, 16.8.2 , 16.8.2 
not identified, 16.3.5 
NR example in Mata, 16.2.3 , 16.2.3 
numerical derivatives, 16.3.7 
optimization in Mata, 16.8 , 16.8.2 
optimization techniques, 16.3.2 , 16.4.2 
overview of optimization tools, 16.4 , 16.4.2 
Poisson gradient and Hessian, 16.2.2 
quadratic form methods, 16.4.1 


single-index model example, 16.5.3 , 16.5.3 
step-size adjustment, 16.3.2 
stopping criteria, 16.3.4 
two-index model example, 16.5.4 , 16.5.4 
ivpoisson gmm command, 20.7.3 , 20.7.3 
ivprobit command, 17.9.3 
ivqte command, 25.8.2 
ivregress command, 17.9.4 
ivtobit command, 19.9.1 


K 

Kaplan-Meier estimate, 21.3.2 , 21.3.3 
kernel regression, see nonparametric methods 
K-fold cross-validation, 28.2.6 , 28.2.6 
kmatch command, 24.12 

k-means clustering, 28.6.8 


L 

lars command, 28.3.2 

lasso, 28.3.2 , 28.3.2 , see also machine learning 
adaptive lasso, 28.4.4 , 28.4.4 
application, 28.4 , 28.4.7 , 28.7.2 
clustered data, 28.4.1 , 28.4.4 , 28.7.2 
distribution of estimators, 28.3.4 , 28.3.4 
elastic net, 28.3.3 , 28.3.3 , 28.4.5 , 28.4.5 
inference, 28.8 , 28.9.4 , 28.10 
linear command, 28.4.1 
linear example, 28.4.2 , 28.4.5 
logit, 28.4.7 , 28.4.7 
logit command, 28.4.7 
oracle property, 28.3.4 
Poisson, 28.4.7 , 28.4.7 
poisson command, 28.4.7 
postlasso OLS, 28.3.4 , 28.4.3 
probit, 28.4.7 , 28.4.7 
probit command, 28.4.7 
selection (bic), 28.4.4 , 28.7.2 


sparsity assumption, 28.8.1 
lassocoef command, 28.4.3 
lassogof command, 28.4.3 
lassoinfo command, 28.4.3 


lassoknots command, 28.4.3 
lassoselect command, 28.4.3 
LATE, 25.5, 25.5.5 
application, 25.5.4 , 25.5.4 
assumptions, 25.5.2 
complier, 25.5.1 
marginal treatment effects, 25.5.5 
monotonicity, 25.5.2 
Wald estimator, 25.5.1 
least-absolute-deviations regression 
censored, 19.3.6 
least-angle regression, 28.3.2 
leave-one-out cross-validation, 27.2.4 , 28.2.7 
linear discriminant analysis, 28.6.7 
linear probability model, 17.2.2 , 17.3.4 
linear regression model 
Bayesian example, 29.2 , 29.2.2 , 29.5 , 29.5.3 
lasso example, 28.4.2 , 28.4.5 
linear probability model, 17.3 , 17.3.4 
local average treatment effects, see LATE 
local constant regression, 27.2.1 
local linear regression, 27.2.2 
local polynomial regression, 25.7.5 
locreg command, 17.8.4 
log-integrated hazard curve, 21.4.6 
logistic command, 17.3.2 
logit command, 17.3.2 
logit model, see binary outcome models 
logitfe command, 22.4.8 
log-linear regression 

Poisson regression alternative, 20.2.4 
log-log survivor curve, 21.4.6 
loglogistic model, 21.5.5 


lognormal data 

tobit model, 19.4 , 19.4.7 
lognormal durations model, 21.5.5 
long-form data, 18.5.1 , 18.5.1 
loocv command, 28.2.7 
lpoly command 
compared with npregress, 27.4.6 
lroc command, 17.5.5 


M 

machine learning, 28 , 28.11 
augmented IPW, 28.9.4 
bagging, 28.6.4 
big data, 28.6 
boosting, 28.6.6 , 28.7.2 
causal analysis, 28.8 , 28.10 
classification, 28.6 , 28.6.7 , 28.6.7 
cross-fit partialing out, 28.8.9 , 28.8.9 
double or debiased, 28.8.9 , 28.8.9 
double selection, 28.8.10 , 28.9 
inference, 28 , 28.11 
instrumental variables, 28.9.3 , 28.9.3 
lasso, 28.3.2 , 28.4.7 , 28.8 , 28.10 
logit, 28.4.7 , 28.4.7 , 28.9.2 , 28.9.2 
neural networks, 28.6.2 , 28.6.2 , 28.7.2 
oracle property, 28.3.4 
orthogonalization, 28.8.8 , 28.8.8 
overview, 28.6 , 28.6.8 
partial linear model, 28.8.1 , 28.8.10 
partialing out, 28.8.3 , 28.8.8 
Poisson, 28.4.7 , 28.4.7 , 28.9.1 , 28.9.1 
prediction application, 28.7 , 28.7.3 
random forests, 28.6.5 , 28.7.2 
regression trees, 28.6.3 , 28.6.3 
shrinkage estimators, 28.3 , 28.4.7 
supervised learning, 28.6 , 28.6.1 , 28.6.7 
support vector machines, 28.6.7 


unsupervised learning, 28.6 , 28.6.8 , 28.6.8 
Mallows's Cp measure, 28.2.3 
marginal effects 
in Bayesian model, 29.11.3 , 29.11.3 
in binary outcome models, 17.6 , 17.6.4 
in count-data models, 20.3.2 
in linear mixed models, 23.4.3 
in panel logit model, 22.4.11 , 22.4.11 
in panel ordered logit model, 22.4.15 
in spatial regression, 26.7.3 , 26.7.3 
in tobit model, 19.3.4 , 19.3.4 
in treatment-effects models, 25.4.1 , 25.4.5 
manual computation, 25.4.5 
marginal likelihood, 29.3.1 
computation, 29.9.1 
marginal treatment effects, 25.5.5 
margins command, 25.2.4 , 21.4.4 , 21.4.7 
margins, 
predict(cte) command, 25.4.1 
subpop() command, 25.4.1 
marginsplot command, 27.4.5 
Markov chain Monte Carlo methods, 29 , 29.11.3 
acceptance rate, 29.4.2 
acceptance rule, 29.3.3 , 30.3.2 
blocking parameters, 29.7.1 , 29.7.1 
burn-in draws, 29.3.2 
convergence, 29.4.4 , 29.6.5 
data augmentation, 30.4 , 30.4.4 
diagnostics, 29.4.4 , 30.3.3 , 30.4.3 
different starting values, 29.4.5 
efficiency, 29.4.2 , 30.3.3 , 30.4.3 
efficiency of draws, 29.6.3 
efficiency statistic, 29.4.4 
for multiple imputation, 30.5.4 , 30.6.3 
Gelman—Rubin statistic, 29.4.5 
Gibbs sampler example in Mata, 30.4 , 30.4.4 


Gibbs sampling algorithm, 29.3.5 , 29.3.5 
Gibbs within Metropolis—Hastings, 29.7.2 , 29.7.2 
Metropolis algorithm, 29.3.3 
Metropolis—Hastings algorithm, 29.3.3 , 29.3.3 
Metropolis—Hastings algorithm in Mata, 30.3 , 30.3.3 
Monte Carlo simulation error, 29.4.4 
multiple chains, 29.4.5 , 29.4.5 
pitfalls, 29.6.4 , 29.6.5 
posterior draws, 29.3.2 , 29.4.4 , 29.4.7 
prediction, 29.10 , 29.10.3 
proposal density, 29.3.3 
random-walk Metropolis—Hastings algorithm, 29.3.3 
sensitivity analysis, 29.4.5 
Mata 
Bayesian model AME, 29.11.3 , 29.11.3 
data augmentation example, 30.4 , 30.4.4 
Gibbs sampler example, 30.4 , 30.4.4 
GMM example, 16.8.2 , 16.8.2 
Metropolis—Hastings example, 30.3 , 30.3.3 
Newton-Raphson example, 16.2.3 , 16.2.3 
optimization functions, 16.8 , 16.8.2 
Poisson regression example, 16.2.3 , 16.2.3 
matching methods, see treatment effects 
maximization, see iterative methods 
maximum likelihood 
computational methods, 16 , 16.7.4 
maximum simulated likelihood estimator, 18.7.3 , 18.7.3 
MCMC, see Markov chain Monte Carlo methods 
mean absolute error, 28.2.6 
mean squared error, 28.2.2 
measurement-error example, 23.5.7 , 23.5.8 
meglm command, 23.4.1 , 23.4.2 
meintreg command, 19.3.7 
menbreg command, 23.4.2 
meologit command, 18.10 
meoprobit command, 18.10 
mepoisson command, 23.4.2 


mestreg command, 21.9 
metobit command, 19.3.7 
Metropolis—Hastings algorithm, see Markov chain Monte Carlo methods 
mi 
convert command, 30.6.3 
estimate command, 30.5.5 
impute command, 30.5.5 
misstable command, 30.6.1 
register command, 30.6.2 
xeq prefix, 30.6.2 
minimization, see iterative methods 
missing at random, 19.10.1 , 30.5.1 
missing completely at random, 19.10.1 , 30.5.1 
missing data, 19.10 , 19.10.5 
complete-case analysis, 19.10.2 
endogenous sample selection, 19.10.4 
imputation methods, 19.10.5 , 30.5 , 30.6.3 
inverse-probability weighting, 19.10.3 , 19.10.3 
MAR, 19.10.1 , 30.5.1 
MCAR, 19.10.1 , 30.5.1 
mechanisms, 19.10.1 , 19.10.1 
MNAR, 19.10.1 , 30.5.1 
multiple imputation, 30.5 , 30.6.3 
panel attrition, 19.11 , 19.11.5 
sample selection, 19.11.4 , 19.11.4 
missing not at random, 19.10.1 , 30.5.1 
missing-values imputation, 30.5 , 30.6.3 
mixed models 
generalized linear model, 23.4.1 , 23.4.3 
nonlinear, 22.4.12 , 23.4 , 23.4.3 
Poisson example, 23.4.2 , 23.4.3 
mixlogit command, 18.8.2 
ml 
check command, 16.5.3 , 16.6.1 
commands, 16.5 , 16.7.4 
maximize command, 16.5.1 
method d2, 16.7.3 
method gf0, 16.7.4 
method lf, 16.5.2 
method lf0, 16.7, 
method lf1, 16.7.2 
method lf2, 16.7.2 
model command, 16.5.1 , 
search command, 16.5.3 
trace command, 16.6.1 
mleval command, 16.7.1 
mlmatbysum command, 16.7.3 
mlmatsum command, 16.7.1 , 16.7.3 
mlogit command, 18.4.1 
mlsum command, 16.7.1 , 16.7.3 
mlvecsum command, 16.7.1 , 16.7.3 


model selection, 28.2 , 28.2.8 


16.7.1 




based on predictive ability, 28.2 , 28.2.8 


Bayesian, 29.9 , 29.9.2 

best subsets, 28.2.8 

conservative model selection, 28.3.4 

consistent model selection, 28.3.4 

count-data models, 20.5.10 

cross-validation, 28.2.5 , 28.2.6 
information criteria, 28.2.3 
stepwise selection, 28.2.8 

mprobit command, 18.7.2 

mtebinary command, 25.5.5 

mtefe command, 25.5.5 

multicollinearity, 16.6.4 


multinomial outcome models, 18 , 18.13 


additive random-utility model, 18.2.5 , 18.7.3 , 18.8.3 
basic theory, 18.2.1 , 18.2.3 


18.6 


bivariate probit model, 18.11.1 , 18.11.1 


BLP market demand, 18.8.4 


clustered data, 18.10 , 18.10 


coefficient interpretation, 18.4.3 , 18.5.6 
conditional logit model, 18.5 , 18.5.10 
conditional versus multinomial logit, 18.5.5 
data example, 18.3 , 18.3.3 

FMM application, 23.3.3 , 23.3.3 
Geweke—Hajivassiliou-Keane simulator, 18.7.3 
independence of irrelevant alternatives, 18.6.1 
long-form data, 18.5.1 , 18.5.1 


maximum likelihood estimator, 18.2.2 

maximum simulated likelihood estimator, 18.7.3 , 18.7.3 
model comparison, 18.6.7 

multinomial logit model, 18.4 , 18.4.5 

multinomial probit model, 18.7 , 18.7.1 , 18.7.7 


multivariate outcomes, 18.11 , 18.11.2 
nested logit model, 18.6 , 18.6.7 


ordered logit model, 18.9.2 
ordered outcome models, 18.9 , 18.9.6 
ordered probit model, 18.9.2 
overview, 18.2 , 18.2.6 


prediction for new alternative, 18.5.7 
probabilities, 18.2.1 
random-parameters logit, 18.8 , 18.8.4 
balanced panel, 19.11.2 , 19.11.2 
nonlinear models, 28.8.3 
pca command, 28.5.1 
poisson command, 20.2 
Poisson model, see count-data models 
poivregress command, 28.9.3 
population-averaged model, see panel data 
poregress command, 28.8.4 
posterior distribution, 29.3.1 , 29.3.1 
probit command, 17.3.2 
probit model, see binary outcome models 
probitfe command, 22.4.8 
programs 
checking program, 16.6 , 16.6.6 
propensity score, see also treatment effects 
definition, 24.5 
spatial HAC standard errors, 26.6.1 , 26.6.1 
svmachines command, 28.6.7 
panel-data estimators, 22.5 , 22.5.5 
vce(unconditional) option, 23.7.1 , 25.3.1 , 25.4.1 
xtbalance command, 19.11.2 
xtcloglog command, 22.4 